Panel Discussion: Advances in Hardware Fault Management for Reliability at Scale (PDF, 23 pages)
Hardware Fault Management - Workstreams Update - HWFM, FMFM, RAS API
Yogesh Varma, Drew Walton (Microsoft), Vilas Sridharan (AMD), Carlos Vallin (Microsoft), Shubhada Pugaonkar (Intel), Anil Agrawal (Meta)

Hardware Management / HW Fault MGMT

Panel on RAS API, FMFM and HWFM
- Yogesh Varma, Co-Lead OCP: RAS API, HWFM and FMFM
- Vilas Sridharan, Senior Fellow, AMD
- Carlos Vallin, Principal Engineer, Microsoft
- Shubhada Pugaonkar, Principal Engineer, Intel
- Anil Agrawal, RAS Lead, Meta
- Drew Walton, Principal Engineer, Microsoft

Si and Data Center Reliability: OCP Initiatives and Future
Yogesh Varma, PhD, Co-Lead OCP: RAS API, HWFM and FMFM

Facets of Fleet HW Fault Mgmt
- Scalability
- Vendor Agnostic
- Si Agnostic
- Observability
- Testability
- RAS Feature Rollout
- Standardized Logging
- In-band and OOB Support
- Serviceability
- Discoverability
- Configurability
- Debuggability
Calls for a Holistic DC Reliability Framework

OCP Hardware Fault Management
Today: Standard framework for error classification, logging formats, signaling interface, handling actions and RAS configuration
Related OCP Activities:
- Datacenter Debug and Diagnostic
- CPU/GPU and System RAS and Manageability
- Cloud Infrastructure Management
- Si Fault Detection and Mitigation (SDC)
Future: an OCP Standard End-to-End Data Center Reliability Framework
Industry is aligning with OCP Reliability Initiatives. Join us!

Hardware Fault Management
Create a standard scalable fault handling framework by collaborating with stakeholders to build a shared knowledge-base and by enhancing exist