2024 OpenAnolis Conference (龙蜥大会), Intel sub-forum: Introduction to Xeon Processor RAS Features, by Fan Feifei (范飞飞). PDF, 8 pages.

Helping Customers Build Stable and Reliable Servers: Xeon RAS Features
Fan Feifei (范飞飞), Customer Solutions Technical Engineer, Intel

What is RAS

Reliability
  System ability to ensure data integrity.
  System capability to detect errors, correct errors, and report errors.
Availability
  System capability to maintain "service availability" in the presence of system faults.
  Capability to map out failed units; ability to operate in a degraded mode.
Serviceability
  System capability to effectively report a failure with the precise location of the faulty component, to expedite servicing.

Fault Handling (Four Pillars of RAS)

Fault Tolerance: 1. Avoidance 2. Detection 3. Correction
Fault Management: 4. Reconfiguration (e.g., Failed DIMM Isolation)

RAS Enabling Through the Whole System (HW+FW+SW)

Diagnosability/Serviceability: Minimizing Downtime
  Error Logging Through HW
  FW-based Fault Management
  OS-based Fault Management
System Stack: HW (e.g., CPU, Memory), FW (Intel), FW (OEM/IBV/BMC), OS/VMM, Application
Error Signaling / Polling
E.g., Memory SDDC
System Reliability: Extending the Uptime
  Fault Avoidance, Detection, and Correction in HW
  Fault Correction Through FW
  Fault Recovery Through OS
  Fault Recovery at App Layer

RAS Enabling Framework

[Diagram; layout not recoverable. Labels by layer:]
Hardware Platform (*Error*): MSMI, CSMI, CMCI, MCE
Firmware: EMCA2, Legacy MCA, APEI/GHES
Linux Kernel: do_machine_check, threshold_interrupt, timer, memory failure, Soft offline page, edac, cec, CDC
User space: mcelog daemon (CE page threshold, CE over threshold)
Virtualization: KVM, QEMU-KVM, VM
Legend: UCE: Uncorrected Error; CE: Corrected Error; CDC: Corrupt Data Containment

MCA Recovery

MCA recovery is an Intel RAS feature that involves Silicon Hard
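
The "CE page threshold" and "Soft offline page" labels in the RAS Enabling Framework diagram suggest a policy in which corrected errors are counted per memory page and a page is retired once its count crosses a threshold. The following is a minimal Python sketch of that idea only. It is not the mcelog or kernel implementation: the class name `PageErrorTracker`, the default threshold of 10, and the 4 KiB page size are all assumptions made for illustration, and a real daemon would typically also age counts out over a time window.

```python
from collections import defaultdict

PAGE_SIZE = 4096  # assumption: 4 KiB pages


class PageErrorTracker:
    """Hypothetical model of per-page corrected-error (CE) accounting."""

    def __init__(self, ce_page_threshold=10):
        # Threshold value is illustrative, not taken from the slides.
        self.ce_page_threshold = ce_page_threshold
        self.ce_counts = defaultdict(int)  # page frame number -> CE count
        self.offlined = set()              # pages already marked for offlining

    def record_ce(self, phys_addr):
        """Record one CE at phys_addr.

        Returns True only on the call where the page reaches the threshold,
        which is the point where a soft-offline request would be issued.
        """
        page = phys_addr // PAGE_SIZE
        if page in self.offlined:
            return False  # already retired; nothing more to do
        self.ce_counts[page] += 1
        if self.ce_counts[page] >= self.ce_page_threshold:
            self.offlined.add(page)
            return True
        return False


if __name__ == "__main__":
    tracker = PageErrorTracker(ce_page_threshold=3)
    # Three corrected errors that all fall inside the same 4 KiB page.
    for offset in range(3):
        crossed = tracker.record_ce(0x1000 + offset)
        print(hex(0x1000 + offset), "offline requested:", crossed)
```

Keeping the count per page rather than per DIMM lets the policy retire only the affected page while the rest of the memory stays in service, which matches the "operate in a degraded mode" goal stated under Availability.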