Helping Customers Build Stable and Reliable Servers: Xeon RAS Features
Fan Feifei (范飞飞), Customer Solutions Technical Engineer, Intel

What is RAS
Reliability: System ability to ensure data integrity. System capability to detect errors, correct errors, and report errors.
Availability: System capability to maintain "service availability" in the presence of system faults. Capability to map out failed units; ability to operate in a degraded mode.
Serviceability: System capability to effectively report a failure with the precise location of the faulty component, to expedite servicing efforts.

Fault Handling (Four Pillars of RAS)
Fault Tolerance: 1. Avoidance 2. Detection 3. Correction
Fault Management: 4. Reconfiguration
E.g., Failed DIMM Isolation

RAS Enabling Through the Whole System (HW + FW + SW)
Diagnosability/Serviceability: Minimizing Downtime
Error Logging Through HW
FW-based Fault Management
OS-based Fault Management
System Stack: HW (e.g., CPU, Memory), FW (Intel), FW (OEM/IBV/BMC), OS/VMM, Application
Error Signaling / Polling
E.g., Memory SDDC
System Reliability: Extending the Uptime
Fault Avoidance, Detection, and Correction in HW
Fault Correction Through FW
Fault Recovery Through OS
Fault Recovery at App Layer

RAS Enabling Framework
[Diagram: error flow from the hardware platform up to user space. Labels recoverable from the figure:
Hardware Platform (*Error*): MSMI, CSMI, CMCI, MCE, CDC, Legacy MCA, EMCA2
Firmware: APEI/GHES
Linux Kernel: do_machine_check, threshold_interrupt, memory failure, soft offline page, edac, cec, timer
User space: mcelog daemon, CE page threshold, CE over threshold
Virtualization: KVM, QEMU-KVM, VM]
UCE: Uncorrected Error
CE: Corrected Error
CDC: Corrupt Data Containment

MCA Recovery
MCA recovery is an Intel RAS feature that involves Silicon Hard