PCI Express correctable errors handling (RAS) solution implementation considerations in Meta's AI/ML Training Clusters

PCI Exp Error Handling (RAS) Considerations
Carlos Fernandez, Hardware Systems Engineer, Meta
Anil Agrawal, Hardware Systems Engineer, Meta

SUSTAINABLE SCALABLE COMPUTATIONAL INFRASTRUCTURE
HW MGMT

PCIe Error Handling - Abstract
Meta's AI/ML Training Clusters are built using a large number of PCIe devices, including GPUs, NICs, NVMe storage, and PCIe switches. It is important to implement a robust fault handling (RAS) solution within this PCIe device hierarchy to ensure target uptime, availability, and serviceability objectives. A high rate of PCIe correctable errors is expected. In this presentation, we would like to share our learnings and an innovative solution we developed to manage such large-scale PCIe correctable errors within Meta AI/ML training clusters.

AI/ML Training Cluster 30K ft view

AI/ML Training Cluster - Overview
Reference: https:/

Teton Training - Overview
Reference:

Teton Training - Platform View
OAM: OCP Accelerator Module

Grand Teton Training Platform - PCIe Hierarchy

A Large PCIe Device Hierarchy Increased PCIe Correctable Errors

B:D.F root_port, slot#, device present, power: On, speed 32GT/s, width x16
  B:D.F endpoint, CPU-NIC
B:D.F root_port, slot#, device present, power: On, speed 32GT/s, width x16
  B:D.F upstream_port, PCIe Gen 5 Switch
    B:D.F downstream_port, slot#, device present, speed 32GT/s, width x16
      B:D.F endpoint, IOX-NIC
    B:D.F downstream_port, slot#, device present, speed 8GT/s, width x4
      B:D.F endpoint, current speed 8GT/s target speed 32GT/s.
    B:D.F downstream_port, slot#, device present, speed 8GT/s, width x4
      B:D.F endpoint, IOX-SSD, current speed 8GT/s target speed 16GT/s
    B:D.F downstream_