Meta AI/ML System Error Handling Improvements - PCIe Completion Timeout error handling using RPPIO error reporting
Anil Agrawal
Sukay Luhadia
ARTIFICIAL INTELLIGENCE (AI)

Agenda
  AI/ML Training Cluster Overview
  AI/ML Training Job interruptions - a challenge
  PCIe Completion Timeout Error - Diagnosis challenge
  RPPIO Error Reporting to address the challenge
  Call to Action

AI/ML Training Cluster - Overview

AI/ML Training Cluster 30K ft view

Grand Teton Training System - Overview
Reference: Teton Training Platform - Architecture
OAM: OCP Accelerator Module

Grand Teton Platform - PCIe Hierarchy Example
A Large PCIe Device Hierarchy: Increased Platform Failure Blast Radius

B:D.F root_port, slot#, device present, power: On, speed 32GT/s, width x16
  B:D.F endpoint, CPU-NIC
B:D.F root_port, slot#, device present, power: On, speed 32GT/s, width x16
  B:D.F upstream_port, PCIe Gen 5 Switch
    B:D.F downstream_port, slot#, device present, speed 32GT/s, width x16
      B:D.F endpoint, IOX-NIC
    B:D.F downstream_port, slot#, device present, speed 8GT/s, width x4
      B:D.F endpoint, current speed 8GT/s, target speed 32GT/s
    B:D.F downstream_port, slot#, device present, speed 8GT/s, width x4
      B:D.F endpoint, IOX-SSD, current speed 8GT/s, target speed 16GT/s
    B:D.F downstream_port, slot#, device present, speed 32GT/s, width x16
      B:D.F endpoint, GPU
    B:D.F downstream_port, speed 32GT/s, width x16
      B:D.F endpoint, PCIe Gen 5 Switch
    B:D.F downstream_port, speed 32GT/s, width x16
      B:D.F endpoint, PCIe Switch management endpoint
B:D.F root_port, slot#, device present, speed 32GT/s, width x16
  B:D.F upstream_port, PCIe Gen 5 Switch
    B:D.F downstream_port, slot#, device present, speed 32GT/s, width x16
      B:D.F endpoint, IOX-NIC2
    B:D.F downstream_port, slot#, device present, speed 8GT/s, widt