OCP Data Center Diagnostics and Debug Workstream Introduction

Marko Bartscherer, Intel
Enrico Carrieri, Intel

Problem Statement and Motivation

With the rise of large AI systems, nodes and components are becoming more diverse and increasing in size and complexity.

- Components within a node may come from multiple suppliers
- Very complex GPU silicon (2 kW TDP)
- Nodes with 50+ complex components
- Clusters with thousands of nodes working on a single job

[Figure: nodes built from CPUs, PCIe switches, PCIe retimers, GPUs, CEM card NICs, CEM card GPUs, and EDSFF NVMe drives, with components supplied by Vendor 1 through Vendor 5, connected through switches to a management server.]