Cluster-less Validation of GPUs and Networking
Accelerating AI Hardware NPI

Matt Bergeron, Meta
Arvind Srinivasan, Meta
Ashutosh Kumar, Keysight

NETWORKING

Recap from OCP 2023
- All-to-All performance less than roofline
  - Microbursts, buffering, CCL
- Large, complex testbed

How can we validate AI System NPI without building a cluster?
- Reduce the SUT
- Surround the SUT with mocks and simplify the CCL

There has to be a better way.
- Mock testers emulate DUT CCL
- Duck test: If it walks like a duck and it quacks like a duck, then it must be a duck
- Heterogeneous cluster validation, with reduced SUT

One-arm AI cluster validation
- 128 to 384 MiB all-to-all
  - Weak scaling: 16 MiB
- Collective completion time (CCT) is less than expected
  - 8-rank: 90% roofline
  - 24-rank: 97% roofline
- Scales with rank count. Why?

Application: CCT scaling
- 24 ranks (23 physical mocks), 400 Gbps
- Each rank should be sending 17.4 Gbps (roofline)
- KAI reports CCT of 7.88 ms: 17.0 Gbps (97% roofline)

Flowlines
- Bursty, but about right
- Buffering is ok

CDF
- Delta between min and max FCT
- Want a vertical line
- CCT is last flow

Virtual ranks reduce the testbed
- SUT, DUT

Keep scaling
- Trend continues; 98% roofline
- Outlier from NIC. Why?
- One-arm virtual ranks enable validation of an arbitrary cluster
  - of an arbitrary size

The cluster
- 24 ranks: 7.88 ms CCT
- 256 ranks: 87.5 ms CCT

The outlier
- Possibly unfairness in NIC scheduler
- To be continued.

Cluster-less validation
- One-arm approach allows left-shifting NPI
- Early performance signal for CCL, E2E
- GPU/NIC/AI accelerator experiences the same load and stress
  - Compute cores
  - HBM
  - Addresses
  - Network flows and packets
  - Power

TL;DR
- CCL AI workloads and PyTorch
- NIC GPU metrics
- QP and flow metrics are good, but bufferin
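
The roofline figures quoted in the CCT scaling slides (17.4 Gbps per flow, 97% and 98% of roofline) can be reproduced with a short calculation. This is a minimal sketch, not the presenters' tooling: it assumes that the 400 Gbps link is shared evenly across the other `ranks - 1` peers, and that the 16 MiB weak-scaling size is the message each rank sends to each peer. The function names are hypothetical and chosen for illustration.

```python
MIB_BITS = 1024 * 1024 * 8  # bits in one MiB


def per_flow_roofline_gbps(link_gbps: float, ranks: int) -> float:
    """Ideal per-flow rate: link rate split evenly across (ranks - 1) peers (assumption)."""
    return link_gbps / (ranks - 1)


def achieved_gbps(message_mib: float, cct_ms: float) -> float:
    """Rate of one flow that moves message_mib in cct_ms (CCT is the last flow to finish)."""
    bits = message_mib * MIB_BITS
    return bits / (cct_ms * 1e-3) / 1e9


def roofline_fraction(link_gbps: float, ranks: int,
                      message_mib: float, cct_ms: float) -> float:
    """Achieved per-flow rate as a fraction of the per-flow roofline."""
    return achieved_gbps(message_mib, cct_ms) / per_flow_roofline_gbps(link_gbps, ranks)


if __name__ == "__main__":
    # (ranks, CCT in ms) pairs taken from the slides; 16 MiB per peer is an assumption.
    for ranks, cct_ms in [(24, 7.88), (256, 87.5)]:
        roof = per_flow_roofline_gbps(400.0, ranks)
        got = achieved_gbps(16.0, cct_ms)
        frac = roofline_fraction(400.0, ranks, 16.0, cct_ms)
        print(f"{ranks} ranks: roofline {roof:.2f} Gbps, "
              f"achieved {got:.2f} Gbps ({frac:.1%})")
```

Under these assumptions the 24-rank case gives a roofline of about 17.4 Gbps and an achieved rate of about 17.0 Gbps, matching the slide, and both the 24-rank and 256-rank cases land close to the quoted 97% to 98% of roofline.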