Scheduled Ethernet Fabric for Large-scale AI Training Clusters
Pengfei Huo, Senior Network Architect, ByteDance
Rajasekar Jegannathan, Lead, AI Infrastructure
Oozie Parizer, Senior Director, Product Marketing, Broadcom
ARTIFICIAL INTELLIGENCE (AI) NETWORKING

Agenda
- Challenges in AI training fabric
- How Scheduled Ethernet Fabric works
- Benchmark Results
- Call for Actions

Challenges in AI Networks

Workload characteristics:
- Small number of flows
- High bandwidth per flow
- GPUs drive high bandwidth
- High demand for network throughput/utilization
- Job concurrency
- Unexpected network failures
- Topology changes

Challenges to the network:
- Hash polarization
- Uneven link utilization
- Out-of-order delivery
- HOL blocking
- Congestion hot spots
- PFC propagation
- Crosstalk between jobs
- Slow failure detection/failover
- Failovers create congestion

Perfect Load Balancing
- Equal spraying over all links of the fabric, independent of flow size
- Uniform link utilization avoids hot spots
- Consistent high performance at all network loads
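The contrast between hash-based ECMP and equal spraying can be made concrete with a small simulation. The sketch below is illustrative only: the number of uplinks, the flow count, and the per-flow bandwidth are assumptions, not figures from the deck. With a handful of elephant flows, ECMP's per-flow hashing tends to polarize traffic onto a few links, while per-packet (or per-cell) spraying keeps every link at the same load.

```python
# Toy comparison of ECMP flow hashing vs. per-packet spraying.
# Illustrative sketch only: link count, flow count and flow size are assumed,
# not taken from the benchmark described in this deck.
import random
from collections import Counter

NUM_LINKS = 8       # uplinks from a leaf (assumed)
NUM_FLOWS = 6       # "small number of flows, high bandwidth per flow"
FLOW_GBPS = 100     # per-flow bandwidth (assumed)

random.seed(7)
# Each flow gets one hash value; under ECMP every packet of the flow follows
# the same hash-selected uplink.
flows = [(random.getrandbits(32), FLOW_GBPS) for _ in range(NUM_FLOWS)]

ecmp_load = Counter()
for flow_hash, gbps in flows:
    ecmp_load[flow_hash % NUM_LINKS] += gbps   # elephant flows can pile onto one link

# Equal spraying: traffic is spread evenly over all links regardless of flow size.
total_gbps = sum(g for _, g in flows)
spray_load = {link: total_gbps / NUM_LINKS for link in range(NUM_LINKS)}

print("ECMP per-link load (Gbps): ", [ecmp_load.get(l, 0) for l in range(NUM_LINKS)])
print("Spray per-link load (Gbps):", [round(spray_load[l], 1) for l in range(NUM_LINKS)])
# With 6 flows over 8 links, ECMP always leaves at least two links idle and,
# with collisions, loads some links at 200-300 Gbps, while spraying keeps
# every link at the same 75 Gbps.
```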
Congestion-Free Operation
- End-to-end scheduled fabric
- No PFC propagation
- Isolation of "slow receivers": no HOL blocking
- Excels regardless of the workload pattern
- Works "out of the box" regardless of the type or performance of endpoints
- Native multi-tenancy support
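The deck does not spell out the scheduling mechanism behind these claims. The toy model below only illustrates the general idea of an end-to-end scheduled, VOQ-style fabric, in which an ingress transmits toward an egress only up to the credit that egress grants; under that assumption a slow receiver backs up its own virtual output queue rather than pausing shared links, which is one way to read the "no HOL blocking" and "no PFC propagation" bullets. All names and rates here are invented for illustration and do not describe any specific product.

```python
# Conceptual toy model of credit-based, end-to-end scheduling with VOQs.
# Invented example: not a description of any specific switch or fabric.
from collections import deque

class Egress:
    def __init__(self, name, drain_per_tick):
        self.name = name
        self.drain_per_tick = drain_per_tick   # cells it can accept per tick

class Ingress:
    def __init__(self):
        # One virtual output queue (VOQ) per destination egress:
        # a congested destination only grows its own queue.
        self.voqs = {}

    def enqueue(self, egress, cell):
        self.voqs.setdefault(egress.name, deque()).append(cell)

    def tick(self, egresses):
        sent = {}
        for eg in egresses:
            q = self.voqs.get(eg.name, deque())
            credits = eg.drain_per_tick        # egress grants credits at its drain rate
            n = min(credits, len(q))           # send only what was granted
            for _ in range(n):
                q.popleft()
            sent[eg.name] = n
        return sent

fast = Egress("fast-receiver", drain_per_tick=4)
slow = Egress("slow-receiver", drain_per_tick=1)
ingress = Ingress()
for i in range(40):                            # offer equal load toward both receivers
    ingress.enqueue(fast, i)
    ingress.enqueue(slow, i)

for t in range(5):
    print(f"tick {t}:", ingress.tick([fast, slow]),
          "backlog:", {k: len(v) for k, v in ingress.voqs.items()})
# Traffic to the fast receiver drains at full rate every tick, while the slow
# receiver's backlog stays confined to its own VOQ: no HOL blocking, and no
# pause is ever applied to links shared with other traffic.
```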
Zero Impact Failover (ZIF)
- Self-healing fabric: hardware-based failure detection and recovery
- Linear, predictable change in performance with link failures
- Non-scheduled fabrics may show unpredictable, greater-than-linear degradation with long convergence times
- Fewer checkpoints → shorter Job Completion Time (JCT)

Network Topology for Benchmark Test
- 1:1 bandwidth ratio for leaf to GPU and to spine
- 2:1 oversubscription for leaf to GPU and to spine
[Diagram: two test pods of 128 GPUs each (GPU-1 through GPU-128) connected with 200G links, one labeled Scheduled Fabric and the other Typical Ethernet fabric]
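As a quick sanity check on what the 1:1 and 2:1 figures imply, the arithmetic below works out per-GPU cross-leaf bandwidth for a hypothetical leaf serving 32 GPUs at 200G each; the per-leaf GPU count is an assumption for illustration, not a number taken from the benchmark setup.

```python
# Back-of-the-envelope oversubscription math for a single leaf switch.
# Assumption (not from the deck): each leaf serves 32 GPUs at 200 Gb/s each.
GPUS_PER_LEAF = 32
GPU_LINK_GBPS = 200

downlink_gbps = GPUS_PER_LEAF * GPU_LINK_GBPS            # traffic the GPUs can offer
for label, ratio in [("1:1 (non-oversubscribed)", 1.0),
                     ("2:1 oversubscribed", 2.0)]:
    uplink_gbps = downlink_gbps / ratio                  # leaf-to-spine capacity
    print(f"{label}: downlink {downlink_gbps} Gb/s, uplink {uplink_gbps:.0f} Gb/s, "
          f"worst-case cross-leaf share per GPU {uplink_gbps / GPUS_PER_LEAF:.0f} Gb/s")
# 1:1 gives every GPU its full 200 Gb/s across the fabric; 2:1 halves the
# guaranteed cross-leaf bandwidth to 100 Gb/s per GPU when every GPU sends
# off-leaf at once.
```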