In Network Collective Acceleration for AI Fabrics
Surendra Anubolu - Broadcom
Nikhil Shetty - Oracle
Networking

Agenda
- Bandwidth and latency needs for AI fabric
- Challenges
- Proposed In Network Collectives for Ethernet fabric
- Share data: Tomahawk Ultra In Network collective performance
- Call to action: infrastructure, APIs

AI Fabric Bandwidth and latency challenges
- Collectives account for 90+% of the bandwidth
  - All Reduce
  - All to All
  - All Gather
  - Reduce Scatter
- Large model sizes: very high bandwidth
- Inference: low latency completions
- MoE: k of N multicast

AI workload traffic patterns
- Collectives consume most of the fabric bandwidth
- Tensor Parallel and Expert Parallel have communications that are exposed

Example collective data transfer usage

Parallelism          Collective          Fabric load
Tensor Parallel      AllReduce           50%
Expert Parallel      AllToAll, Gather    1 to 10%
Sequence Parallel    All Gather          30 to 40%
Data Parallel        All Reduce          5%
Pipeline Parallel    P2P                 0.2%

Why offload
- High-bandwidth communication is a major component of collectives
- Network switches have one to two orders of magnitude more fabric bandwidth than end points
- Predictable latency + tail latency
- Collectives require very little compute
  - 51 Tbps requires only 3 TFlops of BF16 adders
  - Some collectives like k of N do not require any compute
- Fabric is a natural place to accelerate collectives

In Network Collectives - offload

                   Compute        Bandwidth
GPU                1000 TFlops    400 G - 7 Tbps
Switch with INC    3 TFlops       50 Tbps

- Switch participates in the collective
- Offloads the collective compute such as all_reduce
- At the start of the job, INC Manager allocates switch compute resources and builds a tree
- Tree can be reused for multiple collectives
- INC manager can work with load sharing facility to reserve resources
- xCCL, libfabric and MPI plugins

Architecture fo
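
The "Why offload" figure of 51 Tbps needing only about 3 TFlops of BF16 adders can be checked with a short back-of-envelope calculation. The sketch below assumes one BF16 addition per element arriving at the switch, which is a simplification not stated in the slides. The function name `adder_tflops` is hypothetical and chosen for illustration only.

```python
# Back-of-envelope check: how much adder throughput does a switch need
# to reduce BF16 data arriving at full fabric bandwidth?
# Assumption (not from the slides): one addition per arriving element.

BF16_BITS = 16  # a BF16 value occupies 16 bits


def adder_tflops(fabric_tbps: float, bits_per_element: int = BF16_BITS) -> float:
    """Return the additions per second, in TFlops, needed to add every
    element that arrives at the given fabric bandwidth (in Tbps)."""
    elements_per_second = fabric_tbps * 1e12 / bits_per_element
    return elements_per_second / 1e12


if __name__ == "__main__":
    switch_tflops = adder_tflops(51.0)  # 51 Tbps, as quoted in the slides
    gpu_tflops = 1000.0                 # GPU compute figure quoted in the slides
    print(f"Adder throughput needed: {switch_tflops:.2f} TFlops")
    print(f"Share of one GPU's compute: {switch_tflops / gpu_tflops:.2%}")
```

Under that assumption the result is roughly 3.2 TFlops, consistent with the 3 TFlops quoted for the switch and well under 1% of the 1000 TFlops quoted for a single GPU.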