Scaling RDMA Networks for AI Training with RoCEv2: Challenges and Opportunities

Adi Gangidi, Production Network Engineer, AI Systems, Meta
Rohit Puri, Software Engineer, AI Network Systems, Meta

Agenda
- Meta RDMA deployment overview
- Operational challenges
- Solutions
- Challenges we continue to solve

Our Production RDMA Network Is Unique

Commodity Ethernet infrastructure
- Multi-hop network built from commodity network switches.

Native for AI workloads
- Our AI training workloads use network + compute as a single large system.
- Network primitives are an integral part of the overall application pipeline.
RoCEv2, lossless transport
- We use RoCEv2 (RDMA over Converged Ethernet, the routable version) transport in its native behavior.
- We configured the network to be lossless using PFC/DCQCN; a sketch of the DCQCN rate-control loop follows this list.

Large deployment scale
- Thousands of endpoints in a single fabric.
- Many such network fabric instances.
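For readers unfamiliar with DCQCN, the following is a minimal, illustrative Python sketch of the standard reaction-point (sender-side) update from the published DCQCN algorithm. It is not Meta's implementation; in production this logic runs inside the RDMA NIC, and the parameter values (g, the additive-increase step, the line rate) are placeholders.

```python
# Illustrative sketch of DCQCN reaction-point rate control.
# Real DCQCN runs in NIC firmware; values below are placeholders.

class DcqcnSender:
    def __init__(self, line_rate_gbps, g=1 / 256, rate_ai_gbps=5.0, min_rate_gbps=0.1):
        self.rc = line_rate_gbps      # current sending rate
        self.rt = line_rate_gbps      # target rate used during recovery
        self.alpha = 1.0              # running estimate of congestion severity
        self.g = g                    # EWMA gain for alpha
        self.rate_ai = rate_ai_gbps   # additive-increase step
        self.min_rate = min_rate_gbps

    def on_cnp(self):
        """A CNP arrived (receiver saw ECN-marked packets): cut the rate."""
        self.rt = self.rc
        self.alpha = (1 - self.g) * self.alpha + self.g
        self.rc = max(self.min_rate, self.rc * (1 - self.alpha / 2))

    def on_timer_no_cnp(self):
        """No CNP in the last window: decay alpha and recover toward the target."""
        self.alpha = (1 - self.g) * self.alpha
        self.rt += self.rate_ai            # additive increase of the target rate
        self.rc = (self.rt + self.rc) / 2  # move the current rate toward the target


sender = DcqcnSender(line_rate_gbps=200)
sender.on_cnp()            # congestion signalled: rate drops below line rate
sender.on_timer_no_cnp()   # quiet window: rate climbs back toward the target
```

The point for the deployment described here is the division of labor: ECN marks (relayed to senders as CNPs) throttle traffic before switch queues overflow, while PFC acts as a backstop that pauses rather than drops packets, which is what keeps the fabric lossless.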
Overview of Our RDMA Deployment

Workloads/Collectives
- GPU-based communication primitives.
- All-to-All and All-Reduce (see the sketch after this list).

Topology
- 2-stage Clos fabric.
- Commodity Ethernet switches.

Routing
- Static routing, with ECMP fallback during failures.
Transport
- RoCEv2.
- Congestion control (DCQCN): Priority Flow Control (PFC) + Explicit Congestion Notification (ECN).
- PFC helps us achieve a "lossless fabric".

End-points
- Built with Open Compute form-factor components (ZionEX).
- Wedge400 ToR switch for in-rack RDMA switching.
- Each ZionEX server: 8x GPUs and 8x NICs.
- Within a server: shared-memory fabric; across servers: RDMA fabric.
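As a point of reference for the collectives listed above (and not the production training code), the same All-Reduce and All-to-All primitives can be driven from torch.distributed with the NCCL backend, which uses RoCE/RDMA transport when the fabric provides it. The rendezvous setup, tensor sizes, and one-process-per-GPU assumption below are illustrative placeholders.

```python
# Illustrative torch.distributed program exercising the two collectives named
# on the slide: All-Reduce and All-to-All. Tensor sizes are placeholders.
import os

import torch
import torch.distributed as dist


def main():
    # Typically launched with torchrun, which sets RANK/WORLD_SIZE/LOCAL_RANK
    # and the rendezvous address used by init_process_group.
    dist.init_process_group(backend="nccl")
    rank = dist.get_rank()
    world = dist.get_world_size()
    device = torch.device("cuda", int(os.environ.get("LOCAL_RANK", 0)))
    torch.cuda.set_device(device)

    # All-Reduce: sum a (dummy) gradient tensor across all ranks.
    grads = torch.ones(1024, device=device) * rank
    dist.all_reduce(grads, op=dist.ReduceOp.SUM)

    # All-to-All: every rank sends an equal slice to every other rank.
    send = torch.arange(world * 4, dtype=torch.float32, device=device)
    recv = torch.empty_like(send)
    dist.all_to_all_single(recv, send)

    dist.destroy_process_group()


if __name__ == "__main__":
    main()
```

Launched with something like `torchrun --nproc_per_node=8 demo.py` on each host (the script name is a placeholder), every rank drives one GPU/NIC pair, mirroring the 8-GPU/8-NIC ZionEX layout above.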
RDMA Network Integral to the Application Pipeline

[Figure: rough timeline of a single training iteration as seen from a single end-point, with Compute #1..#n+2 interleaved with All-to-All #1..#n and All-Reduce #1..#n+1 across steps X through X+5. Not to scale; does not represent a trace of our workloads.]

Network Collec