Optcast: Open-Source Aggregation Offloading for Distributed Deep Learning
Nariaki Tateiwa, Researcher, NTT (NIPPON TELEGRAPH AND TELEPHONE CORPORATION)
ARTIFICIAL INTELLIGENCE (AI)

Allreduce Communication Bottleneck in Distributed ML Workloads

Benefits of Aggregation Offloading
Offloading the aggregation operation to other resources can cut Allreduce data transfer roughly in half compared to Ring-Allreduce.

Limitations of Existing Offloading Tools
These tools lack support for Ethernet-based protocols such as RoCE, and they require specific hardware to operate.
Our Proposal: Optcast
Our tool can speed up Allreduce by 2x over Ring-Allreduce and supports RoCE/InfiniBand/AWS-EFA/Socket protocols on commodity hardware. Optcast is an open-source prototype. Give it a try!

Overview

Distributed Deep Learning Traffic
- Data Parallelism, the most common parallel strategy, syncs all processes after the communication phase.
- The Allreduce collective is executed in the communication phase (see the sketch after this list).
- Allreduce data size grows with the number of model parameters.
- AI models have expanded 1000x in 3 years.
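The sketch below illustrates where that Allreduce sits in a data-parallel step. It is a minimal illustration, not Optcast's implementation: it assumes PyTorch with an already-initialized process group (for example, NCCL over RoCE or InfiniBand) and averages gradients parameter by parameter; real frameworks bucket and overlap this work, but the total volume still scales with the parameter count.

```python
# Minimal sketch (not Optcast-specific): the communication phase of one
# data-parallel step, where every worker averages its gradients with an
# Allreduce. Assumes torch.distributed is already initialized (e.g., via
# torchrun with the NCCL backend over RoCE/InfiniBand).
import torch
import torch.distributed as dist

def communication_phase(model: torch.nn.Module) -> None:
    """Average gradients across all workers after the compute phase."""
    world_size = dist.get_world_size()
    for param in model.parameters():
        if param.grad is None:
            continue
        # Allreduce sums each gradient tensor across all workers in place...
        dist.all_reduce(param.grad, op=dist.ReduceOp.SUM)
        # ...and dividing by the worker count yields the mean gradient.
        param.grad /= world_size
```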
[Figure: Data Parallel Workload — each iteration runs Compute, Comm., and Sync. phases]

Model | #parameters | Total Allreduce size per comm. phase
GPT-3 | 1.75B | 6.5 GB*
PaLM | 540B | 2012.7 GB*
*Actually, Allreduce is executed in parallel by distributed worker groups.
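As a rough check of the table, the transfer volume can be estimated directly from the parameter count. This is a back-of-the-envelope sketch that assumes fp32 gradients (4 bytes per parameter) and a single Allreduce over all parameters per communication phase; the parameter counts are the ones listed above.

```python
# Back-of-the-envelope sketch: why Allreduce volume tracks model size.
# Assumes fp32 gradients (4 bytes per parameter) and one Allreduce over all
# parameters per communication phase; real jobs may use fp16/bf16 gradients
# and split the work across worker groups (see the * note above).
BYTES_PER_PARAM = 4  # fp32 gradient

def allreduce_bytes_per_phase(num_params: float) -> float:
    return num_params * BYTES_PER_PARAM

for name, params in [("GPT-3 (count as listed above)", 1.75e9), ("PaLM", 540e9)]:
    gib = allreduce_bytes_per_phase(params) / 2**30
    print(f"{name}: ~{gib:.1f} GiB per communication phase")
# Prints roughly 6.5 and 2011.7, consistent with the table's order of magnitude.
```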
Aggregation Offloading
Aggregation is performed on other computing resources (switch ASIC, server, FPGA, etc.) with well-pipelined transmission. This reduces Allreduce data transfer by about half compared to the conventional Ring-Allreduce (see the traffic sketch below).

[Figure: nodes 0-3 offload the aggregation 0+1+2+3 to another computing resource]
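The halving can be seen with a simple traffic model. The sketch below is a rough cost model, not a measurement: with Ring-Allreduce each worker sends about 2(N-1)/N times the gradient size S, while with an external aggregator it sends its gradients once and receives the reduced result once, about S in each direction.

```python
# Rough traffic model (a sketch, not a benchmark): bytes each worker sends
# for one Allreduce of S bytes.
def ring_allreduce_tx(size_bytes: float, num_workers: int) -> float:
    # Ring-Allreduce: reduce-scatter + all-gather, 2*(N-1)/N * S per worker.
    return 2 * (num_workers - 1) / num_workers * size_bytes

def offloaded_allreduce_tx(size_bytes: float) -> float:
    # Offloaded aggregation: send gradients once to the aggregator
    # (and receive the reduced result once), about S per direction.
    return size_bytes

S = 6.5 * 2**30  # example: 6.5 GiB of gradients per communication phase
for n in (4, 16, 64):
    ratio = offloaded_allreduce_tx(S) / ring_allreduce_tx(S, n)
    print(f"N={n}: offload/ring transmit ratio = {ratio:.2f}")
# The ratio approaches 0.5 as N grows, i.e. roughly half the data transfer.
```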
6、the conventional Ring-Allreduce.nExisting technologiesSHARP(Offloading to Switch ASIC)Reduction server of Google Vertex AI(Offloading to Servers)Aggregation OffloadingSHARPReduction Server33ProcessSwitch00+11+22+33ProcessServer00+11+22+ServerServerThey support only certain transport protocols,such a