Alibaba Cloud | Worldwide Cloud Services Partner

Whale: A Unified Distributed Training Framework
Ang Wang
alibaba-PAI, Alibaba Cloud
15/12/2020
WWW.ALIBABACLOUD.COM

#page#
Motivation
Models are getting larger
[Charts: WebText validation perplexity per epoch for models of 345M, 775M, 2.5B, and 8.3B parameters; GPU memory (GB) of P4, P100, V100, and A100 devices.]
- Models are getting larger and more complex.
- Larger models lead to better results with lower validation perplexities.
- Model size grows far beyond the upgrading of hardware (a rough memory estimate follows below).
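To make the last point concrete, here is a back-of-the-envelope estimate (illustrative, not from the slides) of the training memory needed for the largest model in the chart, using the common mixed-precision Adam rule of thumb of roughly 16 bytes per parameter:

```python
# Rough training-memory estimate for an 8.3B-parameter model (activations excluded).
# Assumes mixed-precision Adam: fp16 weights + fp16 grads + fp32 master weights
# + fp32 Adam momentum + fp32 Adam variance = 2+2+4+4+4 = 16 bytes per parameter.
params = 8.3e9
bytes_per_param = 2 + 2 + 4 + 4 + 4

state_gib = params * bytes_per_param / 2**30
a100_gib = 40e9 / 2**30          # A100 40 GB device memory

print(f"model + optimizer state: ~{state_gib:.0f} GiB")   # ~124 GiB
print(f"one A100 (40 GB):        ~{a100_gib:.0f} GiB")    # ~37 GiB
```

Even before counting activations, the model state alone is roughly three times the memory of the newest single device, which is why upgrading hardware alone cannot keep up.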
#page#
Motivation
Data parallelism becomes less optimal for lots of distributed workloads
[Figure: distributing the training workload with data parallelism, gradients AllReduce between GPU0 and GPU1.]
- Data Parallelism (DP) is widely used in distributed training as it is simple and easy to implement.
- DP is not always optimal for every distributed training workload.
- It is necessary to find an efficient parallel strategy that can make full use of the resources and speed up the training (a minimal data-parallel sketch follows below).
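To make the figure concrete, here is a toy NumPy simulation of synchronous data parallelism (illustrative only, not Whale or slide code): each replica keeps a full copy of the weights, computes gradients on its own shard of the global batch, and an AllReduce-style gradient average keeps all replicas identical.

```python
import numpy as np

# Toy simulation of data parallelism on 2 "GPUs" (plain NumPy, no real devices):
# every replica holds a full copy of the weights, computes gradients on its own
# slice of the global batch, then the gradients are averaged (AllReduce) so all
# replicas apply the identical update and stay in sync.
rng = np.random.default_rng(0)
w_true = np.array([2.0, -3.0])
X = rng.normal(size=(64, 2))
y = X @ w_true

replicas = [np.zeros(2), np.zeros(2)]          # identical weight copies on GPU0 / GPU1
shards = np.split(np.arange(64), 2)            # each GPU sees half of the batch
lr = 0.1

for step in range(50):
    grads = []
    for w, idx in zip(replicas, shards):
        err = X[idx] @ w - y[idx]              # forward pass + loss residual on the local shard
        grads.append(2 * X[idx].T @ err / len(idx))
    g_avg = sum(grads) / len(grads)            # AllReduce: average gradients across GPUs
    replicas = [w - lr * g_avg for w in replicas]

print("replica 0 weights:", replicas[0])       # ~[ 2. -3.]
print("replica 1 weights:", replicas[1])       # identical, replicas stay synchronized
```

The scalability issues on the next slide come from the AllReduce step: its cost grows with the weight size, independent of how little computation those weights perform.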
#page#
Motivation
Data parallelism becomes less optimal for lots of distributed workloads
E.g. VGG16
- Some layers contribute most of the parameters but account for only a small proportion of the computation, such as the FC layers in VGG16 (a parameter count follows after these examples).
- The large weight size and long communication time lead to poor scalability.
E.g. BertLarge
- It is difficult to increase the batch size on a single GPU device due to the limitation of the GPU device memory capacity.
- It is hard to overlap computation with communication, which leads to poor scalability.
E.g. T5, GPT-3
- The model size is far larger than the memory size of a single GPU device.
- The model cannot be trained unless model parallelism is adopted.
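A quick parameter count (illustrative, not from the slides) backs up the VGG16 example; the layer shapes below are the standard VGG16 classifier dimensions:

```python
# Back-of-the-envelope check of the VGG16 claim: the three fully connected
# layers hold most of the parameters but very little of the computation.
fc_layers = {
    "fc6": (7 * 7 * 512, 4096),   # flattened conv feature map -> 4096
    "fc7": (4096, 4096),
    "fc8": (4096, 1000),          # ImageNet classes
}
fc_params = sum(i * o + o for i, o in fc_layers.values())
total_params = 138_357_544        # published VGG16 parameter count

print(f"FC parameters:  {fc_params / 1e6:.1f} M")           # ~123.6 M
print(f"share of total: {fc_params / total_params:.0%}")    # ~89%
```

Almost 90% of the weights sit in layers that account for well under 1% of the roughly 15 GFLOPs per forward pass, so under data parallelism the AllReduce time for these layers cannot be hidden behind their computation.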
#page#
Distributed Model Training Approach
[Figure: example placements across GPU0 and GPU1 for each approach, with gradients AllReduce shown for data parallelism.]
- Data Parallelism
- Pipeline Parallelism
- Operator Sharding
- Hybrid Parallelism

#page#
WhaleA