IN-DEPTH ANALYSIS OF THE PERFORMANCE FOR GPT-3
颜子杰, NVIDIA

LLM TRAINING TECHNIQUES

LARGER MODEL IS THE TREND

CHALLENGES FOR TRAINING LARGE MODEL
High Compute Costs
Compute cost: a lower bound on the FLOPs of each iteration (see https://arxiv.org/abs/2104.04473) is

    96 · B · S · l · h² · (1 + S/(6h) + V/(16·l·h))

where B = batch size, S = sequence length, l = number of transformer layers, h = hidden size, and V = vocabulary size.
For GPT-3 this amounts to about 2150 ZettaFLOPs (175B parameters trained on 1.5T tokens; 1 ZettaFLOP = 10^21 FLOPs). On 128 DGX A100 systems, that is roughly 120-170 days of training at about 50% computing efficiency.
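As a quick sanity check, here is a minimal sketch that plugs the GPT-3 175B configuration into this bound. The sequence length S = 2048 and vocabulary size V = 50257 are the published GPT-3 values, and the 312 TFLOPS BF16 peak per A100 used for the wall-clock estimate is an assumption; none of these appear on the slide itself.

```python
# Sanity check of the per-iteration FLOPs lower bound from
# https://arxiv.org/abs/2104.04473, applied to GPT-3 175B.

def flops_per_iteration(B, S, l, h, V):
    """96 * B * S * l * h^2 * (1 + S/(6h) + V/(16*l*h))."""
    return 96 * B * S * l * h**2 * (1 + S / (6 * h) + V / (16 * l * h))

S, l, h, V = 2048, 96, 12288, 50257   # GPT-3 175B config (assumed values)
tokens = 1.5e12                        # training tokens, from the slide

# B*S tokens are processed per iteration, so divide by S (with B=1)
# to get the cost per token, then scale to the full token budget.
total = flops_per_iteration(B=1, S=S, l=l, h=h, V=V) / S * tokens
print(f"{total / 1e21:.0f} ZettaFLOPs")   # ~2151 (slide: 2150)

# Rough wall clock: 128 DGX A100 = 1024 GPUs, 312 TFLOPS peak (BF16),
# about 50% efficiency.
days = total / (1024 * 312e12 * 0.5) / 86400
print(f"~{days:.0f} days")                # ~156 days
```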
CHALLENGES FOR TRAINING LARGE MODEL
High Memory Costs
Memory costs (mixed precision, naive implementation). Model states (total: 3.5 TB):
- Parameter: 350 GB (175B × 2 bytes)
- Gradient: 350 GB
- Optimizer: 2800 GB
- Activation: ?
The model cannot fit in a single GPU, or even a single GPU server (3.5 TB of model states vs. 80 GB per A100). Model parallelism across multiple nodes is a MUST.
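The 3.5 TB total can be reproduced with simple per-parameter arithmetic. The 16-bytes-per-parameter optimizer breakdown below (FP32 master weights, an FP32 gradient copy, and Adam's two moment buffers) is an assumption chosen because it matches the slide's 2800 GB; the slide does not spell the breakdown out.

```python
# Back-of-the-envelope check of the model-states memory for naive
# mixed-precision Adam training of GPT-3 175B (activations not counted).

N = 175e9                          # parameter count

param_bytes = 2 * N                # FP16 weights                -> 350 GB
grad_bytes  = 2 * N                # FP16 gradients              -> 350 GB
# Assumed optimizer layout: FP32 master weights + FP32 gradient copy
# + Adam momentum + Adam variance = 16 bytes per parameter.
optim_bytes = (4 + 4 + 4 + 4) * N  #                             -> 2800 GB

total = param_bytes + grad_bytes + optim_bytes
print(f"{total / 1e12:.1f} TB")    # 3.5 TB of model states
```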
CHALLENGES FOR TRAINING LARGE MODEL
- The model cannot fit in a single GPU, or even a single GPU server (3.5 TB of model states vs. 80 GB per A100).
- Extremely large compute demand: about 16K A100·days of computing (not considering efficiency).
What we need:
- An efficient framework with model parallelism
- Careful co-design of software and system
NeMo AND MEGATRON
NeMo and Megatron-LM are NVIDIA's frameworks for efficiently training the world's largest transformer-based models:
- Train transformer models with billions of parameters
- Achieve high utilization and scaling to thousands of GPUs

OVERVIEW OF LARGE TRANSFORMER TRAINING TECHNIQUES
Parallelisms:
- Pipeline Parallelism
- Tensor Parallelism
- Sequence Parallelism
- Expert Parallelism
Memory optimizations:
- Distributed optimizer (DeepSpeed ZeRO-1)
- Checkpoint activations (see the sketch after this list)
- Selective activation checkpointing
Others:
- FP16/BF16 training, optimized kernels, etc.
- Communication overlapping for PP and TP
(On the original slides, blue marks Megatron v2 features and green marks Megatron v3 new features.)
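To make one of these techniques concrete, below is a minimal activation-checkpointing sketch using stock PyTorch, not Megatron-LM's implementation. A checkpointed layer stores only its input during the forward pass and recomputes its internal activations during backward, trading extra compute for memory; the `Block` module is a hypothetical stand-in for a transformer layer.

```python
import torch
import torch.nn as nn
from torch.utils.checkpoint import checkpoint

class Block(nn.Module):
    """Hypothetical stand-in for a transformer layer."""
    def __init__(self, hidden):
        super().__init__()
        self.ff = nn.Sequential(nn.Linear(hidden, 4 * hidden),
                                nn.GELU(),
                                nn.Linear(4 * hidden, hidden))

    def forward(self, x):
        return x + self.ff(x)

class Model(nn.Module):
    def __init__(self, hidden=512, layers=8, ckpt=True):
        super().__init__()
        self.blocks = nn.ModuleList(Block(hidden) for _ in range(layers))
        self.ckpt = ckpt

    def forward(self, x):
        for block in self.blocks:
            if self.ckpt and self.training:
                # Store only the block input; recompute the block's
                # intermediate activations during the backward pass.
                x = checkpoint(block, x, use_reentrant=False)
            else:
                x = block(x)
        return x

model = Model()
loss = model(torch.randn(4, 128, 512)).square().mean()
loss.backward()  # each Block's forward runs again here to rebuild activations
```

Selective activation checkpointing refines this idea: instead of recomputing whole layers, it recomputes only the memory-heavy but cheap-to-recompute parts of each layer.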
THE DISTRIBUTED TRAINING OF GPT-3 MODEL
Model Parallelism