STRONGHOLD: Fast and Affordable Billion-scale Deep Learning Model Training
王玮, NLP Algorithm Expert, DAMO Academy, 2022/07/30

Foundation Models
- ML homogenizes learning algorithms (e.g., logistic regression); DL homogenizes model architectures (e.g., CNN)
- Foundation models homogenize the model itself (e.g., BERT, GPT-3)
- Figure from On the Opportunities and Risks of Foundation Models, https://arxiv.org/abs/2108.07258

Foundation Models: Training + Adaptation
- Pretrained on broad unannotated (multimodal) data at scale in a self-supervised way
- Adapted to a wide range of downstream tasks via fine-tuning
- "One is All"
- Figure from On the Opportunities and Risks of Foundation Models, https://arxiv.org/abs/2108.07258
Model Size vs. HW Capacity
- Transformer model size: ~2*10^4x per 5 years
- GPU memory: ~6x per 5 years
- We need more GPUs!
- [Figure: model parameters (B) vs. release date (2017-2022) for dense models (BERT-base/large, GPT, GPT-2, GPT-3, Megatron-LM, T-NLG, Megatron-Turing-NLG, T5-base/large/3B/11B/XXL, ALBERT, RoBERTa-large) and sparse models (Zhiyuan-Wudao2.0, Ali-M6, KUAIMODEL, GShard, Switch-base/large/XXL/C), against accelerator memory: P100 (12GB), TPU V2 (16GB), V100 (32GB), TPU V3 (32GB), A100 (40GB), A100 (80GB)]

Data parallelism
- Distribute data across processors
- Processed in parallel, and parameters are updated synchronously
- Communication happens at the all-reduce operations to sum the gradients from all processors
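To make the synchronous update concrete, here is a minimal sketch (not taken from the talk) of data-parallel training with PyTorch DistributedDataParallel, which inserts the all-reduce over gradients during the backward pass; the toy Linear model, batch size, learning rate, and `torchrun` launch are illustrative assumptions.

```python
# Minimal data-parallel sketch (illustrative): every process holds a full model replica
# and a different data shard; DDP all-reduces (averages) gradients during backward() so
# all replicas apply the same synchronous update.
# Launch with: torchrun --nproc_per_node=<num_gpus> this_script.py
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

def main():
    dist.init_process_group(backend="nccl")           # one process per GPU
    rank = dist.get_rank()
    torch.cuda.set_device(rank)
    device = f"cuda:{rank}"

    model = torch.nn.Linear(1024, 1024).to(device)    # toy model replica
    model = DDP(model, device_ids=[rank])              # hooks in the gradient all-reduce
    optimizer = torch.optim.SGD(model.parameters(), lr=1e-3)

    for step in range(10):
        # Each rank draws a *different* batch (a random toy batch stands in for a data shard).
        x = torch.randn(32, 1024, device=device)
        y = torch.randn(32, 1024, device=device)

        loss = torch.nn.functional.mse_loss(model(x), y)
        loss.backward()        # gradients are summed/averaged across ranks here
        optimizer.step()       # every replica ends up with identical parameters
        optimizer.zero_grad()

    dist.destroy_process_group()

if __name__ == "__main__":
    main()
```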
Model parallelism
- Pipeline (Inter-Layer) Model Parallelism
  - Split sets of layers across multiple devices
  - Layers 0, 1, 2 and layers 3, 4, 5 are on different devices
- Tensor (Intra-Layer) Model Parallelism
  - Split individual layers across multiple devices
  - Both devices compute different parts of layers 0, 1, 2, 3, 4, 5
- These two approaches are complementary (see the sketch below)
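For contrast, a minimal single-process sketch (again illustrative, not the STRONGHOLD or Megatron-LM implementation) of the two splits on a toy 6-layer MLP over two GPUs: the pipeline version keeps layers 0-2 on cuda:0 and layers 3-5 on cuda:1, while the tensor version gives each device half of every layer's output columns; the hidden size, layer count, and device names are assumptions.

```python
# Toy contrast of inter-layer (pipeline) vs. intra-layer (tensor) model parallelism on
# two GPUs. Illustrative only: no micro-batching, no communication overlap.
import torch
import torch.nn as nn

HIDDEN = 1024   # assumed toy hidden size

class PipelineParallelMLP(nn.Module):
    """Inter-layer split: layers 0-2 live on cuda:0, layers 3-5 on cuda:1."""
    def __init__(self):
        super().__init__()
        self.stage0 = nn.Sequential(*[nn.Linear(HIDDEN, HIDDEN) for _ in range(3)]).to("cuda:0")
        self.stage1 = nn.Sequential(*[nn.Linear(HIDDEN, HIDDEN) for _ in range(3)]).to("cuda:1")

    def forward(self, x):
        x = self.stage0(x.to("cuda:0"))
        # Activations cross devices once, at the stage boundary.
        return self.stage1(x.to("cuda:1"))

class TensorParallelLinear(nn.Module):
    """Intra-layer split: each device holds half of the output columns of one Linear."""
    def __init__(self):
        super().__init__()
        self.half0 = nn.Linear(HIDDEN, HIDDEN // 2).to("cuda:0")
        self.half1 = nn.Linear(HIDDEN, HIDDEN // 2).to("cuda:1")

    def forward(self, x):
        # Both devices see the full input and compute *different parts* of the same layer.
        y0 = self.half0(x.to("cuda:0"))
        y1 = self.half1(x.to("cuda:1"))
        # Reassemble the column halves on one device.
        return torch.cat([y0, y1.to("cuda:0")], dim=-1)

if __name__ == "__main__":
    x = torch.randn(8, HIDDEN)
    print(PipelineParallelMLP()(x).shape)                       # torch.Size([8, 1024])
    tp = nn.Sequential(*[TensorParallelLinear() for _ in range(6)])
    print(tp(x).shape)                                           # torch.Size([8, 1024])
```

In the tensor-parallel case the final concatenation stands in for the all-gather that a multi-process implementation would perform after every layer, which is why the intra-layer split is more communication intensive than the inter-layer split.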
Model parallelism
- Pipeline (Inter-Layer) Model Parallelism
  - Less communication intensive
  - Generalizable to almost all DNNs
  - Can req