STRONGHOLD: Fast and Affordable Billion-scale Deep Learning Model Training
王玮, NLP Algorithm Expert, DAMO Academy, 2022/07/30

Foundation Models
- ML homogenizes learning algorithms (e.g., logistic regression); DL homogenizes model architectures (e.g., CNN)
- Foundation models homogenize the model itself (e.g., BERT, GPT-3)
- Figure from On the Opportunities and Risks of Foundation Models, https://arxiv.org/abs/2108.07258

Foundation Models: Training + Adaptation
- Pretrained on broad unannotated (multimodal) data at scale in a self-supervised way
- Adapted to a wide range of downstream tasks via fine-tuning
- "One is All"
- Figure from On the Opportunities and Risks of Foundation Models, https://arxiv.org/abs/2108.07258
Model Size vs. HW Capacity
- Transformer model size: ~2*10^4x per 5 years
- GPU memory: ~6x per 5 years
- We need more GPUs!
- [Figure: model parameters (B) vs. release date (2017-2022) for dense models (BERT-base/large, GPT, GPT-2, GPT-3, Megatron-LM, T-NLG, Megatron-Turing-NLG, T5-base/large/3B/11B/XXL, ALBERT, RoBERTa-large) and sparse models (Zhiyuan-Wudao2.0, Ali-M6, KUAIMODEL, GShard, Switch-base/large/XXL/C), against accelerator memory: P100 (12GB), TPU V2 (16GB), V100 (32GB), TPU V3 (32GB), A100 (40GB), A100 (80GB)]

Data parallelism
- Distribute data across processors
- Processed in parallel, and parameters are updated synchronously
- Communication happens at the all-reduce operations to sum the gradients from all processors
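To make the synchronous update concrete, here is a minimal sketch (not taken from the talk) of data-parallel training with PyTorch DistributedDataParallel, which inserts the all-reduce over gradients during the backward pass; the toy Linear model, batch size, learning rate, and `torchrun` launch are illustrative assumptions.

```python
# Minimal data-parallel sketch (illustrative): every process holds a full model replica
# and a different data shard; DDP all-reduces (averages) gradients during backward() so
# all replicas apply the same synchronous update.
# Launch with: torchrun --nproc_per_node=<num_gpus> this_script.py
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

def main():
    dist.init_process_group(backend="nccl")           # one process per GPU
    rank = dist.get_rank()
    torch.cuda.set_device(rank)
    device = f"cuda:{rank}"

    model = torch.nn.Linear(1024, 1024).to(device)    # toy model replica
    model = DDP(model, device_ids=[rank])              # hooks in the gradient all-reduce
    optimizer = torch.optim.SGD(model.parameters(), lr=1e-3)

    for step in range(10):
        # Each rank draws a *different* batch (a random toy batch stands in for a data shard).
        x = torch.randn(32, 1024, device=device)
        y = torch.randn(32, 1024, device=device)

        loss = torch.nn.functional.mse_loss(model(x), y)
        loss.backward()        # gradients are summed/averaged across ranks here
        optimizer.step()       # every replica ends up with identical parameters
        optimizer.zero_grad()

    dist.destroy_process_group()

if __name__ == "__main__":
    main()
```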
Model parallelism
- Pipeline (Inter-Layer) Model Parallelism
  - Split sets of layers across multiple devices
  - Layers 0, 1, 2 and layers 3, 4, 5 are on different devices
- Tensor (Intra-Layer) Model Parallelism
  - Split individual layers across multiple devices
  - Both devices compute different parts of layers 0, 1, 2, 3, 4, 5
- These two approaches are complementary (see the sketch below)
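For contrast, a minimal single-process sketch (again illustrative, not the STRONGHOLD or Megatron-LM implementation) of the two splits on a toy 6-layer MLP over two GPUs: the pipeline version keeps layers 0-2 on cuda:0 and layers 3-5 on cuda:1, while the tensor version gives each device half of every layer's output columns; the hidden size, layer count, and device names are assumptions.

```python
# Toy contrast of inter-layer (pipeline) vs. intra-layer (tensor) model parallelism on
# two GPUs. Illustrative only: no micro-batching, no communication overlap.
import torch
import torch.nn as nn

HIDDEN = 1024   # assumed toy hidden size

class PipelineParallelMLP(nn.Module):
    """Inter-layer split: layers 0-2 live on cuda:0, layers 3-5 on cuda:1."""
    def __init__(self):
        super().__init__()
        self.stage0 = nn.Sequential(*[nn.Linear(HIDDEN, HIDDEN) for _ in range(3)]).to("cuda:0")
        self.stage1 = nn.Sequential(*[nn.Linear(HIDDEN, HIDDEN) for _ in range(3)]).to("cuda:1")

    def forward(self, x):
        x = self.stage0(x.to("cuda:0"))
        # Activations cross devices once, at the stage boundary.
        return self.stage1(x.to("cuda:1"))

class TensorParallelLinear(nn.Module):
    """Intra-layer split: each device holds half of the output columns of one Linear."""
    def __init__(self):
        super().__init__()
        self.half0 = nn.Linear(HIDDEN, HIDDEN // 2).to("cuda:0")
        self.half1 = nn.Linear(HIDDEN, HIDDEN // 2).to("cuda:1")

    def forward(self, x):
        # Both devices see the full input and compute *different parts* of the same layer.
        y0 = self.half0(x.to("cuda:0"))
        y1 = self.half1(x.to("cuda:1"))
        # Reassemble the column halves on one device.
        return torch.cat([y0, y1.to("cuda:0")], dim=-1)

if __name__ == "__main__":
    x = torch.randn(8, HIDDEN)
    print(PipelineParallelMLP()(x).shape)                       # torch.Size([8, 1024])
    tp = nn.Sequential(*[TensorParallelLinear() for _ in range(6)])
    print(tp(x).shape)                                           # torch.Size([8, 1024])
```

In the tensor-parallel case the final concatenation stands in for the all-gather that a multi-process implementation would perform after every layer, which is why the intra-layer split is more communication intensive than the inter-layer split.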
Model parallelism
- Pipeline (Inter-Layer) Model Parallelism
  - Less communication intensive
  - Generalizable to almost all DNNs
  - Can req