Chen Zhang (张辰): Exploring Performance Optimization for Distributed Training of Large-Scale Language Models on Moore Threads Full-Featured GPUs (PDF slide deck, 27 pages)
ML-Summit  www.cpp-  www.ml-summit.org  www.gosim.org  www.pm-summit.org

Speaker: Chen Zhang (张辰)
  Senior Algorithm Engineer at Moore Threads; formerly Senior Algorithm Researcher at Tencent
  Responsible for distributed-training R&D at Moore Threads
  More than ten years of experience in NLP, focusing on NLP algorithms, distributed training, and large-scale optimization
  Took part in optimizing Tencent's Sou Yi Sou (搜一搜) search business; led a team in the CLUE large-model benchmark evaluation, placing in the Top 10 with a model under 1B parameters
  Deep learning veteran; MXNet.cpp committer

Talk topic: MooreThreads Full-Featured GPU Distributed Training Performance Optimization Exploration for Large-Scale Language Models
Chen Zhang, MooreThreads

MT Megatron Introduction
  MT Megatron team's historical performance
  Support for various training strategies; support for FP8 mixed-precision training

Performance Optimization

Llama Performance Optimization
  [Chart: llama3 8B; axis ticks 0 to 100]
  Optimization for dense models

DeepSeek Performance Optimization
  Modeling: MLA, Balancing, Dual pp, MTP
  Loss Alignment: Compare Tools
  Profiling: MT Profiler, MT HTA
  Performance Estimation: Simumax
  Optimization: Fusion, Recompute

DeepSeek Perf: Modeling
  Device Limited Loss
  Device Limited Router
  Sequence Aux loss
  Comm Balance Loss
  Token drop strategy
  Node Limited Routing
  Aux Free Routing
  Post/pre-normalized routing score

DeepSeek Perf: Loss Alignment
  A complete set of precision alignment processes
  Comparison tools

DeepSeek Perf: Profiling
  Use MT Profiler to obtain baseline data
  Analyze the data with MT HTA
  Accurately estimate bottleneck gains
  [Chart: axis ticks 0 to 40]

DeepSeek Perf: Performance Estimation
  Use Simumax for performance estimation and automatic parallelization
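The profiling slide mentions estimating bottleneck gains from baseline data. As a rough illustration of what such an estimate can look like, the sketch below applies an Amdahl's-law calculation to one profiled bottleneck. It is not the method used by MT Profiler, MT HTA, or Simumax; the function names and the example timings are hypothetical and do not come from the deck.

```python
def estimated_step_time(step_ms: float, bottleneck_ms: float, speedup: float) -> float:
    """Estimate the new step time if one bottleneck is made `speedup` times faster.

    Assumes the bottleneck does not overlap with other work (Amdahl's law),
    so the rest of the step is unchanged.
    """
    if not 0.0 <= bottleneck_ms <= step_ms:
        raise ValueError("bottleneck_ms must be between 0 and step_ms")
    if speedup <= 0.0:
        raise ValueError("speedup must be positive")
    untouched_ms = step_ms - bottleneck_ms
    return untouched_ms + bottleneck_ms / speedup


def overall_gain(step_ms: float, bottleneck_ms: float, speedup: float) -> float:
    """Return the end-to-end speedup of the whole training step."""
    return step_ms / estimated_step_time(step_ms, bottleneck_ms, speedup)


if __name__ == "__main__":
    # Hypothetical numbers: a 1000 ms step in which one kernel takes 400 ms.
    new_time = estimated_step_time(1000.0, 400.0, 2.0)
    gain = overall_gain(1000.0, 400.0, 2.0)
    print(f"new step time: {new_time:.1f} ms, overall gain: {gain:.2f}x")
```

With these made-up inputs, doubling the speed of a kernel that takes 40% of the step yields an overall gain of only 1.25x, which is why ranking bottlenecks by their share of step time matters before optimizing. In real training runs, communication and compute often overlap, so this no-overlap estimate should be treated as a first approximation only.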