DeepSeek-V3 Technical Report

DeepSeek-AI

Abstract

We present DeepSeek-V3, a strong Mixture-of-Experts (MoE) language model with 671B total parameters with 37B activated for each token. To achieve efficient inference and cost-effective training, DeepSeek-V3 adopts Multi-head Latent Attention (MLA) and DeepSeekMoE architectures, which were thoroughly validated in DeepSeek-V2. Furthermore, DeepSeek-V3 pioneers an auxiliary-loss-free strategy for load balancing and sets a multi-token prediction training objective for stronger performance. We pre-train DeepSeek-V3 on 14.8 trillion diverse and high-quality tokens, followed by Supervised Fine-Tuning and Reinforcement Learning stages to fully harness its capabilities. Comprehensive evaluations reveal that DeepSeek-V3 outperforms other open-source models and achieves performance comparable to leading closed-source models. Despite its excellent performance, DeepSeek-V3 requires only 2.788M H800 GPU hours for its full training. In addition, its training process is remarkably stable. Throughout the entire training process, we did not experience any irrecoverable loss spikes or perform any rollbacks. The model checkpoints are available at https://
[Figure 1 | Benchmark performance of DeepSeek-V3 and its counterparts. Bar chart reporting Accuracy/Percentile (%) for DeepSeek-V3, DeepSeek-V2.5, Qwen2.5-72B-Inst, Llama-3.1-405B-Inst, GPT-4o-0513, and Claude-3.5-Sonnet-1022 on benchmarks including MATH 500 (EM), AIME 2024 (Pass@1), Codeforces (Percentile), and SWE-bench Verified (Resolved).]

Contents

1 Introduction
2 Architecture
  2.1 Basic Architecture
    2.1.1 Multi-Head Latent Attention
    2.1.2 DeepSeekMoE with Auxiliary-Loss-Free Load Balancing
  2.2 Multi-Token Prediction
3 Infrastructures
  3.1 Compute Clusters
  3.2 Training Framework
    3.2.1 DualPipe and Computation-Communication Overlap
    3.2.2 Efficient Implementation of Cross-Node All-to-All Communication
    3.2.3 Extremely Memory Saving with Minimal Overhead
  3.3 FP8 Training
    3.3.1 Mixed Precision Framework
    3.3.2 Improved Precision from Quantization and Multiplication
    3.3.3 Low-Precision Storage and Communication
  3.4 Inference and Deployment
    3.4.1 Prefilling
    3.4.2 Decoding
  3.5 Suggestions on Hardware Design
    3.5.1 Communication Hardware
    3.5.2 Compute Hardware
4 Pre-Training
  4.1 Data Construction
  4.2 Hyper-Parameters
  4.3 Long Context Extension
  4.4 Evaluations
    4.4.1 Evaluation Benchmarks
    4.4.2 Evaluation Results
  4.5 Discussion
    4.5.1 Ablation Studies for Multi-Token Prediction
    4.5.2 Ablation Studies for the Auxiliary-Loss-Free Balancing Strategy
    4.5.3 Batch-Wise Load Balance vs. Sequence-Wise Load Balance
5 Post-Training
  5.1 Supervised Fine-Tuning
  5.2 Reinforcement Learning
    5.2.1 Reward Model
    5.2.2 Group Relative Policy Optimization
  5.3 Evaluations
    5.3.1 Evaluation Settings
    5.3.2 Standard Evaluation
    5.3.3 Open-Ended Evaluation
    5.3.4 DeepSeek-V3 as a Generative Reward Model
  5.4 Discussion
    5.4.1 Distillation from DeepSeek-R1
    5.4.2 Self-Rewarding
    5.4.3 Multi-Token Prediction Evaluation
6 Conclusion, Limitations, and Future Directions
A Contributions and Acknowledgments
B Ablation Studies for Low-Precision Training
  B.1 FP8 vs. BF16 Training
  B.2 Discussion About Block-Wise Quantization
C Expert Specialization Patterns of the 16B Aux-Loss-Based and Aux-Loss-Free Models
1. Introduction

In recent years, Large Language Models (LLMs) have been undergoing rapid iteration and evolution (Anthropic, 2024; Google, 2024; OpenAI, 2024a), progressively diminishing the gap towards Artificial General Intelligence (AGI). Beyond closed-source models, open-source models, including the DeepSeek series (DeepSeek-AI, 2024a,b,c; Guo et al., 2024), LLaMA series (AI@Meta, 2024a,b; Touvron et al., 2023a,b), Qwen series (Qwen, 2023, 2024a,b), and Mistral series (Jiang et al., 2023; Mistral, 2024), are also making significant strides, endeavoring to close the gap with their closed-source counterparts. To further push the boundaries of open-source model capabilities, we scale up our models and introduce DeepSeek-V3, a large Mixture-of-Experts (MoE) model with 671B parameters, of which 37B are activated for each token.

With a forward-looking perspective, we consistently strive for strong model performance and economical costs. Therefore, in terms of architecture, DeepSeek-V3 still adopts Multi-head Latent Attention (MLA) (DeepSeek-AI, 2024c) for efficient inference and DeepSeekMoE (Dai et al., 2024) for cost-effective training. These two architectures have been validated in DeepSeek-V2 (DeepSeek-AI, 2024c), demonstrating their capability to maintain robust model performance while achieving efficient training and inference. Beyond the basic architecture, we implement two additional strategies to further enhance the model capabilities. Firstly, DeepSeek-V3 pioneers an auxiliary-loss-free strategy (Wang et al., 2024a) for load balancing, with the aim of minimizing the adverse impact on model performance that arises from the effort to encourage load balancing. Secondly, DeepSeek-V3 employs a multi-token prediction training objective, which we have observed to enhance the overall performance on evaluation benchmarks.

In order to achieve efficient training, we support FP8 mixed precision training and implement comprehensive optimizations for the training framework. Low-precision training has emerged as a promising solution for efficient training (Dettmers et al., 2022; Kalamkar et al., 2019; Narang et al., 2017; Peng et al., 2023b), its evolution being closely tied to advancements in hardware capabilities (Luo et al., 2024; Micikevicius et al., 2022; Rouhani et al., 2023a). In this work, we introduce an FP8 mixed precision training framework and, for the first time, validate its effectiveness on an extremely large-scale model. Through the support for FP8 computation and storage, we achieve both accelerated training and reduced GPU memory usage. As for the training framework, we design the DualPipe algorithm for efficient pipeline parallelism, which has fewer pipeline bubbles and hides most of the communication during training through computation-communication overlap. This overlap ensures that, as the model further scales up, as long as we maintain a constant computation-to-communication ratio, we can still employ fine-grained experts across nodes while achieving a near-zero all-to-all communication overhead. In addition, we also develop efficient cross-node all-to-all communication kernels to fully utilize InfiniBand (IB) and NVLink bandwidths. Furthermore, we meticulously optimize the memory footprint, making it possible to train DeepSeek-V3 without using costly tensor parallelism. Combining these efforts, we achieve high training efficiency.

During pre-training, we train DeepSeek-V3 on 14.8T high-quality and diverse tokens. The pre-training process is remarkably stable. Throughout the entire training process, we did not encounter any irrecoverable loss spikes or have to roll back. Next, we conduct a two-stage context length extension for DeepSeek-V3. In the first stage, the maximum context length is extended to 32K, and in the second stage, it is further extended to 128K. Following this, we conduct post-training, including Supervised Fine-Tuning (SFT) and Reinforcement Learning (RL) on the base model of DeepSeek-V3, to align it with human preferences and further unlock its potential.
During the post-training stage, we distill the reasoning capability from the DeepSeek-R1 series of models, and meanwhile carefully maintain the balance between model accuracy and generation length.

| Training Costs    | Pre-Training | Context Extension | Post-Training | Total   |
| in H800 GPU Hours | 2664K        | 119K              | 5K            | 2788K   |
| in USD            | $5.328M      | $0.238M           | $0.01M        | $5.576M |

Table 1 | Training costs of DeepSeek-V3, assuming the rental price of the H800 is $2 per GPU hour.

We evaluate DeepSeek-V3 on a comprehensive array of benchmarks.
Despite its economical training costs, comprehensive evaluations reveal that DeepSeek-V3-Base has emerged as the strongest open-source base model currently available, especially in code and math. Its chat version also outperforms other open-source models and achieves performance comparable to leading closed-source models, including GPT-4o and Claude-3.5-Sonnet, on a series of standard and open-ended benchmarks.

Lastly, we emphasize again the economical training costs of DeepSeek-V3, summarized in Table 1, achieved through our optimized co-design of algorithms, frameworks, and hardware. During the pre-training stage, training DeepSeek-V3 on each trillion tokens requires only 180K H800 GPU hours, i.e., 3.7 days on our cluster with 2048 H800 GPUs. Consequently, our pre-training stage is completed in less than two months and costs 2664K GPU hours. Combined with 119K GPU hours for the context length extension and 5K GPU hours for post-training, DeepSeek-V3 costs only 2.788M GPU hours for its full training. Assuming the rental price of the H800 GPU is $2 per GPU hour, our total training costs amount to only $5.576M. Note that the aforementioned costs include only the official training of DeepSeek-V3, excluding the costs associated with prior research and ablation experiments on architectures, algorithms, or data.
Our main contributions include:

Architecture: Innovative Load Balancing Strategy and Training Objective
- On top of the efficient architecture of DeepSeek-V2, we pioneer an auxiliary-loss-free strategy for load balancing, which minimizes the performance degradation that arises from encouraging load balancing.
- We investigate a Multi-Token Prediction (MTP) objective and prove it beneficial to model performance. It can also be used for speculative decoding for inference acceleration.

Pre-Training: Towards Ultimate Training Efficiency
- We design an FP8 mixed precision training framework and, for the first time, validate the feasibility and effectiveness of FP8 training on an extremely large-scale model.
- Through the co-design of algorithms, frameworks, and hardware, we overcome the communication bottleneck in cross-node MoE training, achieving near-full computation-communication overlap. This significantly enhances our training efficiency and reduces the training costs, enabling us to further scale up the model size without additional overhead.
- At an economical cost of only 2.664M H800 GPU hours, we complete the pre-training of DeepSeek-V3 on 14.8T tokens, producing the currently strongest open-source base model. The subsequent training stages after pre-training require only 0.1M GPU hours.

Post-Training: Knowledge Distillation from DeepSeek-R1
- We introduce an innovative methodology to distill reasoning capabilities from the long-Chain-of-Thought (CoT) model, specifically from one of the DeepSeek R1 series models, into standard LLMs, particularly DeepSeek-V3. Our pipeline elegantly incorporates the verification and reflection patterns of R1 into DeepSeek-V3 and notably improves its reasoning performance. Meanwhile, we also maintain control over the output style and length of DeepSeek-V3.

Summary of Core Evaluation Results
- Knowledge: (1) On educational benchmarks such as MMLU, MMLU-Pro, and GPQA, DeepSeek-V3 outperforms all other open-source models, achieving 88.5 on MMLU, 75.9 on MMLU-Pro, and 59.1 on GPQA. Its performance is comparable to leading closed-source models like GPT-4o and Claude-Sonnet-3.5, narrowing the gap between open-source and closed-source models in this domain. (2) For factuality benchmarks, DeepSeek-V3 demonstrates superior performance among open-source models on both SimpleQA and Chinese SimpleQA. While it trails behind GPT-4o and Claude-Sonnet-3.5 in English factual knowledge (SimpleQA), it surpasses these models in Chinese factual knowledge (Chinese SimpleQA), highlighting its strength in Chinese factual knowledge.
- Code, Math, and Reasoning: (1) DeepSeek-V3 achieves state-of-the-art performance on math-related benchmarks among all non-long-CoT open-source and closed-source models. Notably, it even outperforms o1-preview on specific benchmarks, such as MATH-500, demonstrating its robust mathematical reasoning capabilities. (2) On coding-related tasks, DeepSeek-V3 emerges as the top-performing model for coding competition benchmarks, such as LiveCodeBench, solidifying its position as the leading model in this domain. For engineering-related tasks, while DeepSeek-V3 performs slightly below Claude-Sonnet-3.5, it still outpaces all other models by a significant margin, demonstrating its competitiveness across diverse technical benchmarks.
In the remainder of this paper, we first present a detailed exposition of our DeepSeek-V3 model architecture (Section 2). Subsequently, we introduce our infrastructures, encompassing our compute clusters, the training framework, the support for FP8 training, the inference deployment strategy, and our suggestions on future hardware design. Next, we describe our pre-training process, including the construction of training data, hyper-parameter settings, long-context extension techniques, the associated evaluations, as well as some discussions (Section 4). Thereafter, we discuss our efforts on post-training, which include Supervised Fine-Tuning (SFT), Reinforcement Learning (RL), the corresponding evaluations, and discussions (Section 5). Lastly, we conclude this work, discuss existing limitations of DeepSeek-V3, and propose potential directions for future research (Section 6).
2. Architecture

We first introduce the basic architecture of DeepSeek-V3, featured by Multi-head Latent Attention (MLA) (DeepSeek-AI, 2024c) for efficient inference and DeepSeekMoE (Dai et al., 2024) for economical training. Then, we present a Multi-Token Prediction (MTP) training objective, which we have observed to enhance the overall performance on evaluation benchmarks. For other minor details not explicitly mentioned, DeepSeek-V3 adheres to the settings of DeepSeek-V2 (DeepSeek-AI, 2024c).

2.1. Basic Architecture

The basic architecture of DeepSeek-V3 is still within the Transformer (Vaswani et al., 2017) framework. For efficient inference and economical training, DeepSeek-V3 also adopts MLA and DeepSeekMoE, which have been thoroughly validated by DeepSeek-V2. Compared with DeepSeek-V2, an exception is that we additionally introduce an auxiliary-loss-free load balancing strategy (Wang et al., 2024a) for DeepSeekMoE to mitigate the performance degradation induced by the effort to ensure load balance. Figure 2 illustrates the basic architecture of DeepSeek-V3, and we will briefly review the details of MLA and DeepSeekMoE in this section.

[Figure 2 | Illustration of the basic architecture of DeepSeek-V3. Following DeepSeek-V2, we adopt MLA and DeepSeekMoE for efficient inference and economical training. The diagram shows a Transformer block with RMSNorm, Multi-Head Latent Attention (whose compressed latent vectors are cached during inference), and a DeepSeekMoE feed-forward layer with a router, shared experts, and top-K routed experts.]

2.1.1. Multi-Head Latent Attention
For attention, DeepSeek-V3 adopts the MLA architecture. Let $d$ denote the embedding dimension, $n_h$ denote the number of attention heads, $d_h$ denote the dimension per head, and $\mathbf{h}_t \in \mathbb{R}^{d}$ denote the attention input for the $t$-th token at a given attention layer. The core of MLA is the low-rank joint compression for attention keys and values to reduce the Key-Value (KV) cache during inference:

$\mathbf{c}_t^{KV} = W^{DKV} \mathbf{h}_t$,  (1)
$[\mathbf{k}_{t,1}^{C}; \mathbf{k}_{t,2}^{C}; \ldots; \mathbf{k}_{t,n_h}^{C}] = \mathbf{k}_t^{C} = W^{UK} \mathbf{c}_t^{KV}$,  (2)
$\mathbf{k}_t^{R} = \mathrm{RoPE}(W^{KR} \mathbf{h}_t)$,  (3)
$\mathbf{k}_{t,i} = [\mathbf{k}_{t,i}^{C}; \mathbf{k}_t^{R}]$,  (4)
$[\mathbf{v}_{t,1}^{C}; \mathbf{v}_{t,2}^{C}; \ldots; \mathbf{v}_{t,n_h}^{C}] = \mathbf{v}_t^{C} = W^{UV} \mathbf{c}_t^{KV}$,  (5)

where $\mathbf{c}_t^{KV} \in \mathbb{R}^{d_c}$ is the compressed latent vector for keys and values; $d_c$ ($\ll d_h n_h$) indicates the KV compression dimension; $W^{DKV} \in \mathbb{R}^{d_c \times d}$ denotes the down-projection matrix; $W^{UK}, W^{UV} \in \mathbb{R}^{d_h n_h \times d_c}$ are the up-projection matrices for keys and values, respectively; $W^{KR} \in \mathbb{R}^{d_h^R \times d}$ is the matrix used to produce the decoupled key that carries Rotary Positional Embedding (RoPE) (Su et al., 2024); $\mathrm{RoPE}(\cdot)$ denotes the operation that applies RoPE matrices; and $[\cdot;\cdot]$ denotes concatenation. Note that for MLA, only $\mathbf{c}_t^{KV}$ and $\mathbf{k}_t^{R}$ need to be cached during generation, which results in significantly reduced KV cache while maintaining performance comparable to standard Multi-Head Attention (MHA) (Vaswani et al., 2017).

For the attention queries, we also perform a low-rank compression, which can reduce the activation memory during training:

$\mathbf{c}_t^{Q} = W^{DQ} \mathbf{h}_t$,  (6)
$[\mathbf{q}_{t,1}^{C}; \mathbf{q}_{t,2}^{C}; \ldots; \mathbf{q}_{t,n_h}^{C}] = \mathbf{q}_t^{C} = W^{UQ} \mathbf{c}_t^{Q}$,  (7)
$[\mathbf{q}_{t,1}^{R}; \mathbf{q}_{t,2}^{R}; \ldots; \mathbf{q}_{t,n_h}^{R}] = \mathbf{q}_t^{R} = \mathrm{RoPE}(W^{QR} \mathbf{c}_t^{Q})$,  (8)
$\mathbf{q}_{t,i} = [\mathbf{q}_{t,i}^{C}; \mathbf{q}_{t,i}^{R}]$,  (9)

where $\mathbf{c}_t^{Q} \in \mathbb{R}^{d_c'}$ is the compressed latent vector for queries; $d_c'$ ($\ll d_h n_h$) denotes the query compression dimension; $W^{DQ} \in \mathbb{R}^{d_c' \times d}$ and $W^{UQ} \in \mathbb{R}^{d_h n_h \times d_c'}$ are the down-projection and up-projection matrices for queries, respectively; and $W^{QR} \in \mathbb{R}^{d_h^R n_h \times d_c'}$ is the matrix to produce the decoupled queries that carry RoPE.

Ultimately, the attention queries ($\mathbf{q}_{t,i}$), keys ($\mathbf{k}_{j,i}$), and values ($\mathbf{v}_{j,i}^{C}$) are combined to yield the final attention output $\mathbf{u}_t$:

$\mathbf{o}_{t,i} = \sum_{j=1}^{t} \mathrm{Softmax}_j\left(\frac{\mathbf{q}_{t,i}^{\top}\mathbf{k}_{j,i}}{\sqrt{d_h + d_h^R}}\right) \mathbf{v}_{j,i}^{C}$,  (10)
$\mathbf{u}_t = W^{O} [\mathbf{o}_{t,1}; \mathbf{o}_{t,2}; \ldots; \mathbf{o}_{t,n_h}]$,  (11)

where $W^{O} \in \mathbb{R}^{d \times d_h n_h}$ denotes the output projection matrix.
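To make the caching behavior concrete, the following PyTorch-style sketch (illustrative only; module and variable names are ours, not the released implementation) traces the KV path of Eqs. (1)-(5) with the hyper-parameters reported in Section 4.2, omitting RoPE application and the query path:

```python
import torch
import torch.nn as nn

class MLAKVCompression(nn.Module):
    """Minimal sketch of MLA's low-rank KV compression (Eqs. 1-5)."""
    def __init__(self, d_model=7168, n_heads=128, d_head=128, d_c=512, d_rope=64):
        super().__init__()
        self.n_heads, self.d_head = n_heads, d_head
        self.w_dkv = nn.Linear(d_model, d_c, bias=False)          # W^{DKV}: down-projection
        self.w_uk = nn.Linear(d_c, n_heads * d_head, bias=False)  # W^{UK}: key up-projection
        self.w_uv = nn.Linear(d_c, n_heads * d_head, bias=False)  # W^{UV}: value up-projection
        self.w_kr = nn.Linear(d_model, d_rope, bias=False)        # W^{KR}: decoupled RoPE key

    def forward(self, h):                 # h: [batch, seq, d_model]
        c_kv = self.w_dkv(h)              # compressed latent, cached during inference
        k_rope = self.w_kr(h)             # decoupled key (RoPE applied in the full model), also cached
        # Only c_kv and k_rope are kept in the KV cache; per-head keys/values are
        # re-materialized from the latent on the fly.
        b, s, _ = h.shape
        k_c = self.w_uk(c_kv).view(b, s, self.n_heads, self.d_head)
        v_c = self.w_uv(c_kv).view(b, s, self.n_heads, self.d_head)
        return c_kv, k_rope, k_c, v_c
```

Because only the $d_c$-dimensional latent and the shared $d_h^R$-dimensional RoPE key are cached, the per-token cache under this head configuration is roughly $d_c + d_h^R = 576$ values, compared with the $2 n_h d_h = 32{,}768$ values a standard MHA cache would hold.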
2.1.2. DeepSeekMoE with Auxiliary-Loss-Free Load Balancing

Basic Architecture of DeepSeekMoE. For Feed-Forward Networks (FFNs), DeepSeek-V3 employs the DeepSeekMoE architecture (Dai et al., 2024). Compared with traditional MoE architectures like GShard (Lepikhin et al., 2021), DeepSeekMoE uses finer-grained experts and isolates some experts as shared ones. Let $\mathbf{u}_t$ denote the FFN input of the $t$-th token; we compute the FFN output $\mathbf{h}_t'$ as follows:

$\mathbf{h}_t' = \mathbf{u}_t + \sum_{i=1}^{N_s} \mathrm{FFN}_i^{(s)}(\mathbf{u}_t) + \sum_{i=1}^{N_r} g_{i,t}\, \mathrm{FFN}_i^{(r)}(\mathbf{u}_t)$,  (12)
$g_{i,t} = \frac{g_{i,t}'}{\sum_{j=1}^{N_r} g_{j,t}'}$,  (13)
$g_{i,t}' = \begin{cases} s_{i,t}, & s_{i,t} \in \mathrm{Topk}(\{s_{j,t} \mid 1 \leq j \leq N_r\}, K_r), \\ 0, & \text{otherwise}, \end{cases}$  (14)
$s_{i,t} = \mathrm{Sigmoid}(\mathbf{u}_t^{\top} \mathbf{e}_i)$,  (15)

where $N_s$ and $N_r$ denote the numbers of shared experts and routed experts, respectively; $\mathrm{FFN}_i^{(s)}(\cdot)$ and $\mathrm{FFN}_i^{(r)}(\cdot)$ denote the $i$-th shared expert and the $i$-th routed expert, respectively; $K_r$ denotes the number of activated routed experts; $g_{i,t}$ is the gating value for the $i$-th expert; $s_{i,t}$ is the token-to-expert affinity; $\mathbf{e}_i$ is the centroid vector of the $i$-th routed expert; and $\mathrm{Topk}(\cdot, K)$ denotes the set comprising the $K$ highest scores among the affinity scores calculated for the $t$-th token and all routed experts. Slightly different from DeepSeek-V2, DeepSeek-V3 uses the sigmoid function to compute the affinity scores, and applies a normalization among all selected affinity scores to produce the gating values.

Auxiliary-Loss-Free Load Balancing. For MoE models, an unbalanced expert load will lead to routing collapse (Shazeer et al., 2017) and diminish computational efficiency in scenarios with expert parallelism. Conventional solutions usually rely on the auxiliary loss (Fedus et al., 2021; Lepikhin et al., 2021) to avoid unbalanced load. However, too large an auxiliary loss will impair the model performance (Wang et al., 2024a). To achieve a better trade-off between load balance and model performance, we pioneer an auxiliary-loss-free load balancing strategy (Wang et al., 2024a) to ensure load balance. To be specific, we introduce a bias term $b_i$ for each expert and add it to the corresponding affinity scores to determine the top-K routing:

$g_{i,t}' = \begin{cases} s_{i,t}, & s_{i,t} + b_i \in \mathrm{Topk}(\{s_{j,t} + b_j \mid 1 \leq j \leq N_r\}, K_r), \\ 0, & \text{otherwise}. \end{cases}$  (16)

Note that the bias term is only used for routing. The gating value, which will be multiplied with the FFN output, is still derived from the original affinity score $s_{i,t}$. During training, we keep monitoring the expert load on the whole batch of each training step. At the end of each step, we decrease the bias term by $\gamma$ if its corresponding expert is overloaded, and increase it by $\gamma$ if its corresponding expert is underloaded, where $\gamma$ is a hyper-parameter called the bias update speed. Through this dynamic adjustment, DeepSeek-V3 keeps balanced expert load during training, and achieves better performance than models that encourage load balance through pure auxiliary losses.

Complementary Sequence-Wise Auxiliary Loss. Although DeepSeek-V3 mainly relies on the auxiliary-loss-free strategy for load balance, to prevent extreme imbalance within any single sequence, we also employ a complementary sequence-wise balance loss:

$\mathcal{L}_{\mathrm{Bal}} = \alpha \sum_{i=1}^{N_r} f_i P_i$,  (17)
$f_i = \frac{N_r}{K_r T} \sum_{t=1}^{T} \mathbb{1}\left(s_{i,t} \in \mathrm{Topk}(\{s_{j,t} \mid 1 \leq j \leq N_r\}, K_r)\right)$,  (18)
$s_{i,t}' = \frac{s_{i,t}}{\sum_{j=1}^{N_r} s_{j,t}}$,  (19)
$P_i = \frac{1}{T} \sum_{t=1}^{T} s_{i,t}'$,  (20)

where the balance factor $\alpha$ is a hyper-parameter, which will be assigned an extremely small value for DeepSeek-V3; $\mathbb{1}(\cdot)$ denotes the indicator function; and $T$ denotes the number of tokens in a sequence. The sequence-wise balance loss encourages the expert load on each sequence to be balanced.
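As an illustration of how Eqs. (13)-(16) and the per-step bias update interact, the sketch below (plain PyTorch; function names are hypothetical and the batching/dispatch machinery of the real framework is omitted) routes a batch of tokens and then nudges the per-expert biases:

```python
import torch

def route_tokens(u, centroids, bias, k_r=8):
    """Bias-adjusted top-K routing (a sketch of Eqs. 13-16).

    u:         [num_tokens, d_model]  FFN inputs
    centroids: [n_experts, d_model]   expert centroid vectors e_i
    bias:      [n_experts]            per-expert bias b_i (used for routing only)
    """
    s = torch.sigmoid(u @ centroids.T)                 # affinities s_{i,t}, Eq. (15)
    topk_idx = (s + bias).topk(k_r, dim=-1).indices    # selection uses s + b, Eq. (16)
    gates = torch.zeros_like(s).scatter(-1, topk_idx, s.gather(-1, topk_idx))
    gates = gates / gates.sum(-1, keepdim=True)        # normalize selected scores, Eq. (13)
    return gates, topk_idx

def update_bias(bias, topk_idx, n_experts, gamma=1e-3):
    """End-of-step update with bias update speed gamma: lower overloaded experts, raise the rest."""
    load = torch.bincount(topk_idx.flatten(), minlength=n_experts).float()
    overloaded = load > load.mean()
    return bias - gamma * overloaded.float() + gamma * (~overloaded).float()
```

Note that the returned gates (derived from the raw affinities, not the biased scores) are what multiply the expert outputs in Eq. (12); the bias only shifts which experts are selected.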
[Figure 3 | Illustration of our Multi-Token Prediction (MTP) implementation. We keep the complete causal chain for the prediction of each token at each depth. The main model performs next-token prediction, while MTP Module 1 and MTP Module 2 predict the second and third next tokens; each module uses the shared embedding layer and shared output head, RMSNorm, a linear projection over the concatenated inputs, a Transformer block, and its own cross-entropy loss.]

Node-Limited Routing. Like the device-limited routing used by DeepSeek-V2, DeepSeek-V3 also uses a restricted routing mechanism to limit communication costs during training. In short, we ensure that each token will be sent to at most $M$ nodes, which are selected according to the sum of the highest $\frac{K_r}{M}$ affinity scores of the experts distributed on each node. Under this constraint, our MoE training framework can nearly achieve full computation-communication overlap.

No Token-Dropping. Due to the effective load balancing strategy, DeepSeek-V3 keeps a good load balance during its full training. Therefore, DeepSeek-V3 does not drop any tokens during training. In addition, we also implement specific deployment strategies to ensure inference load balance, so DeepSeek-V3 also does not drop tokens during inference.

2.2. Multi-Token Prediction
Inspired by Gloeckle et al. (2024), we investigate and set a Multi-Token Prediction (MTP) objective for DeepSeek-V3, which extends the prediction scope to multiple future tokens at each position. On the one hand, an MTP objective densifies the training signals and may improve data efficiency. On the other hand, MTP may enable the model to pre-plan its representations for better prediction of future tokens. Figure 3 illustrates our implementation of MTP. Different from Gloeckle et al. (2024), which parallelly predicts $D$ additional tokens using independent output heads, we sequentially predict additional tokens and keep the complete causal chain at each prediction depth. We introduce the details of our MTP implementation in this section.

MTP Modules. To be specific, our MTP implementation uses $D$ sequential modules to predict $D$ additional tokens. The $k$-th MTP module consists of a shared embedding layer $\mathrm{Emb}(\cdot)$, a shared output head $\mathrm{OutHead}(\cdot)$, a Transformer block $\mathrm{TRM}_k(\cdot)$, and a projection matrix $M_k \in \mathbb{R}^{d \times 2d}$. For the $i$-th input token $t_i$, at the $k$-th prediction depth, we first combine the representation of the $i$-th token at the $(k-1)$-th depth $\mathbf{h}_i^{k-1} \in \mathbb{R}^{d}$ and the embedding of the $(i+k)$-th token $\mathrm{Emb}(t_{i+k}) \in \mathbb{R}^{d}$ with the linear projection:

$\mathbf{h}_i'^{k} = M_k [\mathrm{RMSNorm}(\mathbf{h}_i^{k-1}); \mathrm{RMSNorm}(\mathrm{Emb}(t_{i+k}))]$,  (21)

where $[\cdot;\cdot]$ denotes concatenation. Especially, when $k = 1$, $\mathbf{h}_i^{k-1}$ refers to the representation given by the main model. Note that for each MTP module, its embedding layer is shared with the main model. The combined $\mathbf{h}_i'^{k}$ serves as the input of the Transformer block at the $k$-th depth to produce the output representation at the current depth $\mathbf{h}_i^{k}$:

$\mathbf{h}_{1:T-k}^{k} = \mathrm{TRM}_k(\mathbf{h}_{1:T-k}'^{k})$,  (22)

where $T$ represents the input sequence length and $i\!:\!j$ denotes the slicing operation (inclusive of both the left and right boundaries). Finally, taking $\mathbf{h}_i^{k}$ as the input, the shared output head will compute the probability distribution for the $k$-th additional prediction token $P_{i+1+k}^{k} \in \mathbb{R}^{V}$, where $V$ is the vocabulary size:

$P_{i+k+1}^{k} = \mathrm{OutHead}(\mathbf{h}_i^{k})$.  (23)

The output head $\mathrm{OutHead}(\cdot)$ linearly maps the representation to logits and subsequently applies the $\mathrm{Softmax}(\cdot)$ function to compute the prediction probabilities of the $k$-th additional token. Also, for each MTP module, its output head is shared with the main model. Our principle of maintaining the causal chain of predictions is similar to that of EAGLE (Li et al., 2024b), but its primary objective is speculative decoding (Leviathan et al., 2023; Xia et al., 2023), whereas we utilize MTP to improve training.

MTP Training Objective. For each prediction depth, we compute a cross-entropy loss $\mathcal{L}_{\mathrm{MTP}}^{k}$:

$\mathcal{L}_{\mathrm{MTP}}^{k} = \mathrm{CrossEntropy}(P_{2+k:T+1}^{k}, t_{2+k:T+1}) = -\frac{1}{T} \sum_{i=2+k}^{T+1} \log P_i^{k}[t_i]$,  (24)

where $T$ denotes the input sequence length, $t_i$ denotes the ground-truth token at the $i$-th position, and $P_i^{k}[t_i]$ denotes the corresponding prediction probability of $t_i$, given by the $k$-th MTP module. Finally, we compute the average of the MTP losses across all depths and multiply it by a weighting factor $\lambda$ to obtain the overall MTP loss $\mathcal{L}_{\mathrm{MTP}}$, which serves as an additional training objective for DeepSeek-V3:

$\mathcal{L}_{\mathrm{MTP}} = \frac{\lambda}{D} \sum_{k=1}^{D} \mathcal{L}_{\mathrm{MTP}}^{k}$.  (25)

MTP in Inference. Our MTP strategy mainly aims to improve the performance of the main model, so during inference, we can directly discard the MTP modules and the main model can function independently and normally. Additionally, we can also repurpose these MTP modules for speculative decoding to further reduce the generation latency.
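The following sketch (PyTorch-style, with a stand-in Transformer block and simplified shapes; it is not the HAI-LLM implementation) shows how one MTP module combines the previous-depth representation with the shifted token embeddings and produces the depth-$k$ loss of Eq. (24):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def rms_norm(x, eps=1e-6):
    # Weightless RMSNorm, kept minimal for the sketch.
    return x * torch.rsqrt(x.pow(2).mean(-1, keepdim=True) + eps)

class MTPModule(nn.Module):
    """Sketch of the k-th MTP module (Eqs. 21-23). The embedding layer and output head
    are the main model's own modules, passed in so that they are physically shared;
    `transformer_block` stands in for the model's actual Transformer block."""
    def __init__(self, d_model, shared_embedding, shared_head, transformer_block):
        super().__init__()
        self.embed, self.head, self.block = shared_embedding, shared_head, transformer_block
        self.proj = nn.Linear(2 * d_model, d_model, bias=False)  # projection matrix M_k

    def forward(self, h_prev, tokens_shifted_by_k):
        # h_prev: [B, T-k, d] representations from depth k-1 (the main model when k = 1)
        # tokens_shifted_by_k: [B, T-k] the tokens t_{i+k} aligned with positions i = 1..T-k
        x = self.proj(torch.cat([rms_norm(h_prev),
                                 rms_norm(self.embed(tokens_shifted_by_k))], dim=-1))  # Eq. (21)
        h_k = self.block(x)                                                            # Eq. (22)
        return h_k, self.head(h_k)                                                     # Eq. (23), logits

def mtp_depth_loss(logits, targets):
    # Cross-entropy over the k-th additional targets (Eq. 24), averaged over tokens.
    return F.cross_entropy(logits.flatten(0, 1), targets.flatten())
```

The per-depth losses would then be averaged and scaled by $\lambda$ as in Eq. (25); at inference time the module can simply be dropped, or its logits reused as draft predictions for speculative decoding.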
81、 model canfunction independently and normally.Additionally,we can also repurpose these MTP modulesfor speculative decoding to further improve the generation latency.3.Infrastructures3.1.Compute ClustersDeepSeek-V3 is trained on a cluster equipped with 2048 NVIDIA H800 GPUs.Each node inthe H800 clust
82、er contains 8 GPUs connected by NVLink and NVSwitch within nodes.Acrossdifferent nodes,InfiniBand(IB)interconnects are utilized to facilitate communications.11ComputationMLP(B)MLP(W)MLP(F)ATTN(B)ATTN(W)ATTN(F)CommunicationDISPATCH(F)DISPATCH(B)COMBINE(F)PPCOMBINE(B)Time Forward chunk Backward chunkF
83、igure 4|Overlapping strategy for a pair of individual forward and backward chunks(theboundaries of the transformer blocks are not aligned).Orange denotes forward,green denotesbackward for input,blue denotes backward for weights,purple denotes PP communication,and red denotes barriers.Both all-to-all
84、 and PP communication can be fully hidden.3.2.Training FrameworkThe training of DeepSeek-V3 is supported by the HAI-LLM framework,an efficient andlightweight training framework crafted by our engineers from the ground up.On the whole,DeepSeek-V3 applies 16-way Pipeline Parallelism(PP)(Qi et al.,2023
85、a),64-way Expert Paral-lelism(EP)(Lepikhin et al.,2021)spanning 8 nodes,and ZeRO-1 Data Parallelism(DP)(Rajb-handari et al.,2020).In order to facilitate efficient training of DeepSeek-V3,we implement meticulous engineeringoptimizations.Firstly,we design the DualPipe algorithm for efficient pipeline
86、parallelism.Compared with existing PP methods,DualPipe has fewer pipeline bubbles.More importantly,itoverlaps the computation and communication phases across forward and backward processes,thereby addressing the challenge of heavy communication overhead introduced by cross-nodeexpert parallelism.Sec
87、ondly,we develop efficient cross-node all-to-all communication kernelsto fully utilize IB and NVLink bandwidths and conserve Streaming Multiprocessors(SMs)dedicated to communication.Finally,we meticulously optimize the memory footprint duringtraining,thereby enabling us to train DeepSeek-V3 without
3.2.1. DualPipe and Computation-Communication Overlap

For DeepSeek-V3, the communication overhead introduced by cross-node expert parallelism results in an inefficient computation-to-communication ratio of approximately 1:1. To tackle this challenge, we design an innovative pipeline parallelism algorithm called DualPipe, which not only accelerates model training by effectively overlapping forward and backward computation-communication phases, but also reduces the pipeline bubbles.

The key idea of DualPipe is to overlap the computation and communication within a pair of individual forward and backward chunks. To be specific, we divide each chunk into four components: attention, all-to-all dispatch, MLP, and all-to-all combine. Specially, for a backward chunk, both attention and MLP are further split into two parts, backward for input and backward for weights, like in ZeroBubble (Qi et al., 2023b). In addition, we have a PP communication component. As illustrated in Figure 4, for a pair of forward and backward chunks, we rearrange these components and manually adjust the ratio of GPU SMs dedicated to communication versus computation. In this overlapping strategy, we can ensure that both all-to-all and PP communication can be fully hidden during execution. Given the efficient overlapping strategy, the full DualPipe scheduling is illustrated in Figure 5. It employs a bidirectional pipeline scheduling, which feeds micro-batches from both ends of the pipeline simultaneously, and a significant portion of communications can be fully overlapped. This overlap also ensures that, as the model further scales up, as long as we maintain a constant computation-to-communication ratio, we can still employ fine-grained experts across nodes while achieving a near-zero all-to-all communication overhead.
[Figure 5 | Example DualPipe scheduling for 8 PP ranks and 20 micro-batches in two directions. The micro-batches in the reverse direction are symmetric to those in the forward direction, so we omit their batch IDs for illustration simplicity. Two cells enclosed by a shared black border have mutually overlapped computation and communication.]

| Method          | Bubble                               | Parameter | Activation |
| 1F1B            | $(PP-1)(F+B)$                        | $1\times$ | $PP$       |
| ZB1P            | $(PP-1)(F+B-2W)$                     | $1\times$ | $PP$       |
| DualPipe (Ours) | $(\frac{PP}{2}-1)(F{\&}B + B - 3W)$  | $2\times$ | $PP+1$     |

Table 2 | Comparison of pipeline bubbles and memory usage across different pipeline parallel methods. $F$ denotes the execution time of a forward chunk, $B$ denotes the execution time of a full backward chunk, $W$ denotes the execution time of a "backward for weights" chunk, and $F{\&}B$ denotes the execution time of two mutually overlapped forward and backward chunks.
In addition, even in more general scenarios without a heavy communication burden, DualPipe still exhibits efficiency advantages. In Table 2, we summarize the pipeline bubbles and memory usage across different PP methods. As shown in the table, compared with ZB1P (Qi et al., 2023b) and 1F1B (Harlap et al., 2018), DualPipe significantly reduces the pipeline bubbles while only increasing the peak activation memory by $\frac{1}{PP}$ times. Although DualPipe requires keeping two copies of the model parameters, this does not significantly increase the memory consumption since we use a large EP size during training. Compared with Chimera (Li and Hoefler, 2021), DualPipe only requires that the pipeline stages and micro-batches be divisible by 2, without requiring micro-batches to be divisible by pipeline stages. In addition, for DualPipe, neither the bubbles nor activation memory will increase as the number of micro-batches grows.
3.2.2. Efficient Implementation of Cross-Node All-to-All Communication

In order to ensure sufficient computational performance for DualPipe, we customize efficient cross-node all-to-all communication kernels (including dispatching and combining) to conserve the number of SMs dedicated to communication. The implementation of the kernels is co-designed with the MoE gating algorithm and the network topology of our cluster. To be specific, in our cluster, cross-node GPUs are fully interconnected with IB, and intra-node communications are handled via NVLink. NVLink offers a bandwidth of 160 GB/s, roughly 3.2 times that of IB (50 GB/s). To effectively leverage the different bandwidths of IB and NVLink, we limit each token to be dispatched to at most 4 nodes, thereby reducing IB traffic. For each token, when its routing decision is made, it will first be transmitted via IB to the GPUs with the same in-node index on its target nodes. Once it reaches the target nodes, we will endeavor to ensure that it is instantaneously forwarded via NVLink to the specific GPUs that host its target experts, without being blocked by subsequently arriving tokens. In this way, communications via IB and NVLink are fully overlapped, and each token can efficiently select an average of 3.2 experts per node without incurring additional overhead from NVLink. This implies that, although DeepSeek-V3 selects only 8 routed experts in practice, it can scale up this number to a maximum of 13 experts (4 nodes x 3.2 experts/node) while preserving the same communication cost. Overall, under such a communication strategy, only 20 SMs are sufficient to fully utilize the bandwidths of IB and NVLink.
In detail, we employ the warp specialization technique (Bauer et al., 2014) and partition 20 SMs into 10 communication channels. During the dispatching process, (1) IB sending, (2) IB-to-NVLink forwarding, and (3) NVLink receiving are handled by respective warps. The number of warps allocated to each communication task is dynamically adjusted according to the actual workload across all SMs. Similarly, during the combining process, (1) NVLink sending, (2) NVLink-to-IB forwarding and accumulation, and (3) IB receiving and accumulation are also handled by dynamically adjusted warps. In addition, both dispatching and combining kernels overlap with the computation stream, so we also consider their impact on other SM computation kernels. Specifically, we employ customized PTX (Parallel Thread Execution) instructions and auto-tune the communication chunk size, which significantly reduces the use of the L2 cache and the interference to other SMs.
3.2.3. Extremely Memory Saving with Minimal Overhead

In order to reduce the memory footprint during training, we employ the following techniques.

Recomputation of RMSNorm and MLA Up-Projection. We recompute all RMSNorm operations and MLA up-projections during back-propagation, thereby eliminating the need to persistently store their output activations. With a minor overhead, this strategy significantly reduces memory requirements for storing activations.
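A minimal way to express this trade-off, assuming standard PyTorch activation checkpointing rather than the custom HAI-LLM machinery, is to wrap the cheap operations so their outputs are re-materialized in the backward pass:

```python
import torch
from torch.utils.checkpoint import checkpoint

def rms_norm(x, weight, eps=1e-6):
    return weight * x * torch.rsqrt(x.pow(2).mean(-1, keepdim=True) + eps)

# The output of rms_norm_recomputed is not stored for the backward pass; it is
# recomputed from (x, weight) when gradients are needed. The same wrapping applies
# to the MLA up-projections.
def rms_norm_recomputed(x, weight):
    return checkpoint(rms_norm, x, weight, use_reentrant=False)
```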
Exponential Moving Average in CPU. During training, we preserve the Exponential Moving Average (EMA) of the model parameters for early estimation of the model performance after learning rate decay. The EMA parameters are stored in CPU memory and are updated asynchronously after each training step. This method allows us to maintain EMA parameters without incurring additional memory or time overhead.
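The sketch below illustrates the idea under simple assumptions (a blocking device-to-host copy instead of a truly asynchronous one on a side stream; all names are ours):

```python
import torch

class CPUEMA:
    """Keep an EMA of parameters in CPU memory, updated after each optimizer step."""
    def __init__(self, model, decay=0.999):
        self.decay = decay
        self.shadow = {name: p.detach().float().cpu().clone()
                       for name, p in model.named_parameters()}

    @torch.no_grad()
    def update(self, model):
        # Called after the optimizer step; with pinned buffers and a separate CUDA
        # stream, the copies can overlap with the next step's compute.
        for name, p in model.named_parameters():
            self.shadow[name].mul_(self.decay).add_(p.detach().float().cpu(),
                                                    alpha=1.0 - self.decay)
```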
Shared Embedding and Output Head for Multi-Token Prediction. With the DualPipe strategy, we deploy the shallowest layers (including the embedding layer) and deepest layers (including the output head) of the model on the same PP rank. This arrangement enables the physical sharing of parameters and gradients, of the shared embedding and output head, between the MTP module and the main model. This physical sharing mechanism further enhances our memory efficiency.

3.3. FP8 Training

[Figure 6 | The overall mixed precision framework with FP8 data format. For clarification, only the Linear operator is illustrated. The Fprop, Dgrad, and Wgrad GEMMs consume inputs, weights, and output gradients cast to FP8 and produce BF16 or FP32 outputs, while the master weights and optimizer states are kept in higher precision.]

Inspired by recent advances in low-precision training (Dettmers et al., 2022; Noune et al., 2022; Peng et al., 2023b), we propose a fine-grained mixed precision framework utilizing the FP8 data format for training DeepSeek-V3. While low-precision training holds great promise, it is often limited by the presence of outliers in activations, weights, and gradients (Fishman et al., 2024; He et al.; Sun et al., 2024). Although significant progress has been made in inference quantization (Frantar et al., 2022; Xiao et al., 2023), there are relatively few studies demonstrating successful application of low-precision techniques in large-scale language model pre-training (Fishman et al., 2024).
To address this challenge and effectively extend the dynamic range of the FP8 format, we introduce a fine-grained quantization strategy: tile-wise grouping with $1 \times N_C$ elements or block-wise grouping with $N_C \times N_C$ elements. The associated dequantization overhead is largely mitigated under our increased-precision accumulation process, a critical aspect for achieving accurate FP8 General Matrix Multiplication (GEMM). Moreover, to further reduce memory and communication overhead in MoE training, we cache and dispatch activations in FP8, while storing low-precision optimizer states in BF16. We validate the proposed FP8 mixed precision framework on two model scales similar to DeepSeek-V2-Lite and DeepSeek-V2, training for approximately 1 trillion tokens (see more details in Appendix B.1). Notably, compared with the BF16 baseline, the relative loss error of our FP8-training model remains consistently below 0.25%, a level well within the acceptable range of training randomness.

3.3.1. Mixed Precision Framework
Building upon widely adopted techniques in low-precision training (Kalamkar et al., 2019; Narang et al., 2017), we propose a mixed precision framework for FP8 training. In this framework, most compute-density operations are conducted in FP8, while a few key operations are strategically maintained in their original data formats to balance training efficiency and numerical stability. The overall framework is illustrated in Figure 6.

Firstly, in order to accelerate model training, the majority of core computation kernels, i.e., GEMM operations, are implemented in FP8 precision. These GEMM operations accept FP8 tensors as inputs and produce outputs in BF16 or FP32. As depicted in Figure 6, all three GEMMs associated with the Linear operator, namely Fprop (forward pass), Dgrad (activation backward pass), and Wgrad (weight backward pass), are executed in FP8. This design theoretically doubles the computational speed compared with the original BF16 method. Additionally, the FP8 Wgrad GEMM allows activations to be stored in FP8 for use in the backward pass. This significantly reduces memory consumption.

Despite the efficiency advantage of the FP8 format, certain operators still require a higher precision due to their sensitivity to low-precision computations. Besides, some low-cost operators can also utilize a higher precision with a negligible overhead to the overall training cost. For this reason, after careful investigations, we maintain the original precision (e.g., BF16 or FP32) for the following components: the embedding module, the output head, MoE gating modules, normalization operators, and attention operators. These targeted retentions of high precision ensure stable training dynamics for DeepSeek-V3. To further guarantee numerical stability, we store the master weights, weight gradients, and optimizer states in higher precision. While these high-precision components incur some memory overheads, their impact can be minimized through efficient sharding across multiple DP ranks in our distributed training system.
[Figure 7 | (a) We propose a fine-grained quantization method to mitigate quantization errors caused by feature outliers; for illustration simplicity, only Fprop is illustrated. (b) In conjunction with our quantization strategy, we improve the FP8 GEMM precision by promoting to CUDA Cores at an interval of $N_C = 128$ elements MMA for the high-precision accumulation.]

3.3.2. Improved Precision from Quantization and Multiplication
Based on our mixed precision FP8 framework, we introduce several strategies to enhance low-precision training accuracy, focusing on both the quantization method and the multiplication process.

Fine-Grained Quantization. In low-precision training frameworks, overflows and underflows are common challenges due to the limited dynamic range of the FP8 format, which is constrained by its reduced exponent bits. As a standard practice, the input distribution is aligned to the representable range of the FP8 format by scaling the maximum absolute value of the input tensor to the maximum representable value of FP8 (Narang et al., 2017). This method makes low-precision training highly sensitive to activation outliers, which can heavily degrade quantization accuracy. To solve this, we propose a fine-grained quantization method that applies scaling at a more granular level. As illustrated in Figure 7 (a), (1) for activations, we group and scale elements on a 1x128 tile basis (i.e., per token per 128 channels); and (2) for weights, we group and scale elements on a 128x128 block basis (i.e., per 128 input channels per 128 output channels). This approach ensures that the quantization process can better accommodate outliers by adapting the scale according to smaller groups of elements. In Appendix B.2, we further discuss the training instability when we group and scale activations on a block basis in the same way as weights quantization.

One key modification in our method is the introduction of per-group scaling factors along the inner dimension of GEMM operations. This functionality is not directly supported in the standard FP8 GEMM. However, combined with our precise FP32 accumulation strategy, it can be efficiently implemented.
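The sketch below shows the grouping and scale computation under stated assumptions (PyTorch's float8_e4m3fn dtype as the storage format and shapes divisible by 128); the production path fuses these steps into the FP8 GEMM kernels rather than materializing them as separate tensors:

```python
import torch

FP8_E4M3_MAX = 448.0  # maximum magnitude representable by E4M3

def quantize_activation_tile_wise(x, tile=128):
    """Per-(1 x 128) tile scaling for activations: one scale per token per 128 channels."""
    t, c = x.shape
    xg = x.view(t, c // tile, tile)
    scale = xg.abs().amax(dim=-1, keepdim=True).clamp(min=1e-4) / FP8_E4M3_MAX
    xq = (xg / scale).to(torch.float8_e4m3fn)          # quantized payload
    return xq.view(t, c), scale.squeeze(-1)            # scales kept for later dequantization

def quantize_weight_block_wise(w, block=128):
    """Per-(128 x 128) block scaling for weights."""
    o, i = w.shape
    wg = w.view(o // block, block, i // block, block)
    scale = wg.abs().amax(dim=(1, 3), keepdim=True).clamp(min=1e-4) / FP8_E4M3_MAX
    wq = (wg / scale).to(torch.float8_e4m3fn)
    return wq.view(o, i), scale.squeeze(1).squeeze(-1)
```

Because the per-tile maxima are computed from the current tensor, the same helpers also express the online scale derivation described later in this section.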
Notably, our fine-grained quantization strategy is highly consistent with the idea of microscaling formats (Rouhani et al., 2023b), while the Tensor Cores of NVIDIA next-generation GPUs (Blackwell series) have announced the support for microscaling formats with smaller quantization granularity (NVIDIA, 2024a). We hope our design can serve as a reference for future work to keep pace with the latest GPU architectures.

Increasing Accumulation Precision. Low-precision GEMM operations often suffer from underflow issues, and their accuracy largely depends on high-precision accumulation, which is commonly performed in FP32 precision (Kalamkar et al., 2019; Narang et al., 2017). However, we observe that the accumulation precision of FP8 GEMM on NVIDIA H800 GPUs is limited to retaining around 14 bits, which is significantly lower than FP32 accumulation precision. This problem will become more pronounced when the inner dimension $K$ is large (Wortsman et al., 2023), a typical scenario in large-scale model training where the batch size and model width are increased. Taking GEMM operations of two random matrices with $K = 4096$ for example, in our preliminary test, the limited accumulation precision in Tensor Cores results in a maximum relative error of nearly 2%. Despite these problems, the limited accumulation precision is still the default option in a few FP8 frameworks (NVIDIA, 2024b), severely constraining the training accuracy.

In order to address this issue, we adopt the strategy of promotion to CUDA Cores for higher precision (Thakkar et al., 2023). The process is illustrated in Figure 7 (b). To be specific, during MMA (Matrix Multiply-Accumulate) execution on Tensor Cores, intermediate results are accumulated using the limited bit width. Once an interval of $N_C$ is reached, these partial results will be copied to FP32 registers on CUDA Cores, where full-precision FP32 accumulation is performed. As mentioned before, our fine-grained quantization applies per-group scaling factors along the inner dimension $K$. These scaling factors can be efficiently multiplied on the CUDA Cores as the dequantization process with minimal additional computational cost.

It is worth noting that this modification reduces the WGMMA (Warpgroup-level Matrix Multiply-Accumulate) instruction issue rate for a single warpgroup. However, on the H800 architecture, it is typical for two WGMMA to persist concurrently: while one warpgroup performs the promotion operation, the other is able to execute the MMA operation. This design enables overlapping of the two operations, maintaining high utilization of Tensor Cores. Based on our experiments, setting $N_C = 128$ elements, equivalent to 4 WGMMAs, represents the minimal accumulation interval that can significantly improve precision without introducing substantial overhead.
Mantissa over Exponents. In contrast to the hybrid FP8 format adopted by prior work (NVIDIA, 2024b; Peng et al., 2023b; Sun et al., 2019b), which uses E4M3 (4-bit exponent and 3-bit mantissa) in Fprop and E5M2 (5-bit exponent and 2-bit mantissa) in Dgrad and Wgrad, we adopt the E4M3 format on all tensors for higher precision. We attribute the feasibility of this approach to our fine-grained quantization strategy, i.e., tile- and block-wise scaling. By operating on smaller element groups, our methodology effectively shares exponent bits among these grouped elements, mitigating the impact of the limited dynamic range.
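For reference, the trade-off between the two FP8 encodings can be inspected directly in a PyTorch build that exposes the FP8 dtypes (an illustrative check, not part of the training framework):

```python
import torch

# E4M3 keeps one more mantissa bit than E5M2 at the cost of dynamic range
# (max magnitude roughly 448 vs. 57344); per-group scaling compensates for the range.
for dtype in (torch.float8_e4m3fn, torch.float8_e5m2):
    print(dtype, "max representable:", torch.finfo(dtype).max)
```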
146、ximum absolute17values across prior iterations to infer the current value.In order to ensure accurate scales andsimplify the framework,we calculate the maximum absolute value online for each1x128acti-vation tile or128x128weight block.Based on it,we derive the scaling factor and then quantizethe acti
147、vation or weight online into the FP8 format.3.3.3.Low-Precision Storage and CommunicationIn conjunction with our FP8 training framework,we further reduce the memory consumptionand communication overhead by compressing cached activations and optimizer states intolower-precision formats.Low-Precision
Low-Precision Optimizer States. We adopt the BF16 data format instead of FP32 to track the first and second moments in the AdamW (Loshchilov and Hutter, 2017) optimizer, without incurring observable performance degradation. However, the master weights (stored by the optimizer) and gradients (used for batch size accumulation) are still retained in FP32 to ensure numerical stability throughout training.

Low-Precision Activation. As illustrated in Figure 6, the Wgrad operation is performed in FP8. To reduce the memory consumption, it is a natural choice to cache activations in FP8 format for the backward pass of the Linear operator. However, special considerations are taken on several operators for low-cost high-precision training:

(1) Inputs of the Linear after the attention operator. These activations are also used in the backward pass of the attention operator, which makes it sensitive to precision. We adopt a customized E5M6 data format exclusively for these activations. Additionally, these activations will be converted from a 1x128 quantization tile to a 128x1 tile in the backward pass. To avoid introducing extra quantization error, all the scaling factors are round scaled, i.e., integral power of 2.

(2) Inputs of the SwiGLU operator in MoE. To further reduce the memory cost, we cache the inputs of the SwiGLU operator and recompute its output in the backward pass. These activations are also stored in FP8 with our fine-grained quantization method, striking a balance between memory efficiency and computational accuracy.
Low-Precision Communication. Communication bandwidth is a critical bottleneck in the training of MoE models. To alleviate this challenge, we quantize the activation before MoE up-projections into FP8 and then apply dispatch components, which is compatible with FP8 Fprop in MoE up-projections. Like the inputs of the Linear after the attention operator, scaling factors for this activation are integral powers of 2. A similar strategy is applied to the activation gradient before MoE down-projections. For both the forward and backward combine components, we retain them in BF16 to preserve training precision in critical parts of the training pipeline.

3.4. Inference and Deployment

We deploy DeepSeek-V3 on the H800 cluster, where GPUs within each node are interconnected using NVLink, and all GPUs across the cluster are fully interconnected via IB. To simultaneously ensure both the Service-Level Objective (SLO) for online services and high throughput, we employ the following deployment strategy that separates the prefilling and decoding stages.

3.4.1. Prefilling
The minimum deployment unit of the prefilling stage consists of 4 nodes with 32 GPUs. The attention part employs 4-way Tensor Parallelism (TP4) with Sequence Parallelism (SP), combined with 8-way Data Parallelism (DP8). Its small TP size of 4 limits the overhead of TP communication. For the MoE part, we use 32-way Expert Parallelism (EP32), which ensures that each expert processes a sufficiently large batch size, thereby enhancing computational efficiency. For the MoE all-to-all communication, we use the same method as in training: first transferring tokens across nodes via IB, and then forwarding among the intra-node GPUs via NVLink. In particular, we use 1-way Tensor Parallelism for the dense MLPs in shallow layers to save TP communication.

To achieve load balancing among different experts in the MoE part, we need to ensure that each GPU processes approximately the same number of tokens. To this end, we introduce a deployment strategy of redundant experts, which duplicates high-load experts and deploys them redundantly. The high-load experts are detected based on statistics collected during the online deployment and are adjusted periodically (e.g., every 10 minutes). After determining the set of redundant experts, we carefully rearrange experts among GPUs within a node based on the observed loads, striving to balance the load across GPUs as much as possible without increasing the cross-node all-to-all communication overhead. For the deployment of DeepSeek-V3, we set 32 redundant experts for the prefilling stage. For each GPU, besides the original 8 experts it hosts, it will also host one additional redundant expert.

Furthermore, in the prefilling stage, to improve the throughput and hide the overhead of all-to-all and TP communication, we simultaneously process two micro-batches with similar computational workloads, overlapping the attention and MoE of one micro-batch with the dispatch and combine of another.

Finally, we are exploring a dynamic redundancy strategy for experts, where each GPU hosts more experts (e.g., 16 experts), but only 9 will be activated during each inference step. Before the all-to-all operation at each layer begins, we compute the globally optimal routing scheme on the fly. Given the substantial computation involved in the prefilling stage, the overhead of computing this routing scheme is almost negligible.
3.4.2. Decoding

During decoding, we treat the shared expert as a routed one. From this perspective, each token will select 9 experts during routing, where the shared expert is regarded as a heavy-load one that will always be selected. The minimum deployment unit of the decoding stage consists of 40 nodes with 320 GPUs. The attention part employs TP4 with SP, combined with DP80, while the MoE part uses EP320. For the MoE part, each GPU hosts only one expert, and 64 GPUs are responsible for hosting redundant experts and shared experts. All-to-all communication of the dispatch and combine parts is performed via direct point-to-point transfers over IB to achieve low latency. Additionally, we leverage the IBGDA (NVIDIA, 2022) technology to further minimize latency and enhance communication efficiency.

Similar to prefilling, we periodically determine the set of redundant experts in a certain interval, based on the statistical expert load from our online service. However, we do not need to rearrange experts since each GPU only hosts one expert. We are also exploring the dynamic redundancy strategy for decoding. However, this requires more careful optimization of the algorithm that computes the globally optimal routing scheme and the fusion with the dispatch kernel to reduce overhead.

Additionally, to enhance throughput and hide the overhead of all-to-all communication, we are also exploring processing two micro-batches with similar computational workloads simultaneously in the decoding stage. Unlike prefilling, attention consumes a larger portion of time in the decoding stage. Therefore, we overlap the attention of one micro-batch with the dispatch + MoE + combine of another. In the decoding stage, the batch size per expert is relatively small (usually within 256 tokens), and the bottleneck is memory access rather than computation. Since the MoE part only needs to load the parameters of one expert, the memory access overhead is minimal, so using fewer SMs will not significantly affect the overall performance. Therefore, to avoid impacting the computation speed of the attention part, we can allocate only a small portion of SMs to dispatch + MoE + combine.
3.5. Suggestions on Hardware Design

Based on our implementation of the all-to-all communication and FP8 training scheme, we propose the following suggestions on chip design to AI hardware vendors.

3.5.1. Communication Hardware

In DeepSeek-V3, we implement the overlap between computation and communication to hide the communication latency during computation. This significantly reduces the dependency on communication bandwidth compared to serial computation and communication. However, the current communication implementation relies on expensive SMs (e.g., we allocate 20 out of the 132 SMs available in the H800 GPU for this purpose), which will limit the computational throughput. Moreover, using SMs for communication results in significant inefficiencies, as tensor cores remain entirely under-utilized.

Currently, the SMs primarily perform the following tasks for all-to-all communication:
- Forwarding data between the IB (InfiniBand) and NVLink domain while aggregating IB traffic destined for multiple GPUs within the same node from a single GPU.
- Transporting data between RDMA buffers (registered GPU memory regions) and input/output buffers.
- Executing reduce operations for all-to-all combine.
- Managing fine-grained memory layout during chunked data transferring to multiple experts across the IB and NVLink domain.

We aspire to see future vendors developing hardware that offloads these communication tasks from the valuable computation unit SM, serving as a GPU co-processor or a network co-processor like NVIDIA SHARP (Graham et al., 2016). Furthermore, to reduce application programming complexity, we aim for this hardware to unify the IB (scale-out) and NVLink (scale-up) networks from the perspective of the computation units. With this unified interface, computation units can easily accomplish operations such as read, write, multicast, and reduce across the entire IB-NVLink-unified domain via submitting communication requests based on simple primitives.

3.5.2. Compute Hardware
Higher FP8 GEMM Accumulation Precision in Tensor Cores. In the current Tensor Core implementation of the NVIDIA Hopper architecture, FP8 GEMM (General Matrix Multiply) employs fixed-point accumulation, aligning the mantissa products by right-shifting based on the maximum exponent before addition. Our experiments reveal that it only uses the highest 14 bits of each mantissa product after sign-fill right shifting, and truncates bits exceeding this range. However, for example, to achieve precise FP32 results from the accumulation of 32 FP8xFP8 multiplications, at least 34-bit precision is required. Thus, we recommend that future chip designs increase accumulation precision in Tensor Cores to support full-precision accumulation, or select an appropriate accumulation bit-width according to the accuracy requirements of training and inference algorithms. This approach ensures that errors remain within acceptable bounds while maintaining computational efficiency.

Support for Tile- and Block-Wise Quantization. Current GPUs only support per-tensor quantization, lacking the native support for fine-grained quantization like our tile- and block-wise quantization. In the current implementation, when the $N_C$ interval is reached, the partial results will be copied from Tensor Cores to CUDA Cores, multiplied by the scaling factors, and added to FP32 registers on CUDA Cores. Although the dequantization overhead is significantly mitigated combined with our precise FP32 accumulation strategy, the frequent data movements between Tensor Cores and CUDA Cores still limit the computational efficiency. Therefore, we recommend future chips to support fine-grained quantization by enabling Tensor Cores to receive scaling factors and implement MMA with group scaling. In this way, the whole partial sum accumulation and dequantization can be completed directly inside Tensor Cores until the final result is produced, avoiding frequent data movements.
184、 effectively supportonline quantization,despite its effectiveness demonstrated in our research.In the existingprocess,we need to read 128 BF16 activation values(the output of the previous computation)from HBM(High Bandwidth Memory)for quantization,and the quantized FP8 values arethen written back to
185、 HBM,only to be read again for MMA.To address this inefficiency,werecommend that future chips integrate FP8 cast and TMA(Tensor Memory Accelerator)accessinto a single fused operation,so quantization can be completed during the transfer of activationsfrom global memory to shared memory,avoiding frequ
186、ent memory reads and writes.We alsorecommend supporting a warp-level cast instruction for speedup,which further facilitates thebetter fusion of layer normalization and FP8 cast.Alternatively,a near-memory computingapproach can be adopted,where compute logic is placed near the HBM.In this case,BF16el
187、ements can be cast to FP8 directly as they are read from HBM into the GPU,reducing off-chipmemory access by roughly 50%.Support for Transposed GEMM Operations.The current architecture makes it cumbersometo fuse matrix transposition with GEMM operations.In our workflow,activations during theforward p
Support for Transposed GEMM Operations. The current architecture makes it cumbersome to fuse matrix transposition with GEMM operations. In our workflow, activations during the forward pass are quantized into 1x128 FP8 tiles and stored. During the backward pass, the matrix needs to be read out, dequantized, transposed, re-quantized into 128x1 tiles, and stored in HBM. To reduce memory operations, we recommend that future chips enable direct transposed reads of matrices from shared memory before the MMA operation, for those precisions required in both training and inference. Combined with the fusion of FP8 format conversion and TMA access, this enhancement will significantly streamline the quantization workflow.
4. Pre-Training

4.1. Data Construction

Compared with DeepSeek-V2, we optimize the pre-training corpus by enhancing the ratio of mathematical and programming samples, while expanding multilingual coverage beyond English and Chinese. Also, our data processing pipeline is refined to minimize redundancy while maintaining corpus diversity. Inspired by Ding et al. (2024), we implement the document packing method for data integrity but do not incorporate cross-sample attention masking during training. Finally, the training corpus for DeepSeek-V3 consists of 14.8T high-quality and diverse tokens in our tokenizer.

In the training process of DeepSeekCoder-V2 (DeepSeek-AI, 2024a), we observe that the Fill-in-Middle (FIM) strategy does not compromise the next-token prediction capability while enabling the model to accurately predict middle text based on contextual cues. In alignment with DeepSeekCoder-V2, we also incorporate the FIM strategy in the pre-training of DeepSeek-V3. To be specific, we employ the Prefix-Suffix-Middle (PSM) framework to structure data as follows:

<|fim_begin|> f_pre <|fim_hole|> f_suf <|fim_end|> f_middle <|eos_token|>.

This structure is applied at the document level as a part of the pre-packing process. The FIM strategy is applied at a rate of 0.1, consistent with the PSM framework.
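A sketch of how such a document-level FIM transformation could be applied during data preparation is shown below. The special-token strings follow the PSM template above, while the random cut-point heuristic, the function name, and the omission of the end-of-sequence token are illustrative assumptions rather than the actual pipeline.

```python
import random

FIM_RATE = 0.1  # rate reported above

def apply_fim_psm(doc: str, rng: random.Random, fim_rate: float = FIM_RATE) -> str:
    """With probability fim_rate, rearrange a document into PSM order:
    <|fim_begin|> prefix <|fim_hole|> suffix <|fim_end|> middle."""
    if rng.random() >= fim_rate or len(doc) < 3:
        return doc                                  # most documents stay as plain next-token data
    i, j = sorted(rng.sample(range(1, len(doc)), 2))  # two random cut points (assumed heuristic)
    pre, middle, suf = doc[:i], doc[i:j], doc[j:]
    return f"<|fim_begin|>{pre}<|fim_hole|>{suf}<|fim_end|>{middle}"

print(apply_fim_psm("def add(a, b):\n    return a + b\n", random.Random(0), fim_rate=1.0))
```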
The tokenizer for DeepSeek-V3 employs Byte-level BPE (Shibata et al., 1999) with an extended vocabulary of 128K tokens. The pretokenizer and training data for our tokenizer are modified to optimize multilingual compression efficiency. In addition, compared with DeepSeek-V2, the new pretokenizer introduces tokens that combine punctuations and line breaks. However, this trick may introduce the token boundary bias (Lundberg, 2023) when the model processes multi-line prompts without terminal line breaks, particularly for few-shot evaluation prompts. To address this issue, we randomly split a certain proportion of such combined tokens during training, which exposes the model to a wider array of special cases and mitigates this bias.
4.2. Hyper-Parameters

Model Hyper-Parameters. We set the number of Transformer layers to 61 and the hidden dimension to 7168. All learnable parameters are randomly initialized with a standard deviation of 0.006. In MLA, we set the number of attention heads to 128 and the per-head dimension to 128. The KV compression dimension is set to 512, and the query compression dimension is set to 1536. For the decoupled queries and key, we set the per-head dimension to 64. We substitute all FFNs except for the first three layers with MoE layers. Each MoE layer consists of 1 shared expert and 256 routed experts, where the intermediate hidden dimension of each expert is 2048. Among the routed experts, 8 experts will be activated for each token, and each token will be ensured to be sent to at most 4 nodes. The multi-token prediction depth is set to 1, i.e., besides the exact next token, each token will predict one additional token. As in DeepSeek-V2, DeepSeek-V3 also employs additional RMSNorm layers after the compressed latent vectors, and multiplies additional scaling factors at the width bottlenecks. Under this configuration, DeepSeek-V3 comprises 671B total parameters, of which 37B are activated for each token.
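As a rough sanity check of these totals, the routed experts alone account for most of the parameter budget. The sketch below uses only the configuration stated above (61 layers, the first three FFNs dense, 1 shared plus 256 routed experts per MoE layer, hidden size 7168, expert intermediate size 2048) and assumes the standard gated-FFN layout with three projection matrices per expert; everything else (MLA attention, embeddings, the dense FFN layers, the MTP module) is treated as a residual shared by both counts.

```python
# Back-of-the-envelope parameter accounting for the MoE configuration above.
hidden, expert_inter = 7168, 2048
moe_layers = 61 - 3                          # all FFNs except the first three layers are MoE
expert_params = 3 * hidden * expert_inter    # gate/up/down projections of one expert (assumed layout)

routed_total   = 256 * expert_params * moe_layers   # parameters held by all routed experts
routed_active  = 8   * expert_params * moe_layers   # routed parameters touched per token
shared_experts = 1   * expert_params * moe_layers   # always-active shared experts

print(f"routed experts, total : {routed_total / 1e9:.0f}B")   # ~654B
print(f"routed experts, active: {routed_active / 1e9:.0f}B")  # ~20B
print(f"shared experts        : {shared_experts / 1e9:.0f}B") # ~3B
# The remaining ~15B of non-expert parameters is counted in both the 671B total
# and the 37B activated figures, which is roughly how the stated split arises.
```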
Training Hyper-Parameters. We employ the AdamW optimizer (Loshchilov and Hutter, 2017) with hyper-parameters set to β1 = 0.9, β2 = 0.95, and weight_decay = 0.1. We set the maximum sequence length to 4K during pre-training, and pre-train DeepSeek-V3 on 14.8T tokens. As for the learning rate scheduling, we first linearly increase it from 0 to 2.2 × 10⁻⁴ during the first 2K steps. Then, we keep a constant learning rate of 2.2 × 10⁻⁴ until the model consumes 10T training tokens. Subsequently, we gradually decay the learning rate to 2.2 × 10⁻⁵ over 4.3T tokens, following a cosine decay curve. During the training of the final 500B tokens, we keep a constant learning rate of 2.2 × 10⁻⁵ for the first 333B tokens, and switch to another constant learning rate of 7.3 × 10⁻⁶ for the remaining 167B tokens. The gradient clipping norm is set to 1.0. We employ a batch size scheduling strategy, where the batch size is gradually increased from 3072 to 15360 over the training of the first 469B tokens, and then kept at 15360 for the remaining training. We leverage pipeline parallelism to deploy different layers of a model on different GPUs, and for each layer, the routed experts are uniformly deployed on 64 GPUs belonging to 8 nodes. As for the node-limited routing, each token will be sent to at most 4 nodes (i.e., M = 4). For auxiliary-loss-free load balancing, we set the bias update speed γ to 0.001 for the first 14.3T tokens, and to 0.0 for the remaining 500B tokens. For the balance loss, we set α to 0.0001, just to avoid extreme imbalance within any single sequence. The MTP loss weight λ is set to 0.3 for the first 10T tokens, and to 0.1 for the remaining 4.8T tokens.
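The learning-rate schedule above can be summarized as a small piecewise function of the training step and the number of tokens consumed. The sketch below simply transcribes the values from the text; the function itself is illustrative, not the training code.

```python
import math

WARMUP_STEPS  = 2_000
PEAK_LR       = 2.2e-4
CONST_TOKENS  = 10.0e12          # constant at the peak LR until 10T tokens
DECAY_TOKENS  = 4.3e12           # cosine decay over the next 4.3T tokens
FLOOR_LR      = 2.2e-5
FINAL_STAGE_1 = 0.333e12         # first 333B of the final 500B tokens
FINAL_LR      = 7.3e-6           # remaining 167B tokens

def learning_rate(step: int, tokens_consumed: float) -> float:
    if step < WARMUP_STEPS:                                # linear warmup from 0
        return PEAK_LR * step / WARMUP_STEPS
    if tokens_consumed <= CONST_TOKENS:                    # constant phase
        return PEAK_LR
    if tokens_consumed <= CONST_TOKENS + DECAY_TOKENS:     # cosine decay 2.2e-4 -> 2.2e-5
        progress = (tokens_consumed - CONST_TOKENS) / DECAY_TOKENS
        return FLOOR_LR + 0.5 * (PEAK_LR - FLOOR_LR) * (1 + math.cos(math.pi * progress))
    if tokens_consumed <= CONST_TOKENS + DECAY_TOKENS + FINAL_STAGE_1:
        return FLOOR_LR                                    # constant 2.2e-5 for 333B tokens
    return FINAL_LR                                        # constant 7.3e-6 for the last 167B
```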
Figure 8 | Evaluation results on the "Needle In A Haystack" (NIAH) tests. DeepSeek-V3 performs well across all context window lengths up to 128K.

4.3. Long Context Extension

We adopt a similar approach to DeepSeek-V2 (DeepSeek-AI, 2024c) to enable long context capabilities in DeepSeek-V3. After the pre-training stage, we apply YaRN (Peng et al., 2023a) for context extension and perform two additional training phases, each comprising 1000 steps, to progressively expand the context window from 4K to 32K and then to 128K. The YaRN configuration is consistent with that used in DeepSeek-V2, being applied exclusively to the decoupled shared key. The hyper-parameters remain identical across both phases, with the scale s = 40, α = 1, β = 32, and the scaling factor √t = 0.1 ln s + 1. In the first phase, the sequence length is set to 32K, and the batch size is 1920. During the second phase, the sequence length is increased to 128K, and the batch size is reduced to 480. The learning rate for both phases is set to 7.3 × 10⁻⁶, matching the final learning rate from the pre-training stage.
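A simplified sketch of a YaRN-style adjustment with these hyper-parameters is given below: low-frequency rotary dimensions are interpolated by the scale s, high-frequency ones are left unchanged, with a linear ramp between α and β rotations, and the attention scaling factor follows √t = 0.1 ln s + 1. The rotary base of 10,000 and the exact placement of the scaling factor are assumptions borrowed from common open-source implementations, not details stated here.

```python
import math

def yarn_frequencies(dim: int = 64, base: float = 10000.0, orig_len: int = 4096,
                     scale: float = 40.0, alpha: float = 1.0, beta: float = 32.0):
    """Simplified YaRN-style per-dimension frequency adjustment (assumed rotary base)."""
    freqs = [base ** (-2 * i / dim) for i in range(dim // 2)]
    adjusted = []
    for f in freqs:
        rotations = orig_len * f / (2 * math.pi)   # periods completed inside the original window
        if rotations <= alpha:                      # low-frequency dims: fully interpolate
            gamma = 0.0
        elif rotations >= beta:                     # high-frequency dims: leave unchanged
            gamma = 1.0
        else:                                       # linear ramp in between
            gamma = (rotations - alpha) / (beta - alpha)
        adjusted.append((1 - gamma) * f / scale + gamma * f)
    mscale = 0.1 * math.log(scale) + 1.0            # sqrt(t) = 0.1 ln s + 1 from the text
    return adjusted, mscale

_, m = yarn_frequencies()
print(f"attention scaling factor sqrt(t) = {m:.3f}")  # ~1.369 for s = 40
```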
Through this two-phase extension training, DeepSeek-V3 is capable of handling inputs up to 128K in length while maintaining strong performance. Figure 8 illustrates that DeepSeek-V3, following supervised fine-tuning, achieves notable performance on the Needle In A Haystack (NIAH) test, demonstrating consistent robustness across context window lengths up to 128K.

4.4. Evaluations

4.4.1. Evaluation Benchmarks

The base model of DeepSeek-V3 is pretrained on a multilingual corpus with English and Chinese constituting the majority, so we evaluate its performance on a series of benchmarks primarily in English and Chinese, as well as on a multilingual benchmark. Our evaluation is based on our internal evaluation framework integrated in our HAI-LLM framework. Considered benchmarks are categorized and listed as follows, where underlined benchmarks are in Chinese and double-underlined benchmarks are multilingual ones:

- Multi-subject multiple-choice datasets include MMLU (Hendrycks et al., 2020), MMLU-Redux (Gema et al., 2024), MMLU-Pro (Wang et al., 2024b), MMMLU (OpenAI, 2024b), C-Eval (Huang et al., 2023), and CMMLU (Li et al., 2023).
- Language understanding and reasoning datasets include HellaSwag (Zellers et al., 2019), PIQA (Bisk et al., 2020), ARC (Clark et al., 2018), and BigBench Hard (BBH) (Suzgun et al., 2022).
- Closed-book question answering datasets include TriviaQA (Joshi et al., 2017) and NaturalQuestions (Kwiatkowski et al., 2019).
- Reading comprehension datasets include RACE (Lai et al., 2017), DROP (Dua et al., 2019), C3 (Sun et al., 2019a), and CMRC (Cui et al., 2019).
- Reference disambiguation datasets include CLUEWSC (Xu et al., 2020) and WinoGrande (Sakaguchi et al., 2019).
- Language modeling datasets include Pile (Gao et al., 2020).
- Chinese understanding and culture datasets include CCPM (Li et al., 2021).
- Math datasets include GSM8K (Cobbe et al., 2021), MATH (Hendrycks et al., 2021), MGSM (Shi et al., 2023), and CMath (Wei et al., 2023).
- Code datasets include HumanEval (Chen et al., 2021), LiveCodeBench-Base (0801-1101) (Jain et al., 2024), MBPP (Austin et al., 2021), and CRUXEval (Gu et al., 2024).
- Standardized exams include AGIEval (Zhong et al., 2023). Note that AGIEval includes both English and Chinese subsets.
Following our previous work (DeepSeek-AI, 2024b,c), we adopt perplexity-based evaluation for datasets including HellaSwag, PIQA, WinoGrande, RACE-Middle, RACE-High, MMLU, MMLU-Redux, MMLU-Pro, MMMLU, ARC-Easy, ARC-Challenge, C-Eval, CMMLU, C3, and CCPM, and adopt generation-based evaluation for TriviaQA, NaturalQuestions, DROP, MATH, GSM8K, MGSM, HumanEval, MBPP, LiveCodeBench-Base, CRUXEval, BBH, AGIEval, CLUEWSC, CMRC, and CMath. In addition, we perform language-modeling-based evaluation for Pile-test and use Bits-Per-Byte (BPB) as the metric to guarantee fair comparison among models using different tokenizers.
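For reference, BPB normalizes the model's total negative log-likelihood by the byte length of the text rather than by the tokenizer-dependent token count. A minimal helper is sketched below with purely illustrative numbers.

```python
import math

def bits_per_byte(total_nll_nats: float, total_bytes: int) -> float:
    """Bits-Per-Byte: tokenizer-independent language-modeling metric.
    total_nll_nats: summed negative log-likelihood (natural log) over the corpus.
    total_bytes:    size of the corpus in UTF-8 bytes."""
    return total_nll_nats / (math.log(2) * total_bytes)

# e.g. an average NLL of 0.80 nats/token with 3.2 bytes/token (illustrative values):
print(bits_per_byte(total_nll_nats=0.80 * 1_000_000, total_bytes=3_200_000))  # ~0.361
```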
Benchmark (Metric) | # Shots | DeepSeek-V2 Base | Qwen2.5 72B Base | LLaMA-3.1 405B Base | DeepSeek-V3 Base
Architecture | - | MoE | Dense | Dense | MoE
# Activated Params | - | 21B | 72B | 405B | 37B
# Total Params | - | 236B | 72B | 405B | 671B
English:
Pile-test (BPB) | - | 0.606 | 0.638 | 0.542 | 0.548
BBH (EM) | 3-shot | 78.8 | 79.8 | 82.9 | 87.5
MMLU (EM) | 5-shot | 78.4 | 85.0 | 84.4 | 87.1
MMLU-Redux (EM) | 5-shot | 75.6 | 83.2 | 81.3 | 86.2
MMLU-Pro (EM) | 5-shot | 51.4 | 58.3 | 52.8 | 64.4
DROP (F1) | 3-shot | 80.4 | 80.6 | 86.0 | 89.0
ARC-Easy (EM) | 25-shot | 97.6 | 98.4 | 98.4 | 98.9
ARC-Challenge (EM) | 25-shot | 92.2 | 94.5 | 95.3 | 95.3
HellaSwag (EM) | 10-shot | 87.1 | 84.8 | 89.2 | 88.9
PIQA (EM) | 0-shot | 83.9 | 82.6 | 85.9 | 84.7
WinoGrande (EM) | 5-shot | 86.3 | 82.3 | 85.2 | 84.9
RACE-Middle (EM) | 5-shot | 73.1 | 68.1 | 74.2 | 67.1
RACE-High (EM) | 5-shot | 52.6 | 50.3 | 56.8 | 51.3
TriviaQA (EM) | 5-shot | 80.0 | 71.9 | 82.7 | 82.9
NaturalQuestions (EM) | 5-shot | 38.6 | 33.2 | 41.5 | 40.0
AGIEval (EM) | 0-shot | 57.5 | 75.8 | 60.6 | 79.6
Code:
HumanEval (Pass@1) | 0-shot | 43.3 | 53.0 | 54.9 | 65.2
MBPP (Pass@1) | 3-shot | 65.0 | 72.6 | 68.4 | 75.4
LiveCodeBench-Base (Pass@1) | 3-shot | 11.6 | 12.9 | 15.5 | 19.4
CRUXEval-I (EM) | 2-shot | 52.5 | 59.1 | 58.5 | 67.3
CRUXEval-O (EM) | 2-shot | 49.8 | 59.9 | 59.9 | 69.8
Math:
GSM8K (EM) | 8-shot | 81.6 | 88.3 | 83.5 | 89.3
MATH (EM) | 4-shot | 43.4 | 54.4 | 49.0 | 61.6
MGSM (EM) | 8-shot | 63.6 | 76.2 | 69.9 | 79.8
CMath (EM) | 3-shot | 78.7 | 84.5 | 77.3 | 90.7
Chinese:
CLUEWSC (EM) | 5-shot | 82.0 | 82.5 | 83.0 | 82.7
C-Eval (EM) | 5-shot | 81.4 | 89.2 | 72.5 | 90.1
CMMLU (EM) | 5-shot | 84.0 | 89.5 | 73.7 | 88.8
CMRC (EM) | 1-shot | 77.4 | 75.8 | 76.0 | 76.3
C3 (EM) | 0-shot | 77.4 | 76.7 | 79.7 | 78.6
CCPM (EM) | 0-shot | 93.0 | 88.5 | 78.6 | 92.0
Multilingual:
MMMLU-non-English (EM) | 5-shot | 64.0 | 74.8 | 73.8 | 79.4

Table 3 | Comparison among DeepSeek-V3-Base and other representative open-source base models. All models are evaluated in our internal framework and share the same evaluation setting. Scores with a gap not exceeding 0.3 are considered to be at the same level. DeepSeek-V3-Base achieves the best performance on most benchmarks, especially on math and code tasks.
4.4.2. Evaluation Results

In Table 3, we compare the base model of DeepSeek-V3 with the state-of-the-art open-source base models, including DeepSeek-V2-Base (DeepSeek-AI, 2024c) (our previous release), Qwen2.5 72B Base (Qwen, 2024b), and LLaMA-3.1 405B Base (AI@Meta, 2024b). We evaluate all these models with our internal evaluation framework, and ensure that they share the same evaluation setting. Note that due to the changes in our evaluation framework over the past months, the performance of DeepSeek-V2-Base exhibits a slight difference from our previously reported results. Overall, DeepSeek-V3-Base comprehensively outperforms DeepSeek-V2-Base and Qwen2.5 72B Base, and surpasses LLaMA-3.1 405B Base on the majority of benchmarks, essentially becoming the strongest open-source model.

From a more detailed perspective, we compare DeepSeek-V3-Base with the other open-source base models individually. (1) Compared with DeepSeek-V2-Base, due to the improvements in our model architecture, the scale-up of the model size and training tokens, and the enhancement of data quality, DeepSeek-V3-Base achieves significantly better performance as expected. (2) Compared with Qwen2.5 72B Base, the state-of-the-art Chinese open-source model, with only half of the activated parameters, DeepSeek-V3-Base also demonstrates remarkable advantages, especially on English, multilingual, code, and math benchmarks. As for Chinese benchmarks, except for CMMLU, a Chinese multi-subject multiple-choice task, DeepSeek-V3-Base also shows better performance than Qwen2.5 72B. (3) Compared with LLaMA-3.1 405B Base, the largest open-source model with 11 times the activated parameters, DeepSeek-V3-Base also exhibits much better performance on multilingual, code, and math benchmarks. As for English and Chinese language benchmarks, DeepSeek-V3-Base shows competitive or better performance, and is especially good on BBH, the MMLU series, DROP, C-Eval, CMMLU, and CCPM.

Due to our efficient architectures and comprehensive engineering optimizations, DeepSeek-V3 achieves extremely high training efficiency. Under our training framework and infrastructures, training DeepSeek-V3 on each trillion tokens requires only 180K H800 GPU hours, which is much cheaper than training 72B or 405B dense models.
Benchmark (Metric) | # Shots | Small MoE Baseline | Small MoE w/ MTP | Large MoE Baseline | Large MoE w/ MTP
# Activated Params (Inference) | - | 2.4B | 2.4B | 20.9B | 20.9B
# Total Params (Inference) | - | 15.7B | 15.7B | 228.7B | 228.7B
# Training Tokens | - | 1.33T | 1.33T | 540B | 540B
Pile-test (BPB) | - | 0.729 | 0.729 | 0.658 | 0.657
BBH (EM) | 3-shot | 39.0 | 41.4 | 70.0 | 70.7
MMLU (EM) | 5-shot | 50.0 | 53.3 | 67.5 | 66.6
DROP (F1) | 1-shot | 39.2 | 41.3 | 68.5 | 70.6
TriviaQA (EM) | 5-shot | 56.9 | 57.7 | 67.0 | 67.3
NaturalQuestions (EM) | 5-shot | 22.7 | 22.3 | 27.2 | 28.5
HumanEval (Pass@1) | 0-shot | 20.7 | 26.8 | 44.5 | 53.7
MBPP (Pass@1) | 3-shot | 35.8 | 36.8 | 61.6 | 62.2
GSM8K (EM) | 8-shot | 25.4 | 31.4 | 72.3 | 74.0
MATH (EM) | 4-shot | 10.7 | 12.6 | 38.6 | 39.8

Table 4 | Ablation results for the MTP strategy. The MTP strategy consistently enhances the model performance on most of the evaluation benchmarks.
4.5. Discussion

4.5.1. Ablation Studies for Multi-Token Prediction

In Table 4, we show the ablation results for the MTP strategy. To be specific, we validate the MTP strategy on top of two baseline models across different scales. At the small scale, we train a baseline MoE model comprising 15.7B total parameters on 1.33T tokens. At the large scale, we train a baseline MoE model comprising 228.7B total parameters on 540B tokens. On top of them, keeping the training data and the other architectures the same, we append a 1-depth MTP module onto them and train two models with the MTP strategy for comparison. Note that during inference, we directly discard the MTP module, so the inference costs of the compared models are exactly the same. From the table, we can observe that the MTP strategy consistently enhances the model performance on most of the evaluation benchmarks.
Benchmark (Metric) | # Shots | Small MoE Aux-Loss-Based | Small MoE Aux-Loss-Free | Large MoE Aux-Loss-Based | Large MoE Aux-Loss-Free
# Activated Params | - | 2.4B | 2.4B | 20.9B | 20.9B
# Total Params | - | 15.7B | 15.7B | 228.7B | 228.7B
# Training Tokens | - | 1.33T | 1.33T | 578B | 578B
Pile-test (BPB) | - | 0.727 | 0.724 | 0.656 | 0.652
BBH (EM) | 3-shot | 37.3 | 39.3 | 66.7 | 67.9
MMLU (EM) | 5-shot | 51.0 | 51.8 | 68.3 | 67.2
DROP (F1) | 1-shot | 38.1 | 39.0 | 67.1 | 67.1
TriviaQA (EM) | 5-shot | 58.3 | 58.5 | 66.7 | 67.7
NaturalQuestions (EM) | 5-shot | 23.2 | 23.4 | 27.1 | 28.1
HumanEval (Pass@1) | 0-shot | 22.0 | 22.6 | 40.2 | 46.3
MBPP (Pass@1) | 3-shot | 36.6 | 35.8 | 59.2 | 61.2
GSM8K (EM) | 8-shot | 27.1 | 29.6 | 70.7 | 74.5
MATH (EM) | 4-shot | 10.9 | 11.1 | 37.2 | 39.6

Table 5 | Ablation results for the auxiliary-loss-free balancing strategy. Compared with the purely auxiliary-loss-based method, the auxiliary-loss-free strategy consistently achieves better model performance on most of the evaluation benchmarks.
4.5.2. Ablation Studies for the Auxiliary-Loss-Free Balancing Strategy

In Table 5, we show the ablation results for the auxiliary-loss-free balancing strategy. We validate this strategy on top of two baseline models across different scales. At the small scale, we train a baseline MoE model comprising 15.7B total parameters on 1.33T tokens. At the large scale, we train a baseline MoE model comprising 228.7B total parameters on 578B tokens. Both of the baseline models purely use auxiliary losses to encourage load balance, and use the sigmoid gating function with top-K affinity normalization. Their hyper-parameters to control the strength of auxiliary losses are the same as DeepSeek-V2-Lite and DeepSeek-V2, respectively. On top of these two baseline models, keeping the training data and the other architectures the same, we remove all auxiliary losses and introduce the auxiliary-loss-free balancing strategy for comparison. From the table, we can observe that the auxiliary-loss-free strategy consistently achieves better model performance on most of the evaluation benchmarks.
4.5.3. Batch-Wise Load Balance VS. Sequence-Wise Load Balance

The key distinction between auxiliary-loss-free balancing and sequence-wise auxiliary loss lies in their balancing scope: batch-wise versus sequence-wise. Compared with the sequence-wise auxiliary loss, batch-wise balancing imposes a more flexible constraint, as it does not enforce in-domain balance on each sequence. This flexibility allows experts to better specialize in different domains. To validate this, we record and analyze the expert load of a 16B auxiliary-loss-based baseline and a 16B auxiliary-loss-free model on different domains in the Pile test set. As illustrated in Figure 9, we observe that the auxiliary-loss-free model demonstrates greater expert specialization patterns as expected.
Figure 9 | Expert load of auxiliary-loss-free and auxiliary-loss-based models on three domains in the Pile test set (Wikipedia (en), Github, and DM Mathematics; layers 9 and 18 shown). The auxiliary-loss-free model shows greater expert specialization patterns than the auxiliary-loss-based one. The relative expert load denotes the ratio between the actual expert load and the theoretically balanced expert load. Due to space constraints, we only present the results of two layers as an example, with the results of all layers provided in Appendix C.

To further investigate the correlation between this flexibility and the advantage in model performance, we additionally design and validate a batch-wise auxiliary loss that encourages load balance on each training batch instead of on each sequence. The experimental results show that, when achieving a similar level of batch-wise load balance, the batch-wise auxiliary loss can also achieve similar model performance to the auxiliary-loss-free method. To be specific, in our experiments with 1B MoE models, the validation losses are: 2.258 (using a sequence-wise auxiliary loss), 2.253 (using the auxiliary-loss-free method), and 2.253 (using a batch-wise auxiliary loss). We also observe similar results on 3B MoE models: the model using a sequence-wise auxiliary loss achieves a validation loss of 2.085, and the models using the auxiliary-loss-free method or a batch-wise auxiliary loss achieve the same validation loss of 2.080.
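The difference in balancing scope can be made concrete with a small PyTorch proxy: sequence-wise balancing penalizes non-uniform expert load inside every individual sequence, whereas batch-wise balancing only constrains the load averaged over the whole batch. The variance-based penalty below is a simplified stand-in, not the exact auxiliary-loss formula used in training.

```python
import torch

def balance_penalty(gate_probs: torch.Tensor, per_sequence: bool) -> torch.Tensor:
    """gate_probs: (batch, seq_len, n_experts) routing probabilities per token.
    A simplified variance-based proxy for the two balancing scopes."""
    if per_sequence:
        load = gate_probs.mean(dim=1)        # expert load inside each sequence: (batch, n_experts)
        return load.var(dim=-1).mean()       # penalizes imbalance within every sequence
    load = gate_probs.mean(dim=(0, 1))       # expert load over the whole batch: (n_experts,)
    return load.var()                        # only the batch-level load is constrained

probs = torch.softmax(torch.randn(8, 512, 64), dim=-1)
print(balance_penalty(probs, per_sequence=True), balance_penalty(probs, per_sequence=False))
```

Under the batch-wise scope, individual sequences may route most tokens to a few domain-specialized experts without incurring any penalty, which is the flexibility discussed above.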
253、 or small batches,and(2)domain-shift-induced load imbalance during infer-ence.The first challenge is naturally addressed by our training framework that uses large-scaleexpert parallelism and data parallelism,which guarantees a large size of each micro-batch.Forthe second challenge,we also design and
254、 implement an efficient inference framework withredundant expert deployment,as described in Section 3.4,to overcome it.5.Post-Training5.1.Supervised Fine-TuningWe curate our instruction-tuning datasets to include 1.5M instances spanning multiple domains,with each domain employing distinct data creat
5. Post-Training

5.1. Supervised Fine-Tuning

We curate our instruction-tuning datasets to include 1.5M instances spanning multiple domains, with each domain employing distinct data creation methods tailored to its specific requirements.

Reasoning Data. For reasoning-related datasets, including those focused on mathematics, code competition problems, and logic puzzles, we generate the data by leveraging an internal DeepSeek-R1 model. Specifically, while the R1-generated data demonstrates strong accuracy, it suffers from issues such as overthinking, poor formatting, and excessive length. Our objective is to balance the high accuracy of R1-generated reasoning data and the clarity and conciseness of regularly formatted reasoning data.

To establish our methodology, we begin by developing an expert model tailored to a specific domain, such as code, mathematics, or general reasoning, using a combined Supervised Fine-Tuning (SFT) and Reinforcement Learning (RL) training pipeline. This expert model serves as a data generator for the final model. The training process involves generating two distinct types of SFT samples for each instance: the first couples the problem with its original response in the format of <problem, original response>, while the second incorporates a system prompt alongside the problem and the R1 response in the format of <system prompt, problem, R1 response>. The system prompt is meticulously designed to include instructions that guide the model toward producing responses enriched with mechanisms for reflection and verification. During the RL phase, the model leverages high-temperature sampling to generate responses that integrate patterns from both the R1-generated and original data, even in the absence of explicit system prompts. After hundreds of RL steps, the intermediate RL model learns to incorporate R1 patterns, thereby enhancing overall performance strategically.

Upon completing the RL training phase, we implement rejection sampling to curate high-quality SFT data for the final model, where the expert models are used as data generation sources. This method ensures that the final training data retains the strengths of DeepSeek-R1 while producing responses that are concise and effective.
Non-Reasoning Data. For non-reasoning data, such as creative writing, role-play, and simple question answering, we utilize DeepSeek-V2.5 to generate responses and enlist human annotators to verify the accuracy and correctness of the data.

SFT Settings. We fine-tune DeepSeek-V3-Base for two epochs using the SFT dataset, using the cosine decay learning rate scheduling that starts at 5 × 10⁻⁶ and gradually decreases to 1 × 10⁻⁶. During training, each single sequence is packed from multiple samples. However, we adopt a sample masking strategy to ensure that these examples remain isolated and mutually invisible.
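The sample masking used with packing can be pictured as a block-diagonal causal attention mask, as in the sketch below. This is only an illustration of the idea; the actual implementation presumably operates on attention metadata rather than dense masks.

```python
import numpy as np

def packed_attention_mask(sample_lengths: list[int]) -> np.ndarray:
    """Block-diagonal causal mask for a sequence packed from several SFT samples,
    so tokens attend only within their own sample."""
    total = sum(sample_lengths)
    mask = np.zeros((total, total), dtype=bool)
    start = 0
    for n in sample_lengths:
        mask[start:start + n, start:start + n] = np.tril(np.ones((n, n), dtype=bool))
        start += n
    return mask

print(packed_attention_mask([3, 2]).astype(int))  # two packed samples of lengths 3 and 2
```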
5.2. Reinforcement Learning

5.2.1. Reward Model

We employ a rule-based Reward Model (RM) and a model-based RM in our RL process.

Rule-Based RM. For questions that can be validated using specific rules, we adopt a rule-based reward system to determine the feedback. For instance, certain math problems have deterministic results, and we require the model to provide the final answer within a designated format (e.g., in a box), allowing us to apply rules to verify the correctness. Similarly, for LeetCode problems, we can utilize a compiler to generate feedback based on test cases. By leveraging rule-based validation wherever possible, we ensure a higher level of reliability, as this approach is resistant to manipulation or exploitation.
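A minimal sketch of such a rule-based check for boxed math answers is shown below. The \boxed{...} convention and the string normalization are assumptions for illustration; the production system presumably handles many more answer formats and equivalences.

```python
import re

def boxed_answer_reward(response: str, ground_truth: str) -> float:
    """Rule-based reward for math questions with deterministic answers:
    extract the final boxed answer and compare it with the reference."""
    matches = re.findall(r"\\boxed\{([^{}]*)\}", response)
    if not matches:
        return 0.0                         # no answer in the required format
    predicted = matches[-1].strip().replace(" ", "")
    return 1.0 if predicted == ground_truth.strip().replace(" ", "") else 0.0

print(boxed_answer_reward(r"... so the result is \boxed{42}.", "42"))  # 1.0
```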
Model-Based RM. For questions with free-form ground-truth answers, we rely on the reward model to determine whether the response matches the expected ground-truth. Conversely, for questions without a definitive ground-truth, such as those involving creative writing, the reward model is tasked with providing feedback based on the question and the corresponding answer as inputs. The reward model is trained from the DeepSeek-V3 SFT checkpoints. To enhance its reliability, we construct preference data that not only provides the final reward but also includes the chain-of-thought leading to the reward. This approach helps mitigate the risk of reward hacking in specific tasks.
5.2.2. Group Relative Policy Optimization

Similar to DeepSeek-V2 (DeepSeek-AI, 2024c), we adopt Group Relative Policy Optimization (GRPO) (Shao et al., 2024), which foregoes the critic model that is typically of the same size as the policy model, and estimates the baseline from group scores instead. Specifically, for each question $q$, GRPO samples a group of outputs $\{o_1, o_2, \cdots, o_G\}$ from the old policy model $\pi_{\theta_{old}}$ and then optimizes the policy model $\pi_{\theta}$ by maximizing the following objective:

\[
\mathcal{J}_{GRPO}(\theta)=\mathbb{E}\big[q \sim P(Q),\,\{o_i\}_{i=1}^{G} \sim \pi_{\theta_{old}}(O\mid q)\big]\;
\frac{1}{G}\sum_{i=1}^{G}\left(\min\!\left(\frac{\pi_{\theta}(o_i\mid q)}{\pi_{\theta_{old}}(o_i\mid q)}A_i,\;
\mathrm{clip}\!\left(\frac{\pi_{\theta}(o_i\mid q)}{\pi_{\theta_{old}}(o_i\mid q)},\,1-\varepsilon,\,1+\varepsilon\right)A_i\right)
-\beta\,\mathbb{D}_{KL}\!\left(\pi_{\theta}\,\|\,\pi_{ref}\right)\right), \tag{26}
\]

\[
\mathbb{D}_{KL}\!\left(\pi_{\theta}\,\|\,\pi_{ref}\right)=\frac{\pi_{ref}(o_i\mid q)}{\pi_{\theta}(o_i\mid q)}-\log\frac{\pi_{ref}(o_i\mid q)}{\pi_{\theta}(o_i\mid q)}-1, \tag{27}
\]

where $\varepsilon$ and $\beta$ are hyper-parameters; $\pi_{ref}$ is the reference model; and $A_i$ is the advantage, derived from the rewards $\{r_1, r_2, \ldots, r_G\}$ corresponding to the outputs within each group:

\[
A_i=\frac{r_i-\mathrm{mean}(\{r_1,r_2,\cdots,r_G\})}{\mathrm{std}(\{r_1,r_2,\cdots,r_G\})}. \tag{28}
\]
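The group-relative advantage of Equation (28) reduces to a simple per-group standardization of the rewards, as sketched below. The zero-variance guard is an added assumption, and whether a population or sample standard deviation is used is not specified here.

```python
import statistics

def group_relative_advantages(rewards: list[float]) -> list[float]:
    """Group-relative advantages as in Eq. (28): normalize each sampled output's
    reward by the mean and standard deviation of its group."""
    mu = statistics.mean(rewards)
    sigma = statistics.pstdev(rewards) or 1.0   # guard against a zero-variance group (assumption)
    return [(r - mu) / sigma for r in rewards]

print(group_relative_advantages([1.0, 0.0, 0.0, 1.0, 1.0]))
```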
We incorporate prompts from diverse domains, such as coding, math, writing, role-playing, and question answering, during the RL process. This approach not only aligns the model more closely with human preferences but also enhances performance on benchmarks, especially in scenarios where available SFT data are limited.

5.3. Evaluations

5.3.1. Evaluation Settings

Evaluation Benchmarks. Apart from the benchmarks we used for base model testing, we further evaluate instructed models on IFEval (Zhou et al., 2023), FRAMES (Krishna et al., 2024), LongBench v2 (Bai et al., 2024), GPQA (Rein et al., 2023), SimpleQA (OpenAI, 2024c), C-SimpleQA (He et al., 2024), SWE-Bench Verified (OpenAI, 2024d), Aider (https://aider.chat), LiveCodeBench (Jain et al., 2024) (questions from August 2024 to November 2024), Codeforces, Chinese National High School Mathematics Olympiad (CNMO 2024), and American Invitational Mathematics Examination 2024 (AIME 2024) (MAA, 2024).

Compared Baselines. We conduct comprehensive evaluations of our chat model against several strong baselines, including DeepSeek-V2-0506, DeepSeek-V2.5-0905, Qwen2.5 72B Instruct, LLaMA-3.1 405B Instruct, Claude-Sonnet-3.5-1022, and GPT-4o-0513. For the DeepSeek-V2 model series, we select the most representative variants for comparison. For closed-source models, evaluations are performed through their respective APIs.
Evaluation Configurations. For standard benchmarks including MMLU, DROP, GPQA, and SimpleQA, we adopt the evaluation prompts from the simple-evals framework. We utilize the Zero-Eval prompt format (Lin, 2024) for MMLU-Redux in a zero-shot setting. For other datasets, we follow their original evaluation protocols with default prompts as provided by the dataset creators. For code and math benchmarks, the HumanEval-Mul dataset includes 8 mainstream programming languages (Python, Java, Cpp, C#, JavaScript, TypeScript, PHP, and Bash) in total. We use CoT and non-CoT methods to evaluate model performance on LiveCodeBench, where the data are collected from August 2024 to November 2024. The Codeforces dataset is measured using the percentage of competitors. SWE-Bench Verified is evaluated using the agentless framework (Xia et al., 2024). We use the "diff" format to evaluate the Aider-related benchmarks. For mathematical assessments, AIME and CNMO 2024 are evaluated with a temperature of 0.7, and the results are averaged over 16 runs, while MATH-500 employs greedy decoding. We allow all models to output a maximum of 8192 tokens for each benchmark.
Benchmark (Metric) | DeepSeek-V2-0506 | DeepSeek-V2.5-0905 | Qwen2.5 72B-Inst. | LLaMA-3.1 405B-Inst. | Claude-3.5-Sonnet-1022 | GPT-4o-0513 | DeepSeek-V3
Architecture | MoE | MoE | Dense | Dense | - | - | MoE
# Activated Params | 21B | 21B | 72B | 405B | - | - | 37B
# Total Params | 236B | 236B | 72B | 405B | - | - | 671B
English:
MMLU (EM) | 78.2 | 80.6 | 85.3 | 88.6 | 88.3 | 87.2 | 88.5
MMLU-Redux (EM) | 77.9 | 80.3 | 85.6 | 86.2 | 88.9 | 88.0 | 89.1
MMLU-Pro (EM) | 58.5 | 66.2 | 71.6 | 73.3 | 78.0 | 72.6 | 75.9
DROP (3-shot F1) | 83.0 | 87.8 | 76.7 | 88.7 | 88.3 | 83.7 | 91.6
IF-Eval (Prompt Strict) | 57.7 | 80.6 | 84.1 | 86.0 | 86.5 | 84.3 | 86.1
GPQA-Diamond (Pass@1) | 35.3 | 41.3 | 49.0 | 51.1 | 65.0 | 49.9 | 59.1
SimpleQA (Correct) | 9.0 | 10.2 | 9.1 | 17.1 | 28.4 | 38.2 | 24.9
FRAMES (Acc.) | 66.9 | 65.4 | 69.8 | 70.0 | 72.5 | 80.5 | 73.3
LongBench v2 (Acc.) | 31.6 | 35.4 | 39.4 | 36.1 | 41.0 | 48.1 | 48.7
Code:
HumanEval-Mul (Pass@1) | 69.3 | 77.4 | 77.3 | 77.2 | 81.7 | 80.5 | 82.6
LiveCodeBench (Pass@1-COT) | 18.8 | 29.2 | 31.1 | 28.4 | 36.3 | 33.4 | 40.5
LiveCodeBench (Pass@1) | 20.3 | 28.4 | 28.7 | 30.1 | 32.8 | 34.2 | 37.6
Codeforces (Percentile) | 17.5 | 35.6 | 24.8 | 25.3 | 20.3 | 23.6 | 51.6
SWE Verified (Resolved) | - | 22.6 | 23.8 | 24.5 | 50.8 | 38.8 | 42.0
Aider-Edit (Acc.) | 60.3 | 71.6 | 65.4 | 63.9 | 84.2 | 72.9 | 79.7
Aider-Polyglot (Acc.) | - | 18.2 | 7.6 | 5.8 | 45.3 | 16.0 | 49.6
Math:
AIME 2024 (Pass@1) | 4.6 | 16.7 | 23.3 | 23.3 | 16.0 | 9.3 | 39.2
MATH-500 (EM) | 56.3 | 74.7 | 80.0 | 73.8 | 78.3 | 74.6 | 90.2
CNMO 2024 (Pass@1) | 2.8 | 10.8 | 15.9 | 6.8 | 13.1 | 10.8 | 43.2
Chinese:
CLUEWSC (EM) | 89.9 | 90.4 | 91.4 | 84.7 | 85.4 | 87.9 | 90.9
C-Eval (EM) | 78.6 | 79.5 | 86.1 | 61.5 | 76.7 | 76.0 | 86.5
C-SimpleQA (Correct) | 48.5 | 54.1 | 48.4 | 50.4 | 51.3 | 59.3 | 64.8

Table 6 | Comparison between DeepSeek-V3 and other representative chat models. All models are evaluated in a configuration that limits the output length to 8K. Benchmarks containing fewer than 1000 samples are tested multiple times using varying temperature settings to derive robust final results. DeepSeek-V3 stands as the best-performing open-source model, and also exhibits competitive performance against frontier closed-source models.
5.3.2. Standard Evaluation

Table 6 presents the evaluation results, showcasing that DeepSeek-V3 stands as the best-performing open-source model. Additionally, it is competitive against frontier closed-source models like GPT-4o and Claude-3.5-Sonnet.

English Benchmarks. MMLU is a widely recognized benchmark designed to assess the performance of large language models across diverse knowledge domains and tasks. DeepSeek-V3 demonstrates competitive performance, standing on par with top-tier models such as LLaMA-3.1-405B, GPT-4o, and Claude-Sonnet 3.5, while significantly outperforming Qwen2.5 72B. Moreover, DeepSeek-V3 excels in MMLU-Pro, a more challenging educational knowledge benchmark, where it closely trails Claude-Sonnet 3.5. On MMLU-Redux, a refined version of MMLU with corrected labels, DeepSeek-V3 surpasses its peers. In addition, on GPQA-Diamond, a PhD-level evaluation testbed, DeepSeek-V3 achieves remarkable results, ranking just behind Claude 3.5 Sonnet and outperforming all other competitors by a substantial margin.

In long-context understanding benchmarks such as DROP, LongBench v2, and FRAMES, DeepSeek-V3 continues to demonstrate its position as a top-tier model. It achieves an impressive 91.6 F1 score in the 3-shot setting on DROP, outperforming all other models in this category. On FRAMES, a benchmark requiring question-answering over 100k token contexts, DeepSeek-V3 closely trails GPT-4o while outperforming all other models by a significant margin. This demonstrates the strong capability of DeepSeek-V3 in handling extremely long-context tasks. The long-context capability of DeepSeek-V3 is further validated by its best-in-class performance on LongBench v2, a dataset that was released just a few weeks before the launch of DeepSeek-V3.

On the factual knowledge benchmark SimpleQA, DeepSeek-V3 falls behind GPT-4o and Claude-Sonnet, primarily due to its design focus and resource allocation. DeepSeek-V3 assigns more training tokens to learn Chinese knowledge, leading to exceptional performance on C-SimpleQA. On the instruction-following benchmark, DeepSeek-V3 significantly outperforms its predecessor, the DeepSeek-V2 series, highlighting its improved ability to understand and adhere to user-defined format constraints.
Code and Math Benchmarks. Coding is a challenging and practical task for LLMs, encompassing engineering-focused tasks like SWE-Bench-Verified and Aider, as well as algorithmic tasks such as HumanEval and LiveCodeBench. In engineering tasks, DeepSeek-V3 trails behind Claude-Sonnet-3.5-1022 but significantly outperforms open-source models. The open-source DeepSeek-V3 is expected to foster advancements in coding-related engineering tasks. By providing access to its robust capabilities, DeepSeek-V3 can drive innovation and improvement in areas such as software engineering and algorithm development, empowering developers and researchers to push the boundaries of what open-source models can achieve in coding tasks. In algorithmic tasks, DeepSeek-V3 demonstrates superior performance, outperforming all baselines on benchmarks like HumanEval-Mul and LiveCodeBench. This success can be attributed to its advanced knowledge distillation technique, which effectively enhances its code generation and problem-solving capabilities in algorithm-focused tasks.

On math benchmarks, DeepSeek-V3 demonstrates exceptional performance, significantly surpassing baselines and setting a new state-of-the-art for non-o1-like models. Specifically, on AIME, MATH-500, and CNMO 2024, DeepSeek-V3 outperforms the second-best model, Qwen2.5 72B, by approximately 10% in absolute scores, which is a substantial margin for such challenging benchmarks. This remarkable capability highlights the effectiveness of the distillation technique from DeepSeek-R1, which has been proven highly beneficial for non-o1-like models.
Model | Arena-Hard | AlpacaEval 2.0
DeepSeek-V2.5-0905 | 76.2 | 50.5
Qwen2.5-72B-Instruct | 81.2 | 49.1
LLaMA-3.1 405B | 69.3 | 40.5
GPT-4o-0513 | 80.4 | 51.1
Claude-Sonnet-3.5-1022 | 85.2 | 52.0
DeepSeek-V3 | 85.5 | 70.0

Table 7 | English open-ended conversation evaluations. For AlpacaEval 2.0, we use the length-controlled win rate as the metric.

Chinese Benchmarks. Qwen and DeepSeek are two representative model series with robust support for both Chinese and English. On the factual benchmark Chinese SimpleQA, DeepSeek-V3 surpasses Qwen2.5-72B by 16.4 points, despite Qwen2.5 being trained on a larger corpus comprising 18T tokens, which are 20% more than the 14.8T tokens that DeepSeek-V3 is pre-trained on.

On C-Eval, a representative benchmark for Chinese educational knowledge evaluation, and CLUEWSC (Chinese Winograd Schema Challenge), DeepSeek-V3 and Qwen2.5-72B exhibit similar performance levels, indicating that both models are well-optimized for challenging Chinese-language reasoning and educational tasks.
5.3.3. Open-Ended Evaluation

In addition to standard benchmarks, we also evaluate our models on open-ended generation tasks using LLMs as judges, with the results shown in Table 7. Specifically, we adhere to the original configurations of AlpacaEval 2.0 (Dubois et al., 2024) and Arena-Hard (Li et al., 2024a), which leverage GPT-4-Turbo-1106 as judges for pairwise comparisons. On Arena-Hard, DeepSeek-V3 achieves an impressive win rate of over 86% against the baseline GPT-4-0314, performing on par with top-tier models like Claude-Sonnet-3.5-1022. This underscores the robust capabilities of DeepSeek-V3, especially in dealing with complex prompts, including coding and debugging tasks. Furthermore, DeepSeek-V3 achieves a groundbreaking milestone as the first open-source model to surpass 85% on the Arena-Hard benchmark. This achievement significantly bridges the performance gap between open-source and closed-source models, setting a new standard for what open-source models can accomplish in challenging domains.

Similarly, DeepSeek-V3 showcases exceptional performance on AlpacaEval 2.0, outperforming both closed-source and open-source models. This demonstrates its outstanding proficiency in writing tasks and handling straightforward question-answering scenarios. Notably, it surpasses DeepSeek-V2.5-0905 by a significant margin of 20%, highlighting substantial improvements in tackling simple tasks and showcasing the effectiveness of its advancements.
5.3.4. DeepSeek-V3 as a Generative Reward Model

We compare the judgment ability of DeepSeek-V3 with state-of-the-art models, namely GPT-4o and Claude-3.5. Table 8 presents the performance of these models on RewardBench (Lambert et al., 2024). DeepSeek-V3 achieves performance on par with the best versions of GPT-4o-0806 and Claude-3.5-Sonnet-1022, while surpassing other versions. Additionally, the judgment ability of DeepSeek-V3 can also be enhanced by the voting technique. Therefore, we employ DeepSeek-V3 along with voting to offer self-feedback on open-ended questions, thereby improving the effectiveness and robustness of the alignment process.

Model | Chat | Chat-Hard | Safety | Reasoning | Average
GPT-4o-0513 | 96.6 | 70.4 | 86.7 | 84.9 | 84.7
GPT-4o-0806 | 96.1 | 76.1 | 88.1 | 86.6 | 86.7
GPT-4o-1120 | 95.8 | 71.3 | 86.2 | 85.2 | 84.6
Claude-3.5-sonnet-0620 | 96.4 | 74.0 | 81.6 | 84.7 | 84.2
Claude-3.5-sonnet-1022 | 96.4 | 79.7 | 91.1 | 87.6 | 88.7
DeepSeek-V3 | 96.9 | 79.8 | 87.0 | 84.3 | 87.0
DeepSeek-V3 (maj@6) | 96.9 | 82.6 | 89.5 | 89.2 | 89.6

Table 8 | Performances of GPT-4o, Claude-3.5-sonnet and DeepSeek-V3 on RewardBench.

Model | LiveCodeBench-CoT Pass@1 | Length | MATH-500 Pass@1 | Length
DeepSeek-V2.5 Baseline | 31.1 | 718 | 74.6 | 769
DeepSeek-V2.5 + R1 Distill | 37.4 | 783 | 83.2 | 1510

Table 9 | The contribution of distillation from DeepSeek-R1. The evaluation settings of LiveCodeBench and MATH-500 are the same as in Table 6.
5.4. Discussion

5.4.1. Distillation from DeepSeek-R1

We ablate the contribution of distillation from DeepSeek-R1 based on DeepSeek-V2.5. The baseline is trained on short CoT data, whereas its competitor uses data generated by the expert checkpoints described above.

Table 9 demonstrates the effectiveness of the distillation data, showing significant improvements on both the LiveCodeBench and MATH-500 benchmarks. Our experiments reveal an interesting trade-off: the distillation leads to better performance but also substantially increases the average response length. To maintain a balance between model accuracy and computational efficiency, we carefully selected optimal settings for DeepSeek-V3 in distillation.

Our research suggests that knowledge distillation from reasoning models presents a promising direction for post-training optimization. While our current work focuses on distilling data from the mathematics and coding domains, this approach shows potential for broader applications across various task domains. The effectiveness demonstrated in these specific areas indicates that long-CoT distillation could be valuable for enhancing model performance in other cognitive tasks requiring complex reasoning. Further exploration of this approach across different domains remains an important direction for future research.
5.4.2. Self-Rewarding

Rewards play a pivotal role in RL, steering the optimization process. In domains where verification through external tools is straightforward, such as some coding or mathematics scenarios, RL demonstrates exceptional efficacy. However, in more general scenarios, constructing a feedback mechanism through hard coding is impractical. During the development of DeepSeek-V3, for these broader contexts, we employ the constitutional AI approach (Bai et al., 2022), leveraging the voting evaluation results of DeepSeek-V3 itself as a feedback source. This method has produced notable alignment effects, significantly enhancing the performance of DeepSeek-V3 in subjective evaluations. By integrating additional constitutional inputs, DeepSeek-V3 can optimize towards the constitutional direction. We believe that this paradigm, which combines supplementary information with LLMs as a feedback source, is of paramount importance. The LLM serves as a versatile processor capable of transforming unstructured information from diverse scenarios into rewards, ultimately facilitating the self-improvement of LLMs. Beyond self-rewarding, we are also dedicated to uncovering other general and scalable rewarding methods to consistently advance the model capabilities in general scenarios.
5.4.3. Multi-Token Prediction Evaluation

Instead of predicting just the next single token, DeepSeek-V3 predicts the next 2 tokens through the MTP technique. Combined with the framework of speculative decoding (Leviathan et al., 2023; Xia et al., 2023), it can significantly accelerate the decoding speed of the model. A natural question arises concerning the acceptance rate of the additionally predicted token. Based on our evaluation, the acceptance rate of the second token prediction ranges between 85% and 90% across various generation topics, demonstrating consistent reliability. This high acceptance rate enables DeepSeek-V3 to achieve a significantly improved decoding speed, delivering 1.8 times TPS (Tokens Per Second).
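Under the simplifying assumption that verification overhead is negligible, the acceptance rate translates into expected tokens per decoding step as sketched below, which is consistent in magnitude with the reported 1.8x TPS improvement once the extra cost of the MTP head is accounted for.

```python
def expected_tokens_per_step(acceptance_rate: float) -> float:
    """With a single MTP draft token, each decoding step emits the next token plus,
    with probability p, the accepted draft token (idealized, zero-overhead model)."""
    return 1.0 + acceptance_rate

for p in (0.85, 0.90):
    print(f"acceptance {p:.0%}: ~{expected_tokens_per_step(p):.2f}x tokens per step")
```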
6. Conclusion, Limitations, and Future Directions

In this paper, we introduce DeepSeek-V3, a large MoE language model with 671B total parameters and 37B activated parameters, trained on 14.8T tokens. In addition to the MLA and DeepSeekMoE architectures, it also pioneers an auxiliary-loss-free strategy for load balancing and sets a multi-token prediction training objective for stronger performance. The training of DeepSeek-V3 is cost-effective due to the support of FP8 training and meticulous engineering optimizations. The post-training also succeeds in distilling the reasoning capability from the DeepSeek-R1 series of models. Comprehensive evaluations demonstrate that DeepSeek-V3 has emerged as the strongest open-source model currently available, and achieves performance comparable to leading closed-source models like GPT-4o and Claude-3.5-Sonnet. Despite its strong performance, it also maintains economical training costs. It requires only 2.788M H800 GPU hours for its full training, including pre-training, context length extension, and post-training.

While acknowledging its strong performance and cost-effectiveness, we also recognize that DeepSeek-V3 has some limitations, especially regarding deployment. Firstly, to ensure efficient inference, the recommended deployment unit for DeepSeek-V3 is relatively large, which might pose a burden for small-sized teams. Secondly, although our deployment strategy for DeepSeek-V3 has achieved an end-to-end generation speed of more than two times that of DeepSeek-V2, there still remains potential for further enhancement. Fortunately, these limitations are expected to be naturally addressed with the development of more advanced hardware.

DeepSeek consistently adheres to the route of open-source models with longtermism, aiming to steadily approach the ultimate goal of AGI (Artificial General Intelligence). In the future, we plan to strategically invest in research across the following directions.
- We will consistently study and refine our model architectures, aiming to further improve both the training and inference efficiency, striving to approach efficient support for infinite context length. Additionally, we will try to break through the architectural limitations of the Transformer, thereby pushing the boundaries of its modeling capabilities.
- We will continuously iterate on the quantity and quality of our training data, and explore the incorporation of additional training signal sources, aiming to drive data scaling across a more comprehensive range of dimensions.
- We will consistently explore and iterate on the deep thinking capabilities of our models, aiming to enhance their intelligence and problem-solving abilities by expanding their reasoning length and depth.
- We will explore more comprehensive and multi-dimensional model evaluation methods to prevent the tendency towards optimizing a fixed set of benchmarks during research, which may create a misleading impression of the model capabilities and affect our foundational assessment.
References

AI@Meta. Llama 3 model card, 2024a.
AI@Meta. Llama 3.1 model card, 2024b.
Anthropic. Claude 3.5 sonnet, 2024.
J. Austin et al. Program synthesis with large language models. arXiv preprint arXiv:2108.07732, 2021.
Y. Bai, S. Kadavath, S. Kundu, A. Askell, J. Kernion, A. Jones, A. Chen, A. Goldie, A. Mirhoseini, C. McKinnon, et al. Constitutional AI: Harmlessness from AI feedback. arXiv preprint arXiv:2212.08073, 2022.
Y. Bai, S. Tu, J. Zhang, H. Peng, X. Wang, X. Lv, S. Cao, J. Xu, L. Hou, Y. Dong, J. Tang, and J. Li. LongBench v2: Towards deeper understanding and reasoning on realistic long-context multitasks. arXiv preprint arXiv:2412.15204, 2024.
M. Bauer, S. Treichler, and A. Aiken. Singe: leveraging warp specialization for high performance on GPUs. In Proceedings of the 19th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, PPoPP '14, pages 119-130, New York, NY, USA, 2014. Association for Computing Machinery. ISBN 9781450326568. doi: 10.1145/2555243.2555258. URL https://doi.org/10.1145/2555243.2555258.
Y. Bisk, R. Zellers, R. L. Bras, J. Gao, and Y. Choi. PIQA: reasoning about physical commonsense in natural language. In The Thirty-Fourth AAAI Conference on Artificial Intelligence, AAAI 2020, The Thirty-Second Innovative Applications of Artificial Intelligence Conference, IAAI 2020, The Tenth AAAI Symposium on Educational Advances in Artificial Intelligence, EAAI 2020, New York, NY, USA, February 7-12, 2020, pages 7432-7439. AAAI Press, 2020. doi: 10.1609/aaai.v34i05.6239. URL https://doi.org/10.1609/aaai.v34i05.6239.
M. Chen,