DeepSeek-V3 Technical Report

DeepSeek-AI

Abstract

We present DeepSeek-V3, a strong Mixture-of-Experts (MoE) language model with 671B total parameters, of which 37B are activated for each token. To achieve efficient inference and cost-effective training, DeepSeek-V3 adopts Multi-head Latent Attention (MLA) and DeepSeekMoE architectures, which were thoroughly validated in DeepSeek-V2. Furthermore, DeepSeek-V3 pioneers an auxiliary-loss-free strategy for load balancing and sets a multi-token prediction training objective for stronger performance. We pre-train DeepSeek-V3 on 14.8 trillion diverse and high-quality tokens, followed by Supervised Fine-Tuning and Reinforcement Learning stages to fully harness its capabilities. Comprehensive evaluations reveal that DeepSeek-V3 outperforms other open-source models and achieves performance comparable to leading closed-source models. Despite its excellent performance, DeepSeek-V3 requires only 2.788M H800 GPU hours for its full training. In addition, its training process is remarkably stable. Throughout the entire training process, we did not experience any irrecoverable loss spikes or perform any rollbacks. The model checkpoints are available at https://github.com/deepseek-ai/DeepSeek-V3.
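The auxiliary-loss-free load-balancing strategy named above is detailed later in the report (Section 2.1.2). As a rough illustration of the idea only, the following sketch assumes a standard top-k MoE router and shows a per-expert bias that is added to the routing scores purely for expert selection, then nudged up or down according to each expert's observed load, in place of an auxiliary balancing loss. The class and parameter names (SimpleRouter, bias_update_speed) are hypothetical and not taken from the paper.

import torch

# Minimal sketch of bias-based, auxiliary-loss-free load balancing for a
# top-k MoE router (illustrative only; not the paper's implementation).
class SimpleRouter(torch.nn.Module):
    def __init__(self, hidden_dim: int, num_experts: int, top_k: int,
                 bias_update_speed: float = 1e-3):
        super().__init__()
        self.gate = torch.nn.Linear(hidden_dim, num_experts, bias=False)
        # Per-expert bias used only for expert *selection*, never for weighting.
        self.register_buffer("expert_bias", torch.zeros(num_experts))
        self.top_k = top_k
        self.bias_update_speed = bias_update_speed

    def forward(self, x: torch.Tensor):
        # x: [num_tokens, hidden_dim]
        scores = torch.sigmoid(self.gate(x))  # token-to-expert affinities
        # Choose experts using the biased scores (the balancing signal).
        _, expert_idx = torch.topk(scores + self.expert_bias, self.top_k, dim=-1)
        # Gating weights come from the unbiased scores of the chosen experts.
        weights = torch.gather(scores, -1, expert_idx)
        weights = weights / weights.sum(dim=-1, keepdim=True)
        # Outside the gradient, lower the bias of overloaded experts and
        # raise it for underloaded ones, based on how many tokens each received.
        with torch.no_grad():
            load = torch.zeros_like(self.expert_bias)
            load.scatter_add_(0, expert_idx.reshape(-1),
                              torch.ones(expert_idx.numel(), device=x.device))
            self.expert_bias += self.bias_update_speed * torch.sign(load.mean() - load)
        return expert_idx, weights

A dispatch layer would then route each token to its selected experts and combine their outputs using these gating weights.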
Figure 1 | Benchmark performance of DeepSeek-V3 and its counterparts. [Bar chart of accuracy/percentile (%) on benchmarks including MATH 500 (EM), AIME 2024 (Pass@1), Codeforces (Percentile), and SWE-bench Verified (Resolved), comparing DeepSeek-V3, DeepSeek-V2.5, Qwen2.5-72B-Inst, Llama-3.1-405B-Inst, GPT-4o-0513, and Claude-3.5-Sonnet-1022.]

Contents

1 Introduction
2 Architecture
2.1 Basic Architecture
2.1.1 Multi-Head Latent Attention
2.1.2 DeepSeekMoE with Auxiliary-Loss-Free Load Balancing
2.2 Multi-Token Prediction
3 Infra