DeepSeek-V3 Technical Report

DeepSeek-AI

Abstract

We present DeepSeek-V3, a strong Mixture-of-Experts (MoE) language model with 671B total parameters with 37B activated for each token. To achieve efficient inference and cost-effective training, DeepSeek-V3 adopts Multi-head Latent Attention (MLA) and DeepSeekMoE architectures, which were thoroughly validated in DeepSeek-V2. Furthermore, DeepSeek-V3 pioneers an auxiliary-loss-free strategy for load balancing and sets a multi-token prediction training objective for stronger performance. We pre-train DeepSeek-V3 on 14.8 trillion diverse and high-quality tokens, followed by Supervised Fine-Tuning and Reinforcement Learning stages to fully harness its capabilities. Comprehensive evaluations reveal that DeepSeek-V3 outperforms other open-source models and achieves performance comparable to leading closed-source models. Despite its excellent performance, DeepSeek-V3 requires only 2.788M H800 GPU hours for its full training. In addition, its training process is remarkably stable. Throughout the entire training process, we did not experience any irrecoverable loss spikes or perform any rollbacks. The model checkpoints are available at https://
[Figure 1 | Benchmark performance of DeepSeek-V3 and its counterparts. Bar chart reporting Accuracy/Percentile (%) for DeepSeek-V3, DeepSeek-V2.5, Qwen2.5-72B-Inst, Llama-3.1-405B-Inst, GPT-4o-0513, and Claude-3.5-Sonnet-1022 on benchmarks including MATH 500 (EM), AIME 2024 (Pass@1), Codeforces (Percentile), and SWE-bench Verified (Resolved).]

Contents

1 Introduction
2 Architecture
  2.1 Basic Architecture
    2.1.1 Multi-Head Latent Attention
    2.1.2 DeepSeekMoE with Auxiliary-Loss-Free Load Balancing
  2.2 Multi-Token Prediction
3 Infrastructures
  3.1 Compute Clusters
  3.2 Training Framework
    3.2.1 DualPipe and Computation-Communication Overlap
    3.2.2 Efficient Implementation of Cross-Node All-to-All Communication
    3.2.3 Extremely Memory Saving with Minimal Overhead
  3.3 FP8 Training
    3.3.1 Mixed Precision Framework
    3.3.2 Improved Precision from Quantization and Multiplication
    3.3.3 Low-Precision Storage and Communication
  3.4 Inference and Deployment
    3.4.1 Prefilling
    3.4.2 Decoding
  3.5 Suggestions on Hardware Design
    3.5.1 Communication Hardware
    3.5.2 Compute Hardware
4 Pre-Training
  4.1 Data Construction
  4.2 Hyper-Parameters
  4.3 Long Context Extension
  4.4 Evaluations
    4.4.1 Evaluation Benchmarks
    4.4.2 Evaluation Results
  4.5 Discussion
    4.5.1 Ablation Studies for Multi-Token Prediction
    4.5.2 Ablation Studies for the Auxiliary-Loss-Free Balancing Strategy
    4.5.3 Batch-Wise Load Balance vs. Sequence-Wise Load Balance
5 Post-Training
  5.1 Supervised Fine-Tuning
  5.2 Reinforcement Learning
    5.2.1 Reward Model
    5.2.2 Group Relative Policy Optimization
  5.3 Evaluations
    5.3.1 Evaluation Settings
    5.3.2 Standard Evaluation
    5.3.3 Open-Ended Evaluation
    5.3.4 DeepSeek-V3 as a Generative Reward Model
  5.4 Discussion
    5.4.1 Distillation from DeepSeek-R1
    5.4.2 Self-Rewarding
    5.4.3 Multi-Token Prediction Evaluation
6 Conclusion, Limitations, and Future Directions
A Contributions and Acknowledgments
B Ablation Studies for Low-Precision Training
  B.1 FP8 vs. BF16 Training
  B.2 Discussion About Block-Wise Quantization
C Expert Specialization Patterns of the 16B Aux-Loss-Based and Aux-Loss-Free Models
1. Introduction

In recent years, Large Language Models (LLMs) have been undergoing rapid iteration and evolution (Anthropic, 2024; Google, 2024; OpenAI, 2024a), progressively diminishing the gap towards Artificial General Intelligence (AGI). Beyond closed-source models, open-source models, including the DeepSeek series (DeepSeek-AI, 2024a,b,c; Guo et al., 2024), LLaMA series (AI@Meta, 2024a,b; Touvron et al., 2023a,b), Qwen series (Qwen, 2023, 2024a,b), and Mistral series (Jiang et al., 2023; Mistral, 2024), are also making significant strides, endeavoring to close the gap with their closed-source counterparts. To further push the boundaries of open-source model capabilities, we scale up our models and introduce DeepSeek-V3, a large Mixture-of-Experts (MoE) model with 671B parameters, of which 37B are activated for each token.

With a forward-looking perspective, we consistently strive for strong model performance and economical costs. Therefore, in terms of architecture, DeepSeek-V3 still adopts Multi-head Latent Attention (MLA) (DeepSeek-AI, 2024c) for efficient inference and DeepSeekMoE (Dai et al., 2024) for cost-effective training. These two architectures have been validated in DeepSeek-V2 (DeepSeek-AI, 2024c), demonstrating their capability to maintain robust model performance while achieving efficient training and inference. Beyond the basic architecture, we implement two additional strategies to further enhance the model capabilities. Firstly, DeepSeek-V3 pioneers an auxiliary-loss-free strategy (Wang et al., 2024a) for load balancing, with the aim of minimizing the adverse impact on model performance that arises from the effort to encourage load balancing. Secondly, DeepSeek-V3 employs a multi-token prediction training objective, which we have observed to enhance the overall performance on evaluation benchmarks.

In order to achieve efficient training, we support FP8 mixed precision training and implement comprehensive optimizations for the training framework. Low-precision training has emerged as a promising solution for efficient training (Dettmers et al., 2022; Kalamkar et al., 2019; Narang et al., 2017; Peng et al., 2023b), its evolution being closely tied to advancements in hardware capabilities (Luo et al., 2024; Micikevicius et al., 2022; Rouhani et al., 2023a). In this work, we introduce an FP8 mixed precision training framework and, for the first time, validate its effectiveness on an extremely large-scale model. Through the support for FP8 computation and storage, we achieve both accelerated training and reduced GPU memory usage. As for the training framework, we design the DualPipe algorithm for efficient pipeline parallelism, which has fewer pipeline bubbles and hides most of the communication during training through computation-communication overlap. This overlap ensures that, as the model further scales up, as long as we maintain a constant computation-to-communication ratio, we can still employ fine-grained experts across nodes while achieving a near-zero all-to-all communication overhead. In addition, we also develop efficient cross-node all-to-all communication kernels to fully utilize InfiniBand (IB) and NVLink bandwidths. Furthermore, we meticulously optimize the memory footprint, making it possible to train DeepSeek-V3 without using costly tensor parallelism. Combining these efforts, we achieve high training efficiency.

During pre-training, we train DeepSeek-V3 on 14.8T high-quality and diverse tokens. The pre-training process is remarkably stable. Throughout the entire training process, we did not encounter any irrecoverable loss spikes or have to roll back. Next, we conduct a two-stage context length extension for DeepSeek-V3. In the first stage, the maximum context length is extended to 32K, and in the second stage, it is further extended to 128K. Following this, we conduct post-training, including Supervised Fine-Tuning (SFT) and Reinforcement Learning (RL) on the base model of DeepSeek-V3, to align it with human preferences and further unlock its potential.
During the post-training stage, we distill the reasoning capability from the DeepSeek-R1 series of models, and meanwhile carefully maintain the balance between model accuracy and generation length.

| Training Costs    | Pre-Training | Context Extension | Post-Training | Total   |
| in H800 GPU Hours | 2664K        | 119K              | 5K            | 2788K   |
| in USD            | $5.328M      | $0.238M           | $0.01M        | $5.576M |

Table 1 | Training costs of DeepSeek-V3, assuming the rental price of the H800 is $2 per GPU hour.

We evaluate DeepSeek-V3 on a comprehensive array of benchmarks.
Despite its economical training costs, comprehensive evaluations reveal that DeepSeek-V3-Base has emerged as the strongest open-source base model currently available, especially in code and math. Its chat version also outperforms other open-source models and achieves performance comparable to leading closed-source models, including GPT-4o and Claude-3.5-Sonnet, on a series of standard and open-ended benchmarks.

Lastly, we emphasize again the economical training costs of DeepSeek-V3, summarized in Table 1, achieved through our optimized co-design of algorithms, frameworks, and hardware. During the pre-training stage, training DeepSeek-V3 on each trillion tokens requires only 180K H800 GPU hours, i.e., 3.7 days on our cluster with 2048 H800 GPUs. Consequently, our pre-training stage is completed in less than two months and costs 2664K GPU hours. Combined with 119K GPU hours for the context length extension and 5K GPU hours for post-training, DeepSeek-V3 costs only 2.788M GPU hours for its full training. Assuming the rental price of the H800 GPU is $2 per GPU hour, our total training costs amount to only $5.576M. Note that the aforementioned costs include only the official training of DeepSeek-V3, excluding the costs associated with prior research and ablation experiments on architectures, algorithms, or data.
Our main contributions include:

Architecture: Innovative Load Balancing Strategy and Training Objective
- On top of the efficient architecture of DeepSeek-V2, we pioneer an auxiliary-loss-free strategy for load balancing, which minimizes the performance degradation that arises from encouraging load balancing.
- We investigate a Multi-Token Prediction (MTP) objective and prove it beneficial to model performance. It can also be used for speculative decoding for inference acceleration.

Pre-Training: Towards Ultimate Training Efficiency
- We design an FP8 mixed precision training framework and, for the first time, validate the feasibility and effectiveness of FP8 training on an extremely large-scale model.
- Through the co-design of algorithms, frameworks, and hardware, we overcome the communication bottleneck in cross-node MoE training, achieving near-full computation-communication overlap. This significantly enhances our training efficiency and reduces the training costs, enabling us to further scale up the model size without additional overhead.
- At an economical cost of only 2.664M H800 GPU hours, we complete the pre-training of DeepSeek-V3 on 14.8T tokens, producing the currently strongest open-source base model. The subsequent training stages after pre-training require only 0.1M GPU hours.

Post-Training: Knowledge Distillation from DeepSeek-R1
- We introduce an innovative methodology to distill reasoning capabilities from the long-Chain-of-Thought (CoT) model, specifically from one of the DeepSeek R1 series models, into standard LLMs, particularly DeepSeek-V3. Our pipeline elegantly incorporates the verification and reflection patterns of R1 into DeepSeek-V3 and notably improves its reasoning performance. Meanwhile, we also maintain control over the output style and length of DeepSeek-V3.

Summary of Core Evaluation Results
- Knowledge: (1) On educational benchmarks such as MMLU, MMLU-Pro, and GPQA, DeepSeek-V3 outperforms all other open-source models, achieving 88.5 on MMLU, 75.9 on MMLU-Pro, and 59.1 on GPQA. Its performance is comparable to leading closed-source models like GPT-4o and Claude-Sonnet-3.5, narrowing the gap between open-source and closed-source models in this domain. (2) For factuality benchmarks, DeepSeek-V3 demonstrates superior performance among open-source models on both SimpleQA and Chinese SimpleQA. While it trails behind GPT-4o and Claude-Sonnet-3.5 in English factual knowledge (SimpleQA), it surpasses these models in Chinese factual knowledge (Chinese SimpleQA), highlighting its strength in Chinese factual knowledge.
- Code, Math, and Reasoning: (1) DeepSeek-V3 achieves state-of-the-art performance on math-related benchmarks among all non-long-CoT open-source and closed-source models. Notably, it even outperforms o1-preview on specific benchmarks, such as MATH-500, demonstrating its robust mathematical reasoning capabilities. (2) On coding-related tasks, DeepSeek-V3 emerges as the top-performing model for coding competition benchmarks, such as LiveCodeBench, solidifying its position as the leading model in this domain. For engineering-related tasks, while DeepSeek-V3 performs slightly below Claude-Sonnet-3.5, it still outpaces all other models by a significant margin, demonstrating its competitiveness across diverse technical benchmarks.
In the remainder of this paper, we first present a detailed exposition of our DeepSeek-V3 model architecture (Section 2). Subsequently, we introduce our infrastructures, encompassing our compute clusters, the training framework, the support for FP8 training, the inference deployment strategy, and our suggestions on future hardware design. Next, we describe our pre-training process, including the construction of training data, hyper-parameter settings, long-context extension techniques, the associated evaluations, as well as some discussions (Section 4). Thereafter, we discuss our efforts on post-training, which include Supervised Fine-Tuning (SFT), Reinforcement Learning (RL), the corresponding evaluations, and discussions (Section 5). Lastly, we conclude this work, discuss existing limitations of DeepSeek-V3, and propose potential directions for future research (Section 6).
2. Architecture

We first introduce the basic architecture of DeepSeek-V3, featured by Multi-head Latent Attention (MLA) (DeepSeek-AI, 2024c) for efficient inference and DeepSeekMoE (Dai et al., 2024) for economical training. Then, we present a Multi-Token Prediction (MTP) training objective, which we have observed to enhance the overall performance on evaluation benchmarks. For other minor details not explicitly mentioned, DeepSeek-V3 adheres to the settings of DeepSeek-V2 (DeepSeek-AI, 2024c).

2.1. Basic Architecture

The basic architecture of DeepSeek-V3 is still within the Transformer (Vaswani et al., 2017) framework. For efficient inference and economical training, DeepSeek-V3 also adopts MLA and DeepSeekMoE, which have been thoroughly validated by DeepSeek-V2. Compared with DeepSeek-V2, an exception is that we additionally introduce an auxiliary-loss-free load balancing strategy (Wang et al., 2024a) for DeepSeekMoE to mitigate the performance degradation induced by the effort to ensure load balance. Figure 2 illustrates the basic architecture of DeepSeek-V3, and we will briefly review the details of MLA and DeepSeekMoE in this section.

[Figure 2 | Illustration of the basic architecture of DeepSeek-V3. Following DeepSeek-V2, we adopt MLA and DeepSeekMoE for efficient inference and economical training. The diagram shows a Transformer block with RMSNorm, Multi-Head Latent Attention (whose compressed latent vectors are cached during inference), and a DeepSeekMoE feed-forward layer with a router, shared experts, and top-K routed experts.]

2.1.1. Multi-Head Latent Attention
For attention, DeepSeek-V3 adopts the MLA architecture. Let $d$ denote the embedding dimension, $n_h$ denote the number of attention heads, $d_h$ denote the dimension per head, and $\mathbf{h}_t \in \mathbb{R}^{d}$ denote the attention input for the $t$-th token at a given attention layer. The core of MLA is the low-rank joint compression for attention keys and values to reduce the Key-Value (KV) cache during inference:

$\mathbf{c}_t^{KV} = W^{DKV} \mathbf{h}_t$,  (1)
$[\mathbf{k}_{t,1}^{C}; \mathbf{k}_{t,2}^{C}; \ldots; \mathbf{k}_{t,n_h}^{C}] = \mathbf{k}_t^{C} = W^{UK} \mathbf{c}_t^{KV}$,  (2)
$\mathbf{k}_t^{R} = \mathrm{RoPE}(W^{KR} \mathbf{h}_t)$,  (3)
$\mathbf{k}_{t,i} = [\mathbf{k}_{t,i}^{C}; \mathbf{k}_t^{R}]$,  (4)
$[\mathbf{v}_{t,1}^{C}; \mathbf{v}_{t,2}^{C}; \ldots; \mathbf{v}_{t,n_h}^{C}] = \mathbf{v}_t^{C} = W^{UV} \mathbf{c}_t^{KV}$,  (5)

where $\mathbf{c}_t^{KV} \in \mathbb{R}^{d_c}$ is the compressed latent vector for keys and values; $d_c$ ($\ll d_h n_h$) indicates the KV compression dimension; $W^{DKV} \in \mathbb{R}^{d_c \times d}$ denotes the down-projection matrix; $W^{UK}, W^{UV} \in \mathbb{R}^{d_h n_h \times d_c}$ are the up-projection matrices for keys and values, respectively; $W^{KR} \in \mathbb{R}^{d_h^R \times d}$ is the matrix used to produce the decoupled key that carries Rotary Positional Embedding (RoPE) (Su et al., 2024); $\mathrm{RoPE}(\cdot)$ denotes the operation that applies RoPE matrices; and $[\cdot;\cdot]$ denotes concatenation. Note that for MLA, only $\mathbf{c}_t^{KV}$ and $\mathbf{k}_t^{R}$ need to be cached during generation, which results in significantly reduced KV cache while maintaining performance comparable to standard Multi-Head Attention (MHA) (Vaswani et al., 2017).

For the attention queries, we also perform a low-rank compression, which can reduce the activation memory during training:

$\mathbf{c}_t^{Q} = W^{DQ} \mathbf{h}_t$,  (6)
$[\mathbf{q}_{t,1}^{C}; \mathbf{q}_{t,2}^{C}; \ldots; \mathbf{q}_{t,n_h}^{C}] = \mathbf{q}_t^{C} = W^{UQ} \mathbf{c}_t^{Q}$,  (7)
$[\mathbf{q}_{t,1}^{R}; \mathbf{q}_{t,2}^{R}; \ldots; \mathbf{q}_{t,n_h}^{R}] = \mathbf{q}_t^{R} = \mathrm{RoPE}(W^{QR} \mathbf{c}_t^{Q})$,  (8)
$\mathbf{q}_{t,i} = [\mathbf{q}_{t,i}^{C}; \mathbf{q}_{t,i}^{R}]$,  (9)

where $\mathbf{c}_t^{Q} \in \mathbb{R}^{d_c'}$ is the compressed latent vector for queries; $d_c'$ ($\ll d_h n_h$) denotes the query compression dimension; $W^{DQ} \in \mathbb{R}^{d_c' \times d}$ and $W^{UQ} \in \mathbb{R}^{d_h n_h \times d_c'}$ are the down-projection and up-projection matrices for queries, respectively; and $W^{QR} \in \mathbb{R}^{d_h^R n_h \times d_c'}$ is the matrix to produce the decoupled queries that carry RoPE.

Ultimately, the attention queries ($\mathbf{q}_{t,i}$), keys ($\mathbf{k}_{j,i}$), and values ($\mathbf{v}_{j,i}^{C}$) are combined to yield the final attention output $\mathbf{u}_t$:

$\mathbf{o}_{t,i} = \sum_{j=1}^{t} \mathrm{Softmax}_j\left(\frac{\mathbf{q}_{t,i}^{\top}\mathbf{k}_{j,i}}{\sqrt{d_h + d_h^R}}\right) \mathbf{v}_{j,i}^{C}$,  (10)
$\mathbf{u}_t = W^{O} [\mathbf{o}_{t,1}; \mathbf{o}_{t,2}; \ldots; \mathbf{o}_{t,n_h}]$,  (11)

where $W^{O} \in \mathbb{R}^{d \times d_h n_h}$ denotes the output projection matrix.
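To make the caching behavior concrete, the following PyTorch-style sketch (illustrative only; module and variable names are ours, not the released implementation) traces the KV path of Eqs. (1)-(5) with the hyper-parameters reported in Section 4.2, omitting RoPE application and the query path:

```python
import torch
import torch.nn as nn

class MLAKVCompression(nn.Module):
    """Minimal sketch of MLA's low-rank KV compression (Eqs. 1-5)."""
    def __init__(self, d_model=7168, n_heads=128, d_head=128, d_c=512, d_rope=64):
        super().__init__()
        self.n_heads, self.d_head = n_heads, d_head
        self.w_dkv = nn.Linear(d_model, d_c, bias=False)          # W^{DKV}: down-projection
        self.w_uk = nn.Linear(d_c, n_heads * d_head, bias=False)  # W^{UK}: key up-projection
        self.w_uv = nn.Linear(d_c, n_heads * d_head, bias=False)  # W^{UV}: value up-projection
        self.w_kr = nn.Linear(d_model, d_rope, bias=False)        # W^{KR}: decoupled RoPE key

    def forward(self, h):                 # h: [batch, seq, d_model]
        c_kv = self.w_dkv(h)              # compressed latent, cached during inference
        k_rope = self.w_kr(h)             # decoupled key (RoPE applied in the full model), also cached
        # Only c_kv and k_rope are kept in the KV cache; per-head keys/values are
        # re-materialized from the latent on the fly.
        b, s, _ = h.shape
        k_c = self.w_uk(c_kv).view(b, s, self.n_heads, self.d_head)
        v_c = self.w_uv(c_kv).view(b, s, self.n_heads, self.d_head)
        return c_kv, k_rope, k_c, v_c
```

Because only the $d_c$-dimensional latent and the shared $d_h^R$-dimensional RoPE key are cached, the per-token cache under this head configuration is roughly $d_c + d_h^R = 576$ values, compared with the $2 n_h d_h = 32{,}768$ values a standard MHA cache would hold.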
2.1.2. DeepSeekMoE with Auxiliary-Loss-Free Load Balancing

Basic Architecture of DeepSeekMoE. For Feed-Forward Networks (FFNs), DeepSeek-V3 employs the DeepSeekMoE architecture (Dai et al., 2024). Compared with traditional MoE architectures like GShard (Lepikhin et al., 2021), DeepSeekMoE uses finer-grained experts and isolates some experts as shared ones. Let $\mathbf{u}_t$ denote the FFN input of the $t$-th token; we compute the FFN output $\mathbf{h}_t'$ as follows:

$\mathbf{h}_t' = \mathbf{u}_t + \sum_{i=1}^{N_s} \mathrm{FFN}_i^{(s)}(\mathbf{u}_t) + \sum_{i=1}^{N_r} g_{i,t}\, \mathrm{FFN}_i^{(r)}(\mathbf{u}_t)$,  (12)
$g_{i,t} = \frac{g_{i,t}'}{\sum_{j=1}^{N_r} g_{j,t}'}$,  (13)
$g_{i,t}' = \begin{cases} s_{i,t}, & s_{i,t} \in \mathrm{Topk}(\{s_{j,t} \mid 1 \leq j \leq N_r\}, K_r), \\ 0, & \text{otherwise}, \end{cases}$  (14)
$s_{i,t} = \mathrm{Sigmoid}(\mathbf{u}_t^{\top} \mathbf{e}_i)$,  (15)

where $N_s$ and $N_r$ denote the numbers of shared experts and routed experts, respectively; $\mathrm{FFN}_i^{(s)}(\cdot)$ and $\mathrm{FFN}_i^{(r)}(\cdot)$ denote the $i$-th shared expert and the $i$-th routed expert, respectively; $K_r$ denotes the number of activated routed experts; $g_{i,t}$ is the gating value for the $i$-th expert; $s_{i,t}$ is the token-to-expert affinity; $\mathbf{e}_i$ is the centroid vector of the $i$-th routed expert; and $\mathrm{Topk}(\cdot, K)$ denotes the set comprising the $K$ highest scores among the affinity scores calculated for the $t$-th token and all routed experts. Slightly different from DeepSeek-V2, DeepSeek-V3 uses the sigmoid function to compute the affinity scores, and applies a normalization among all selected affinity scores to produce the gating values.

Auxiliary-Loss-Free Load Balancing. For MoE models, an unbalanced expert load will lead to routing collapse (Shazeer et al., 2017) and diminish computational efficiency in scenarios with expert parallelism. Conventional solutions usually rely on the auxiliary loss (Fedus et al., 2021; Lepikhin et al., 2021) to avoid unbalanced load. However, too large an auxiliary loss will impair the model performance (Wang et al., 2024a). To achieve a better trade-off between load balance and model performance, we pioneer an auxiliary-loss-free load balancing strategy (Wang et al., 2024a) to ensure load balance. To be specific, we introduce a bias term $b_i$ for each expert and add it to the corresponding affinity scores to determine the top-K routing:

$g_{i,t}' = \begin{cases} s_{i,t}, & s_{i,t} + b_i \in \mathrm{Topk}(\{s_{j,t} + b_j \mid 1 \leq j \leq N_r\}, K_r), \\ 0, & \text{otherwise}. \end{cases}$  (16)

Note that the bias term is only used for routing. The gating value, which will be multiplied with the FFN output, is still derived from the original affinity score $s_{i,t}$. During training, we keep monitoring the expert load on the whole batch of each training step. At the end of each step, we decrease the bias term by $\gamma$ if its corresponding expert is overloaded, and increase it by $\gamma$ if its corresponding expert is underloaded, where $\gamma$ is a hyper-parameter called the bias update speed. Through this dynamic adjustment, DeepSeek-V3 keeps balanced expert load during training, and achieves better performance than models that encourage load balance through pure auxiliary losses.

Complementary Sequence-Wise Auxiliary Loss. Although DeepSeek-V3 mainly relies on the auxiliary-loss-free strategy for load balance, to prevent extreme imbalance within any single sequence, we also employ a complementary sequence-wise balance loss:

$\mathcal{L}_{\mathrm{Bal}} = \alpha \sum_{i=1}^{N_r} f_i P_i$,  (17)
$f_i = \frac{N_r}{K_r T} \sum_{t=1}^{T} \mathbb{1}\left(s_{i,t} \in \mathrm{Topk}(\{s_{j,t} \mid 1 \leq j \leq N_r\}, K_r)\right)$,  (18)
$s_{i,t}' = \frac{s_{i,t}}{\sum_{j=1}^{N_r} s_{j,t}}$,  (19)
$P_i = \frac{1}{T} \sum_{t=1}^{T} s_{i,t}'$,  (20)

where the balance factor $\alpha$ is a hyper-parameter, which will be assigned an extremely small value for DeepSeek-V3; $\mathbb{1}(\cdot)$ denotes the indicator function; and $T$ denotes the number of tokens in a sequence. The sequence-wise balance loss encourages the expert load on each sequence to be balanced.
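As an illustration of how Eqs. (13)-(16) and the per-step bias update interact, the sketch below (plain PyTorch; function names are hypothetical and the batching/dispatch machinery of the real framework is omitted) routes a batch of tokens and then nudges the per-expert biases:

```python
import torch

def route_tokens(u, centroids, bias, k_r=8):
    """Bias-adjusted top-K routing (a sketch of Eqs. 13-16).

    u:         [num_tokens, d_model]  FFN inputs
    centroids: [n_experts, d_model]   expert centroid vectors e_i
    bias:      [n_experts]            per-expert bias b_i (used for routing only)
    """
    s = torch.sigmoid(u @ centroids.T)                 # affinities s_{i,t}, Eq. (15)
    topk_idx = (s + bias).topk(k_r, dim=-1).indices    # selection uses s + b, Eq. (16)
    gates = torch.zeros_like(s).scatter(-1, topk_idx, s.gather(-1, topk_idx))
    gates = gates / gates.sum(-1, keepdim=True)        # normalize selected scores, Eq. (13)
    return gates, topk_idx

def update_bias(bias, topk_idx, n_experts, gamma=1e-3):
    """End-of-step update with bias update speed gamma: lower overloaded experts, raise the rest."""
    load = torch.bincount(topk_idx.flatten(), minlength=n_experts).float()
    overloaded = load > load.mean()
    return bias - gamma * overloaded.float() + gamma * (~overloaded).float()
```

Note that the returned gates (derived from the raw affinities, not the biased scores) are what multiply the expert outputs in Eq. (12); the bias only shifts which experts are selected.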
[Figure 3 | Illustration of our Multi-Token Prediction (MTP) implementation. We keep the complete causal chain for the prediction of each token at each depth. The main model performs next-token prediction, while MTP Module 1 and MTP Module 2 predict the second and third next tokens; each module uses the shared embedding layer and shared output head, RMSNorm, a linear projection over the concatenated inputs, a Transformer block, and its own cross-entropy loss.]

Node-Limited Routing. Like the device-limited routing used by DeepSeek-V2, DeepSeek-V3 also uses a restricted routing mechanism to limit communication costs during training. In short, we ensure that each token will be sent to at most $M$ nodes, which are selected according to the sum of the highest $\frac{K_r}{M}$ affinity scores of the experts distributed on each node. Under this constraint, our MoE training framework can nearly achieve full computation-communication overlap.

No Token-Dropping. Due to the effective load balancing strategy, DeepSeek-V3 keeps a good load balance during its full training. Therefore, DeepSeek-V3 does not drop any tokens during training. In addition, we also implement specific deployment strategies to ensure inference load balance, so DeepSeek-V3 also does not drop tokens during inference.

2.2. Multi-Token Prediction
Inspired by Gloeckle et al. (2024), we investigate and set a Multi-Token Prediction (MTP) objective for DeepSeek-V3, which extends the prediction scope to multiple future tokens at each position. On the one hand, an MTP objective densifies the training signals and may improve data efficiency. On the other hand, MTP may enable the model to pre-plan its representations for better prediction of future tokens. Figure 3 illustrates our implementation of MTP. Different from Gloeckle et al. (2024), which parallelly predicts $D$ additional tokens using independent output heads, we sequentially predict additional tokens and keep the complete causal chain at each prediction depth. We introduce the details of our MTP implementation in this section.

MTP Modules. To be specific, our MTP implementation uses $D$ sequential modules to predict $D$ additional tokens. The $k$-th MTP module consists of a shared embedding layer $\mathrm{Emb}(\cdot)$, a shared output head $\mathrm{OutHead}(\cdot)$, a Transformer block $\mathrm{TRM}_k(\cdot)$, and a projection matrix $M_k \in \mathbb{R}^{d \times 2d}$. For the $i$-th input token $t_i$, at the $k$-th prediction depth, we first combine the representation of the $i$-th token at the $(k-1)$-th depth $\mathbf{h}_i^{k-1} \in \mathbb{R}^{d}$ and the embedding of the $(i+k)$-th token $\mathrm{Emb}(t_{i+k}) \in \mathbb{R}^{d}$ with the linear projection:

$\mathbf{h}_i'^{k} = M_k [\mathrm{RMSNorm}(\mathbf{h}_i^{k-1}); \mathrm{RMSNorm}(\mathrm{Emb}(t_{i+k}))]$,  (21)

where $[\cdot;\cdot]$ denotes concatenation. Especially, when $k = 1$, $\mathbf{h}_i^{k-1}$ refers to the representation given by the main model. Note that for each MTP module, its embedding layer is shared with the main model. The combined $\mathbf{h}_i'^{k}$ serves as the input of the Transformer block at the $k$-th depth to produce the output representation at the current depth $\mathbf{h}_i^{k}$:

$\mathbf{h}_{1:T-k}^{k} = \mathrm{TRM}_k(\mathbf{h}_{1:T-k}'^{k})$,  (22)

where $T$ represents the input sequence length and $i\!:\!j$ denotes the slicing operation (inclusive of both the left and right boundaries). Finally, taking $\mathbf{h}_i^{k}$ as the input, the shared output head will compute the probability distribution for the $k$-th additional prediction token $P_{i+1+k}^{k} \in \mathbb{R}^{V}$, where $V$ is the vocabulary size:

$P_{i+k+1}^{k} = \mathrm{OutHead}(\mathbf{h}_i^{k})$.  (23)

The output head $\mathrm{OutHead}(\cdot)$ linearly maps the representation to logits and subsequently applies the $\mathrm{Softmax}(\cdot)$ function to compute the prediction probabilities of the $k$-th additional token. Also, for each MTP module, its output head is shared with the main model. Our principle of maintaining the causal chain of predictions is similar to that of EAGLE (Li et al., 2024b), but its primary objective is speculative decoding (Leviathan et al., 2023; Xia et al., 2023), whereas we utilize MTP to improve training.

MTP Training Objective. For each prediction depth, we compute a cross-entropy loss $\mathcal{L}_{\mathrm{MTP}}^{k}$:

$\mathcal{L}_{\mathrm{MTP}}^{k} = \mathrm{CrossEntropy}(P_{2+k:T+1}^{k}, t_{2+k:T+1}) = -\frac{1}{T} \sum_{i=2+k}^{T+1} \log P_i^{k}[t_i]$,  (24)

where $T$ denotes the input sequence length, $t_i$ denotes the ground-truth token at the $i$-th position, and $P_i^{k}[t_i]$ denotes the corresponding prediction probability of $t_i$, given by the $k$-th MTP module. Finally, we compute the average of the MTP losses across all depths and multiply it by a weighting factor $\lambda$ to obtain the overall MTP loss $\mathcal{L}_{\mathrm{MTP}}$, which serves as an additional training objective for DeepSeek-V3:

$\mathcal{L}_{\mathrm{MTP}} = \frac{\lambda}{D} \sum_{k=1}^{D} \mathcal{L}_{\mathrm{MTP}}^{k}$.  (25)

MTP in Inference. Our MTP strategy mainly aims to improve the performance of the main model, so during inference, we can directly discard the MTP modules and the main model can function independently and normally. Additionally, we can also repurpose these MTP modules for speculative decoding to further reduce the generation latency.
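The following sketch (PyTorch-style, with a stand-in Transformer block and simplified shapes; it is not the HAI-LLM implementation) shows how one MTP module combines the previous-depth representation with the shifted token embeddings and produces the depth-$k$ loss of Eq. (24):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def rms_norm(x, eps=1e-6):
    # Weightless RMSNorm, kept minimal for the sketch.
    return x * torch.rsqrt(x.pow(2).mean(-1, keepdim=True) + eps)

class MTPModule(nn.Module):
    """Sketch of the k-th MTP module (Eqs. 21-23). The embedding layer and output head
    are the main model's own modules, passed in so that they are physically shared;
    `transformer_block` stands in for the model's actual Transformer block."""
    def __init__(self, d_model, shared_embedding, shared_head, transformer_block):
        super().__init__()
        self.embed, self.head, self.block = shared_embedding, shared_head, transformer_block
        self.proj = nn.Linear(2 * d_model, d_model, bias=False)  # projection matrix M_k

    def forward(self, h_prev, tokens_shifted_by_k):
        # h_prev: [B, T-k, d] representations from depth k-1 (the main model when k = 1)
        # tokens_shifted_by_k: [B, T-k] the tokens t_{i+k} aligned with positions i = 1..T-k
        x = self.proj(torch.cat([rms_norm(h_prev),
                                 rms_norm(self.embed(tokens_shifted_by_k))], dim=-1))  # Eq. (21)
        h_k = self.block(x)                                                            # Eq. (22)
        return h_k, self.head(h_k)                                                     # Eq. (23), logits

def mtp_depth_loss(logits, targets):
    # Cross-entropy over the k-th additional targets (Eq. 24), averaged over tokens.
    return F.cross_entropy(logits.flatten(0, 1), targets.flatten())
```

The per-depth losses would then be averaged and scaled by $\lambda$ as in Eq. (25); at inference time the module can simply be dropped, or its logits reused as draft predictions for speculative decoding.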
81、 model canfunction independently and normally.Additionally,we can also repurpose these MTP modulesfor speculative decoding to further improve the generation latency.3.Infrastructures3.1.Compute ClustersDeepSeek-V3 is trained on a cluster equipped with 2048 NVIDIA H800 GPUs.Each node inthe H800 clust
82、er contains 8 GPUs connected by NVLink and NVSwitch within nodes.Acrossdifferent nodes,InfiniBand(IB)interconnects are utilized to facilitate communications.11ComputationMLP(B)MLP(W)MLP(F)ATTN(B)ATTN(W)ATTN(F)CommunicationDISPATCH(F)DISPATCH(B)COMBINE(F)PPCOMBINE(B)Time Forward chunk Backward chunkF
83、igure 4|Overlapping strategy for a pair of individual forward and backward chunks(theboundaries of the transformer blocks are not aligned).Orange denotes forward,green denotesbackward for input,blue denotes backward for weights,purple denotes PP communication,and red denotes barriers.Both all-to-all
84、 and PP communication can be fully hidden.3.2.Training FrameworkThe training of DeepSeek-V3 is supported by the HAI-LLM framework,an efficient andlightweight training framework crafted by our engineers from the ground up.On the whole,DeepSeek-V3 applies 16-way Pipeline Parallelism(PP)(Qi et al.,2023
85、a),64-way Expert Paral-lelism(EP)(Lepikhin et al.,2021)spanning 8 nodes,and ZeRO-1 Data Parallelism(DP)(Rajb-handari et al.,2020).In order to facilitate efficient training of DeepSeek-V3,we implement meticulous engineeringoptimizations.Firstly,we design the DualPipe algorithm for efficient pipeline
86、parallelism.Compared with existing PP methods,DualPipe has fewer pipeline bubbles.More importantly,itoverlaps the computation and communication phases across forward and backward processes,thereby addressing the challenge of heavy communication overhead introduced by cross-nodeexpert parallelism.Sec
87、ondly,we develop efficient cross-node all-to-all communication kernelsto fully utilize IB and NVLink bandwidths and conserve Streaming Multiprocessors(SMs)dedicated to communication.Finally,we meticulously optimize the memory footprint duringtraining,thereby enabling us to train DeepSeek-V3 without
3.2.1. DualPipe and Computation-Communication Overlap

For DeepSeek-V3, the communication overhead introduced by cross-node expert parallelism results in an inefficient computation-to-communication ratio of approximately 1:1. To tackle this challenge, we design an innovative pipeline parallelism algorithm called DualPipe, which not only accelerates model training by effectively overlapping forward and backward computation-communication phases, but also reduces the pipeline bubbles.

The key idea of DualPipe is to overlap the computation and communication within a pair of individual forward and backward chunks. To be specific, we divide each chunk into four components: attention, all-to-all dispatch, MLP, and all-to-all combine. Specially, for a backward chunk, both attention and MLP are further split into two parts, backward for input and backward for weights, like in ZeroBubble (Qi et al., 2023b). In addition, we have a PP communication component. As illustrated in Figure 4, for a pair of forward and backward chunks, we rearrange these components and manually adjust the ratio of GPU SMs dedicated to communication versus computation. In this overlapping strategy, we can ensure that both all-to-all and PP communication can be fully hidden during execution. Given the efficient overlapping strategy, the full DualPipe scheduling is illustrated in Figure 5. It employs a bidirectional pipeline scheduling, which feeds micro-batches from both ends of the pipeline simultaneously, and a significant portion of communications can be fully overlapped. This overlap also ensures that, as the model further scales up, as long as we maintain a constant computation-to-communication ratio, we can still employ fine-grained experts across nodes while achieving a near-zero all-to-all communication overhead.
[Figure 5 | Example DualPipe scheduling for 8 PP ranks and 20 micro-batches in two directions. The micro-batches in the reverse direction are symmetric to those in the forward direction, so we omit their batch IDs for illustration simplicity. Two cells enclosed by a shared black border have mutually overlapped computation and communication.]

| Method          | Bubble                               | Parameter | Activation |
| 1F1B            | $(PP-1)(F+B)$                        | $1\times$ | $PP$       |
| ZB1P            | $(PP-1)(F+B-2W)$                     | $1\times$ | $PP$       |
| DualPipe (Ours) | $(\frac{PP}{2}-1)(F{\&}B + B - 3W)$  | $2\times$ | $PP+1$     |

Table 2 | Comparison of pipeline bubbles and memory usage across different pipeline parallel methods. $F$ denotes the execution time of a forward chunk, $B$ denotes the execution time of a full backward chunk, $W$ denotes the execution time of a "backward for weights" chunk, and $F{\&}B$ denotes the execution time of two mutually overlapped forward and backward chunks.
In addition, even in more general scenarios without a heavy communication burden, DualPipe still exhibits efficiency advantages. In Table 2, we summarize the pipeline bubbles and memory usage across different PP methods. As shown in the table, compared with ZB1P (Qi et al., 2023b) and 1F1B (Harlap et al., 2018), DualPipe significantly reduces the pipeline bubbles while only increasing the peak activation memory by $\frac{1}{PP}$ times. Although DualPipe requires keeping two copies of the model parameters, this does not significantly increase the memory consumption since we use a large EP size during training. Compared with Chimera (Li and Hoefler, 2021), DualPipe only requires that the pipeline stages and micro-batches be divisible by 2, without requiring micro-batches to be divisible by pipeline stages. In addition, for DualPipe, neither the bubbles nor activation memory will increase as the number of micro-batches grows.
3.2.2. Efficient Implementation of Cross-Node All-to-All Communication

In order to ensure sufficient computational performance for DualPipe, we customize efficient cross-node all-to-all communication kernels (including dispatching and combining) to conserve the number of SMs dedicated to communication. The implementation of the kernels is co-designed with the MoE gating algorithm and the network topology of our cluster. To be specific, in our cluster, cross-node GPUs are fully interconnected with IB, and intra-node communications are handled via NVLink. NVLink offers a bandwidth of 160 GB/s, roughly 3.2 times that of IB (50 GB/s). To effectively leverage the different bandwidths of IB and NVLink, we limit each token to be dispatched to at most 4 nodes, thereby reducing IB traffic. For each token, when its routing decision is made, it will first be transmitted via IB to the GPUs with the same in-node index on its target nodes. Once it reaches the target nodes, we will endeavor to ensure that it is instantaneously forwarded via NVLink to the specific GPUs that host its target experts, without being blocked by subsequently arriving tokens. In this way, communications via IB and NVLink are fully overlapped, and each token can efficiently select an average of 3.2 experts per node without incurring additional overhead from NVLink. This implies that, although DeepSeek-V3 selects only 8 routed experts in practice, it can scale up this number to a maximum of 13 experts (4 nodes x 3.2 experts/node) while preserving the same communication cost. Overall, under such a communication strategy, only 20 SMs are sufficient to fully utilize the bandwidths of IB and NVLink.
In detail, we employ the warp specialization technique (Bauer et al., 2014) and partition 20 SMs into 10 communication channels. During the dispatching process, (1) IB sending, (2) IB-to-NVLink forwarding, and (3) NVLink receiving are handled by respective warps. The number of warps allocated to each communication task is dynamically adjusted according to the actual workload across all SMs. Similarly, during the combining process, (1) NVLink sending, (2) NVLink-to-IB forwarding and accumulation, and (3) IB receiving and accumulation are also handled by dynamically adjusted warps. In addition, both dispatching and combining kernels overlap with the computation stream, so we also consider their impact on other SM computation kernels. Specifically, we employ customized PTX (Parallel Thread Execution) instructions and auto-tune the communication chunk size, which significantly reduces the use of the L2 cache and the interference to other SMs.
3.2.3. Extremely Memory Saving with Minimal Overhead

In order to reduce the memory footprint during training, we employ the following techniques.

Recomputation of RMSNorm and MLA Up-Projection. We recompute all RMSNorm operations and MLA up-projections during back-propagation, thereby eliminating the need to persistently store their output activations. With a minor overhead, this strategy significantly reduces memory requirements for storing activations.
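A minimal way to express this trade-off, assuming standard PyTorch activation checkpointing rather than the custom HAI-LLM machinery, is to wrap the cheap operations so their outputs are re-materialized in the backward pass:

```python
import torch
from torch.utils.checkpoint import checkpoint

def rms_norm(x, weight, eps=1e-6):
    return weight * x * torch.rsqrt(x.pow(2).mean(-1, keepdim=True) + eps)

# The output of rms_norm_recomputed is not stored for the backward pass; it is
# recomputed from (x, weight) when gradients are needed. The same wrapping applies
# to the MLA up-projections.
def rms_norm_recomputed(x, weight):
    return checkpoint(rms_norm, x, weight, use_reentrant=False)
```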
Exponential Moving Average in CPU. During training, we preserve the Exponential Moving Average (EMA) of the model parameters for early estimation of the model performance after learning rate decay. The EMA parameters are stored in CPU memory and are updated asynchronously after each training step. This method allows us to maintain EMA parameters without incurring additional memory or time overhead.
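The sketch below illustrates the idea under simple assumptions (a blocking device-to-host copy instead of a truly asynchronous one on a side stream; all names are ours):

```python
import torch

class CPUEMA:
    """Keep an EMA of parameters in CPU memory, updated after each optimizer step."""
    def __init__(self, model, decay=0.999):
        self.decay = decay
        self.shadow = {name: p.detach().float().cpu().clone()
                       for name, p in model.named_parameters()}

    @torch.no_grad()
    def update(self, model):
        # Called after the optimizer step; with pinned buffers and a separate CUDA
        # stream, the copies can overlap with the next step's compute.
        for name, p in model.named_parameters():
            self.shadow[name].mul_(self.decay).add_(p.detach().float().cpu(),
                                                    alpha=1.0 - self.decay)
```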
Shared Embedding and Output Head for Multi-Token Prediction. With the DualPipe strategy, we deploy the shallowest layers (including the embedding layer) and deepest layers (including the output head) of the model on the same PP rank. This arrangement enables the physical sharing of parameters and gradients, of the shared embedding and output head, between the MTP module and the main model. This physical sharing mechanism further enhances our memory efficiency.

3.3. FP8 Training

[Figure 6 | The overall mixed precision framework with FP8 data format. For clarification, only the Linear operator is illustrated. The Fprop, Dgrad, and Wgrad GEMMs consume inputs, weights, and output gradients cast to FP8 and produce BF16 or FP32 outputs, while the master weights and optimizer states are kept in higher precision.]

Inspired by recent advances in low-precision training (Dettmers et al., 2022; Noune et al., 2022; Peng et al., 2023b), we propose a fine-grained mixed precision framework utilizing the FP8 data format for training DeepSeek-V3. While low-precision training holds great promise, it is often limited by the presence of outliers in activations, weights, and gradients (Fishman et al., 2024; He et al.; Sun et al., 2024). Although significant progress has been made in inference quantization (Frantar et al., 2022; Xiao et al., 2023), there are relatively few studies demonstrating successful application of low-precision techniques in large-scale language model pre-training (Fishman et al., 2024).
To address this challenge and effectively extend the dynamic range of the FP8 format, we introduce a fine-grained quantization strategy: tile-wise grouping with $1 \times N_C$ elements or block-wise grouping with $N_C \times N_C$ elements. The associated dequantization overhead is largely mitigated under our increased-precision accumulation process, a critical aspect for achieving accurate FP8 General Matrix Multiplication (GEMM). Moreover, to further reduce memory and communication overhead in MoE training, we cache and dispatch activations in FP8, while storing low-precision optimizer states in BF16. We validate the proposed FP8 mixed precision framework on two model scales similar to DeepSeek-V2-Lite and DeepSeek-V2, training for approximately 1 trillion tokens (see more details in Appendix B.1). Notably, compared with the BF16 baseline, the relative loss error of our FP8-training model remains consistently below 0.25%, a level well within the acceptable range of training randomness.

3.3.1. Mixed Precision Framework
Building upon widely adopted techniques in low-precision training (Kalamkar et al., 2019; Narang et al., 2017), we propose a mixed precision framework for FP8 training. In this framework, most compute-density operations are conducted in FP8, while a few key operations are strategically maintained in their original data formats to balance training efficiency and numerical stability. The overall framework is illustrated in Figure 6.

Firstly, in order to accelerate model training, the majority of core computation kernels, i.e., GEMM operations, are implemented in FP8 precision. These GEMM operations accept FP8 tensors as inputs and produce outputs in BF16 or FP32. As depicted in Figure 6, all three GEMMs associated with the Linear operator, namely Fprop (forward pass), Dgrad (activation backward pass), and Wgrad (weight backward pass), are executed in FP8. This design theoretically doubles the computational speed compared with the original BF16 method. Additionally, the FP8 Wgrad GEMM allows activations to be stored in FP8 for use in the backward pass. This significantly reduces memory consumption.

Despite the efficiency advantage of the FP8 format, certain operators still require a higher precision due to their sensitivity to low-precision computations. Besides, some low-cost operators can also utilize a higher precision with a negligible overhead to the overall training cost. For this reason, after careful investigations, we maintain the original precision (e.g., BF16 or FP32) for the following components: the embedding module, the output head, MoE gating modules, normalization operators, and attention operators. These targeted retentions of high precision ensure stable training dynamics for DeepSeek-V3. To further guarantee numerical stability, we store the master weights, weight gradients, and optimizer states in higher precision. While these high-precision components incur some memory overheads, their impact can be minimized through efficient sharding across multiple DP ranks in our distributed training system.
[Figure 7 | (a) We propose a fine-grained quantization method to mitigate quantization errors caused by feature outliers; for illustration simplicity, only Fprop is illustrated. (b) In conjunction with our quantization strategy, we improve the FP8 GEMM precision by promoting to CUDA Cores at an interval of $N_C = 128$ elements MMA for the high-precision accumulation.]

3.3.2. Improved Precision from Quantization and Multiplication
Based on our mixed precision FP8 framework, we introduce several strategies to enhance low-precision training accuracy, focusing on both the quantization method and the multiplication process.

Fine-Grained Quantization. In low-precision training frameworks, overflows and underflows are common challenges due to the limited dynamic range of the FP8 format, which is constrained by its reduced exponent bits. As a standard practice, the input distribution is aligned to the representable range of the FP8 format by scaling the maximum absolute value of the input tensor to the maximum representable value of FP8 (Narang et al., 2017). This method makes low-precision training highly sensitive to activation outliers, which can heavily degrade quantization accuracy. To solve this, we propose a fine-grained quantization method that applies scaling at a more granular level. As illustrated in Figure 7 (a), (1) for activations, we group and scale elements on a 1x128 tile basis (i.e., per token per 128 channels); and (2) for weights, we group and scale elements on a 128x128 block basis (i.e., per 128 input channels per 128 output channels). This approach ensures that the quantization process can better accommodate outliers by adapting the scale according to smaller groups of elements. In Appendix B.2, we further discuss the training instability when we group and scale activations on a block basis in the same way as weights quantization.

One key modification in our method is the introduction of per-group scaling factors along the inner dimension of GEMM operations. This functionality is not directly supported in the standard FP8 GEMM. However, combined with our precise FP32 accumulation strategy, it can be efficiently implemented.
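The sketch below shows the grouping and scale computation under stated assumptions (PyTorch's float8_e4m3fn dtype as the storage format and shapes divisible by 128); the production path fuses these steps into the FP8 GEMM kernels rather than materializing them as separate tensors:

```python
import torch

FP8_E4M3_MAX = 448.0  # maximum magnitude representable by E4M3

def quantize_activation_tile_wise(x, tile=128):
    """Per-(1 x 128) tile scaling for activations: one scale per token per 128 channels."""
    t, c = x.shape
    xg = x.view(t, c // tile, tile)
    scale = xg.abs().amax(dim=-1, keepdim=True).clamp(min=1e-4) / FP8_E4M3_MAX
    xq = (xg / scale).to(torch.float8_e4m3fn)          # quantized payload
    return xq.view(t, c), scale.squeeze(-1)            # scales kept for later dequantization

def quantize_weight_block_wise(w, block=128):
    """Per-(128 x 128) block scaling for weights."""
    o, i = w.shape
    wg = w.view(o // block, block, i // block, block)
    scale = wg.abs().amax(dim=(1, 3), keepdim=True).clamp(min=1e-4) / FP8_E4M3_MAX
    wq = (wg / scale).to(torch.float8_e4m3fn)
    return wq.view(o, i), scale.squeeze(1).squeeze(-1)
```

Because the per-tile maxima are computed from the current tensor, the same helpers also express the online scale derivation described later in this section.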
Notably, our fine-grained quantization strategy is highly consistent with the idea of microscaling formats (Rouhani et al., 2023b), while the Tensor Cores of NVIDIA next-generation GPUs (Blackwell series) have announced the support for microscaling formats with smaller quantization granularity (NVIDIA, 2024a). We hope our design can serve as a reference for future work to keep pace with the latest GPU architectures.

Increasing Accumulation Precision. Low-precision GEMM operations often suffer from underflow issues, and their accuracy largely depends on high-precision accumulation, which is commonly performed in FP32 precision (Kalamkar et al., 2019; Narang et al., 2017). However, we observe that the accumulation precision of FP8 GEMM on NVIDIA H800 GPUs is limited to retaining around 14 bits, which is significantly lower than FP32 accumulation precision. This problem will become more pronounced when the inner dimension $K$ is large (Wortsman et al., 2023), a typical scenario in large-scale model training where the batch size and model width are increased. Taking GEMM operations of two random matrices with $K = 4096$ for example, in our preliminary test, the limited accumulation precision in Tensor Cores results in a maximum relative error of nearly 2%. Despite these problems, the limited accumulation precision is still the default option in a few FP8 frameworks (NVIDIA, 2024b), severely constraining the training accuracy.

In order to address this issue, we adopt the strategy of promotion to CUDA Cores for higher precision (Thakkar et al., 2023). The process is illustrated in Figure 7 (b). To be specific, during MMA (Matrix Multiply-Accumulate) execution on Tensor Cores, intermediate results are accumulated using the limited bit width. Once an interval of $N_C$ is reached, these partial results will be copied to FP32 registers on CUDA Cores, where full-precision FP32 accumulation is performed. As mentioned before, our fine-grained quantization applies per-group scaling factors along the inner dimension $K$. These scaling factors can be efficiently multiplied on the CUDA Cores as the dequantization process with minimal additional computational cost.

It is worth noting that this modification reduces the WGMMA (Warpgroup-level Matrix Multiply-Accumulate) instruction issue rate for a single warpgroup. However, on the H800 architecture, it is typical for two WGMMA to persist concurrently: while one warpgroup performs the promotion operation, the other is able to execute the MMA operation. This design enables overlapping of the two operations, maintaining high utilization of Tensor Cores. Based on our experiments, setting $N_C = 128$ elements, equivalent to 4 WGMMAs, represents the minimal accumulation interval that can significantly improve precision without introducing substantial overhead.
Mantissa over Exponents. In contrast to the hybrid FP8 format adopted by prior work (NVIDIA, 2024b; Peng et al., 2023b; Sun et al., 2019b), which uses E4M3 (4-bit exponent and 3-bit mantissa) in Fprop and E5M2 (5-bit exponent and 2-bit mantissa) in Dgrad and Wgrad, we adopt the E4M3 format on all tensors for higher precision. We attribute the feasibility of this approach to our fine-grained quantization strategy, i.e., tile- and block-wise scaling. By operating on smaller element groups, our methodology effectively shares exponent bits among these grouped elements, mitigating the impact of the limited dynamic range.
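For reference, the trade-off between the two FP8 encodings can be inspected directly in a PyTorch build that exposes the FP8 dtypes (an illustrative check, not part of the training framework):

```python
import torch

# E4M3 keeps one more mantissa bit than E5M2 at the cost of dynamic range
# (max magnitude roughly 448 vs. 57344); per-group scaling compensates for the range.
for dtype in (torch.float8_e4m3fn, torch.float8_e5m2):
    print(dtype, "max representable:", torch.finfo(dtype).max)
```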
146、ximum absolute17values across prior iterations to infer the current value.In order to ensure accurate scales andsimplify the framework,we calculate the maximum absolute value online for each1x128acti-vation tile or128x128weight block.Based on it,we derive the scaling factor and then quantizethe acti
147、vation or weight online into the FP8 format.3.3.3.Low-Precision Storage and CommunicationIn conjunction with our FP8 training framework,we further reduce the memory consumptionand communication overhead by compressing cached activations and optimizer states intolower-precision formats.Low-Precision
Low-Precision Optimizer States. We adopt the BF16 data format instead of FP32 to track the first and second moments in the AdamW (Loshchilov and Hutter, 2017) optimizer, without incurring observable performance degradation. However, the master weights (stored by the optimizer) and gradients (used for batch size accumulation) are still retained in FP32 to ensure numerical stability throughout training.

Low-Precision Activation. As illustrated in Figure 6, the Wgrad operation is performed in FP8. To reduce the memory consumption, it is a natural choice to cache activations in FP8 format for the backward pass of the Linear operator. However, special considerations are taken on several operators for low-cost high-precision training:

(1) Inputs of the Linear after the attention operator. These activations are also used in the backward pass of the attention operator, which makes it sensitive to precision. We adopt a customized E5M6 data format exclusively for these activations. Additionally, these activations will be converted from a 1x128 quantization tile to a 128x1 tile in the backward pass. To avoid introducing extra quantization error, all the scaling factors are round scaled, i.e., integral power of 2.

(2) Inputs of the SwiGLU operator in MoE. To further reduce the memory cost, we cache the inputs of the SwiGLU operator and recompute its output in the backward pass. These activations are also stored in FP8 with our fine-grained quantization method, striking a balance between memory efficiency and computational accuracy.
Low-Precision Communication. Communication bandwidth is a critical bottleneck in the training of MoE models. To alleviate this challenge, we quantize the activation before MoE up-projections into FP8 and then apply dispatch components, which is compatible with FP8 Fprop in MoE up-projections. Like the inputs of the Linear after the attention operator, scaling factors for this activation are integral powers of 2. A similar strategy is applied to the activation gradient before MoE down-projections. For both the forward and backward combine components, we retain them in BF16 to preserve training precision in critical parts of the training pipeline.

3.4. Inference and Deployment

We deploy DeepSeek-V3 on the H800 cluster, where GPUs within each node are interconnected using NVLink, and all GPUs across the cluster are fully interconnected via IB. To simultaneously ensure both the Service-Level Objective (SLO) for online services and high throughput, we employ the following deployment strategy that separates the prefilling and decoding stages.

3.4.1. Prefilling
The minimum deployment unit of the prefilling stage consists of 4 nodes with 32 GPUs. The attention part employs 4-way Tensor Parallelism (TP4) with Sequence Parallelism (SP), combined with 8-way Data Parallelism (DP8). Its small TP size of 4 limits the overhead of TP communication. For the MoE part, we use 32-way Expert Parallelism (EP32), which ensures that each expert processes a sufficiently large batch size, thereby enhancing computational efficiency. For the MoE all-to-all communication, we use the same method as in training: first transferring tokens across nodes via IB, and then forwarding among the intra-node GPUs via NVLink. In particular, we use 1-way Tensor Parallelism for the dense MLPs in shallow layers to save TP communication.

To achieve load balancing among different experts in the MoE part, we need to ensure that each GPU processes approximately the same number of tokens. To this end, we introduce a deployment strategy of redundant experts, which duplicates high-load experts and deploys them redundantly. The high-load experts are detected based on statistics collected during the online deployment and are adjusted periodically (e.g., every 10 minutes). After determining the set of redundant experts, we carefully rearrange experts among GPUs within a node based on the observed loads, striving to balance the load across GPUs as much as possible without increasing the cross-node all-to-all communication overhead. For the deployment of DeepSeek-V3, we set 32 redundant experts for the prefilling stage. For each GPU, besides the original 8 experts it hosts, it will also host one additional redundant expert.

Furthermore, in the prefilling stage, to improve the throughput and hide the overhead of all-to-all and TP communication, we simultaneously process two micro-batches with similar computational workloads, overlapping the attention and MoE of one micro-batch with the dispatch and combine of another.

Finally, we are exploring a dynamic redundancy strategy for experts, where each GPU hosts more experts (e.g., 16 experts), but only 9 will be activated during each inference step. Before the all-to-all operation at each layer begins, we compute the globally optimal routing scheme on the fly. Given the substantial computation involved in the prefilling stage, the overhead of computing this routing scheme is almost negligible.
3.4.2. Decoding

During decoding, we treat the shared expert as a routed one. From this perspective, each token will select 9 experts during routing, where the shared expert is regarded as a heavy-load one that will always be selected. The minimum deployment unit of the decoding stage consists of 40 nodes with 320 GPUs. The attention part employs TP4 with SP, combined with DP80, while the MoE part uses EP320. For the MoE part, each GPU hosts only one expert, and 64 GPUs are responsible for hosting redundant experts and shared experts. All-to-all communication of the dispatch and combine parts is performed via direct point-to-point transfers over IB to achieve low latency. Additionally, we leverage the IBGDA (NVIDIA, 2022) technology to further minimize latency and enhance communication efficiency.

Similar to prefilling, we periodically determine the set of redundant experts in a certain interval, based on the statistical expert load from our online service. However, we do not need to rearrange experts since each GPU only hosts one expert. We are also exploring the dynamic redundancy strategy for decoding. However, this requires more careful optimization of the algorithm that computes the globally optimal routing scheme and the fusion with the dispatch kernel to reduce overhead.

Additionally, to enhance throughput and hide the overhead of all-to-all communication, we are also exploring processing two micro-batches with similar computational workloads simultaneously in the decoding stage. Unlike prefilling, attention consumes a larger portion of time in the decoding stage. Therefore, we overlap the attention of one micro-batch with the dispatch + MoE + combine of another. In the decoding stage, the batch size per expert is relatively small (usually within 256 tokens), and the bottleneck is memory access rather than computation. Since the MoE part only needs to load the parameters of one expert, the memory access overhead is minimal, so using fewer SMs will not significantly affect the overall performance. Therefore, to avoid impacting the computation speed of the attention part, we can allocate only a small portion of SMs to dispatch + MoE + combine.
3.5. Suggestions on Hardware Design

Based on our implementation of the all-to-all communication and FP8 training scheme, we propose the following suggestions on chip design to AI hardware vendors.

3.5.1. Communication Hardware

In DeepSeek-V3, we implement the overlap between computation and communication to hide the communication latency during computation. This significantly reduces the dependency on communication bandwidth compared to serial computation and communication. However, the current communication implementation relies on expensive SMs (e.g., we allocate 20 out of the 132 SMs available in the H800 GPU for this purpose), which will limit the computational throughput. Moreover, using SMs for communication results in significant inefficiencies, as tensor cores remain entirely under-utilized.

Currently, the SMs primarily perform the following tasks for all-to-all communication:
- Forwarding data between the IB (InfiniBand) and NVLink domain while aggregating IB traffic destined for multiple GPUs within the same node from a single GPU.
- Transporting data between RDMA buffers (registered GPU memory regions) and input/output buffers.
- Executing reduce operations for all-to-all combine.
- Managing fine-grained memory layout during chunked data transferring to multiple experts across the IB and NVLink domain.

We aspire to see future vendors developing hardware that offloads these communication tasks from the valuable computation unit SM, serving as a GPU co-processor or a network co-processor like NVIDIA SHARP (Graham et al., 2016). Furthermore, to reduce application programming complexity, we aim for this hardware to unify the IB (scale-out) and NVLink (scale-up) networks from the perspective of the computation units. With this unified interface, computation units can easily accomplish operations such as read, write, multicast, and reduce across the entire IB-NVLink-unified domain via submitting communication requests based on simple primitives.

3.5.2. Compute Hardware
Higher FP8 GEMM Accumulation Precision in Tensor Cores. In the current Tensor Core implementation of the NVIDIA Hopper architecture, FP8 GEMM (General Matrix Multiply) employs fixed-point accumulation, aligning the mantissa products by right-shifting based on the maximum exponent before addition. Our experiments reveal that it only uses the highest 14 bits of each mantissa product after sign-fill right shifting, and truncates bits exceeding this range. However, for example, to achieve precise FP32 results from the accumulation of 32 FP8xFP8 multiplications, at least 34-bit precision is required. Thus, we recommend that future chip designs increase accumulation precision in Tensor Cores to support full-precision accumulation, or select an appropriate accumulation bit-width according to the accuracy requirements of training and inference algorithms. This approach ensures that errors remain within acceptable bounds while maintaining computational efficiency.

Support for Tile- and Block-Wise Quantization. Current GPUs only support per-tensor quantization, lacking the native support for fine-grained quantization like our tile- and block-wise quantization. In the current implementation, when the $N_C$ interval is reached, the partial results will be copied from Tensor Cores to CUDA Cores, multiplied by the scaling factors, and added to FP32 registers on CUDA Cores. Although the dequantization overhead is significantly mitigated combined with our precise FP32 accumulation strategy, the frequent data movements between Tensor Cores and CUDA Cores still limit the computational efficiency. Therefore, we recommend future chips to support fine-grained quantization by enabling Tensor Cores to receive scaling factors and implement MMA with group scaling. In this way, the whole partial sum accumulation and dequantization can be completed directly inside Tensor Cores until the final result is produced, avoiding frequent data movements.
184、 effectively supportonline quantization,despite its effectiveness demonstrated in our research.In the existingprocess,we need to read 128 BF16 activation values(the output of the previous computation)from HBM(High Bandwidth Memory)for quantization,and the quantized FP8 values arethen written back to
185、 HBM,only to be read again for MMA.To address this inefficiency,werecommend that future chips integrate FP8 cast and TMA(Tensor Memory Accelerator)accessinto a single fused operation,so quantization can be completed during the transfer of activationsfrom global memory to shared memory,avoiding frequ
186、ent memory reads and writes.We alsorecommend supporting a warp-level cast instruction for speedup,which further facilitates thebetter fusion of layer normalization and FP8 cast.Alternatively,a near-memory computingapproach can be adopted,where compute logic is placed near the HBM.In this case,BF16el
187、ements can be cast to FP8 directly as they are read from HBM into the GPU,reducing off-chipmemory access by roughly 50%.Support for Transposed GEMM Operations.The current architecture makes it cumbersometo fuse matrix transposition with GEMM operations.In our workflow,activations during theforward p
Support for Transposed GEMM Operations. The current architecture makes it cumbersome to fuse matrix transposition with GEMM operations. In our workflow, activations during the forward pass are quantized into 1x128 FP8 tiles and stored. During the backward pass, the matrix needs to be read out, dequantized, transposed, re-quantized into 128x1 tiles, and stored in HBM. To reduce memory operations, we recommend that future chips enable direct transposed reads of matrices from shared memory before the MMA operation, for those precisions required in both training and inference. Combined with the fusion of FP8 format conversion and TMA access, this enhancement will significantly streamline the quantization workflow.
4. Pre-Training

4.1. Data Construction

Compared with DeepSeek-V2, we optimize the pre-training corpus by enhancing the ratio of mathematical and programming samples, while expanding multilingual coverage beyond English and Chinese. Also, our data processing pipeline is refined to minimize redundancy while maintaining corpus diversity. Inspired by Ding et al. (2024), we implement the document packing method for data integrity but do not incorporate cross-sample attention masking during training. Finally, the training corpus for DeepSeek-V3 consists of 14.8T high-quality and diverse tokens in our tokenizer.

In the training process of DeepSeekCoder-V2 (DeepSeek-AI, 2024a), we observe that the Fill-in-Middle (FIM) strategy does not compromise the next-token prediction capability while enabling the model to accurately predict middle text based on contextual cues. In alignment with DeepSeekCoder-V2, we also incorporate the FIM strategy in the pre-training of DeepSeek-V3. To be specific, we employ the Prefix-Suffix-Middle (PSM) framework to structure data as follows:

<|fim_begin|> f_pre <|fim_hole|> f_suf <|fim_end|> f_middle <|eos_token|>.

This structure is applied at the document level as a part of the pre-packing process. The FIM strategy is applied at a rate of 0.1, consistent with the PSM framework.
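A sketch of how such a document-level FIM transformation could be applied during data preparation is shown below. The special-token strings follow the PSM template above, while the random cut-point heuristic, the function name, and the omission of the end-of-sequence token are illustrative assumptions rather than the actual pipeline.

```python
import random

FIM_RATE = 0.1  # rate reported above

def apply_fim_psm(doc: str, rng: random.Random, fim_rate: float = FIM_RATE) -> str:
    """With probability fim_rate, rearrange a document into PSM order:
    <|fim_begin|> prefix <|fim_hole|> suffix <|fim_end|> middle."""
    if rng.random() >= fim_rate or len(doc) < 3:
        return doc                                  # most documents stay as plain next-token data
    i, j = sorted(rng.sample(range(1, len(doc)), 2))  # two random cut points (assumed heuristic)
    pre, middle, suf = doc[:i], doc[i:j], doc[j:]
    return f"<|fim_begin|>{pre}<|fim_hole|>{suf}<|fim_end|>{middle}"

print(apply_fim_psm("def add(a, b):\n    return a + b\n", random.Random(0), fim_rate=1.0))
```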
The tokenizer for DeepSeek-V3 employs Byte-level BPE (Shibata et al., 1999) with an extended vocabulary of 128K tokens. The pretokenizer and training data for our tokenizer are modified to optimize multilingual compression efficiency. In addition, compared with DeepSeek-V2, the new pretokenizer introduces tokens that combine punctuations and line breaks. However, this trick may introduce the token boundary bias (Lundberg, 2023) when the model processes multi-line prompts without terminal line breaks, particularly for few-shot evaluation prompts. To address this issue, we randomly split a certain proportion of such combined tokens during training, which exposes the model to a wider array of special cases and mitigates this bias.
4.2. Hyper-Parameters

Model Hyper-Parameters. We set the number of Transformer layers to 61 and the hidden dimension to 7168. All learnable parameters are randomly initialized with a standard deviation of 0.006. In MLA, we set the number of attention heads to 128 and the per-head dimension to 128. The KV compression dimension is set to 512, and the query compression dimension is set to 1536. For the decoupled queries and key, we set the per-head dimension to 64. We substitute all FFNs except for the first three layers with MoE layers. Each MoE layer consists of 1 shared expert and 256 routed experts, where the intermediate hidden dimension of each expert is 2048. Among the routed experts, 8 experts will be activated for each token, and each token will be ensured to be sent to at most 4 nodes. The multi-token prediction depth is set to 1, i.e., besides the exact next token, each token will predict one additional token. As in DeepSeek-V2, DeepSeek-V3 also employs additional RMSNorm layers after the compressed latent vectors, and multiplies additional scaling factors at the width bottlenecks. Under this configuration, DeepSeek-V3 comprises 671B total parameters, of which 37B are activated for each token.
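As a rough sanity check of these totals, the routed experts alone account for most of the parameter budget. The sketch below uses only the configuration stated above (61 layers, the first three FFNs dense, 1 shared plus 256 routed experts per MoE layer, hidden size 7168, expert intermediate size 2048) and assumes the standard gated-FFN layout with three projection matrices per expert; everything else (MLA attention, embeddings, the dense FFN layers, the MTP module) is treated as a residual shared by both counts.

```python
# Back-of-the-envelope parameter accounting for the MoE configuration above.
hidden, expert_inter = 7168, 2048
moe_layers = 61 - 3                          # all FFNs except the first three layers are MoE
expert_params = 3 * hidden * expert_inter    # gate/up/down projections of one expert (assumed layout)

routed_total   = 256 * expert_params * moe_layers   # parameters held by all routed experts
routed_active  = 8   * expert_params * moe_layers   # routed parameters touched per token
shared_experts = 1   * expert_params * moe_layers   # always-active shared experts

print(f"routed experts, total : {routed_total / 1e9:.0f}B")   # ~654B
print(f"routed experts, active: {routed_active / 1e9:.0f}B")  # ~20B
print(f"shared experts        : {shared_experts / 1e9:.0f}B") # ~3B
# The remaining ~15B of non-expert parameters is counted in both the 671B total
# and the 37B activated figures, which is roughly how the stated split arises.
```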
Training Hyper-Parameters. We employ the AdamW optimizer (Loshchilov and Hutter, 2017) with hyper-parameters set to β1 = 0.9, β2 = 0.95, and weight_decay = 0.1. We set the maximum sequence length to 4K during pre-training, and pre-train DeepSeek-V3 on 14.8T tokens. As for the learning rate scheduling, we first linearly increase it from 0 to 2.2 × 10⁻⁴ during the first 2K steps. Then, we keep a constant learning rate of 2.2 × 10⁻⁴ until the model consumes 10T training tokens. Subsequently, we gradually decay the learning rate to 2.2 × 10⁻⁵ over 4.3T tokens, following a cosine decay curve. During the training of the final 500B tokens, we keep a constant learning rate of 2.2 × 10⁻⁵ for the first 333B tokens, and switch to another constant learning rate of 7.3 × 10⁻⁶ for the remaining 167B tokens. The gradient clipping norm is set to 1.0. We employ a batch size scheduling strategy, where the batch size is gradually increased from 3072 to 15360 over the training of the first 469B tokens, and then kept at 15360 for the remaining training. We leverage pipeline parallelism to deploy different layers of a model on different GPUs, and for each layer, the routed experts are uniformly deployed on 64 GPUs belonging to 8 nodes. As for the node-limited routing, each token will be sent to at most 4 nodes (i.e., M = 4). For auxiliary-loss-free load balancing, we set the bias update speed γ to 0.001 for the first 14.3T tokens, and to 0.0 for the remaining 500B tokens. For the balance loss, we set α to 0.0001, just to avoid extreme imbalance within any single sequence. The MTP loss weight λ is set to 0.3 for the first 10T tokens, and to 0.1 for the remaining 4.8T tokens.
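The learning-rate schedule above can be summarized as a small piecewise function of the training step and the number of tokens consumed. The sketch below simply transcribes the values from the text; the function itself is illustrative, not the training code.

```python
import math

WARMUP_STEPS  = 2_000
PEAK_LR       = 2.2e-4
CONST_TOKENS  = 10.0e12          # constant at the peak LR until 10T tokens
DECAY_TOKENS  = 4.3e12           # cosine decay over the next 4.3T tokens
FLOOR_LR      = 2.2e-5
FINAL_STAGE_1 = 0.333e12         # first 333B of the final 500B tokens
FINAL_LR      = 7.3e-6           # remaining 167B tokens

def learning_rate(step: int, tokens_consumed: float) -> float:
    if step < WARMUP_STEPS:                                # linear warmup from 0
        return PEAK_LR * step / WARMUP_STEPS
    if tokens_consumed <= CONST_TOKENS:                    # constant phase
        return PEAK_LR
    if tokens_consumed <= CONST_TOKENS + DECAY_TOKENS:     # cosine decay 2.2e-4 -> 2.2e-5
        progress = (tokens_consumed - CONST_TOKENS) / DECAY_TOKENS
        return FLOOR_LR + 0.5 * (PEAK_LR - FLOOR_LR) * (1 + math.cos(math.pi * progress))
    if tokens_consumed <= CONST_TOKENS + DECAY_TOKENS + FINAL_STAGE_1:
        return FLOOR_LR                                    # constant 2.2e-5 for 333B tokens
    return FINAL_LR                                        # constant 7.3e-6 for the last 167B
```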
Figure 8 | Evaluation results on the "Needle In A Haystack" (NIAH) tests. DeepSeek-V3 performs well across all context window lengths up to 128K.

4.3. Long Context Extension

We adopt a similar approach to DeepSeek-V2 (DeepSeek-AI, 2024c) to enable long context capabilities in DeepSeek-V3. After the pre-training stage, we apply YaRN (Peng et al., 2023a) for context extension and perform two additional training phases, each comprising 1000 steps, to progressively expand the context window from 4K to 32K and then to 128K. The YaRN configuration is consistent with that used in DeepSeek-V2, being applied exclusively to the decoupled shared key. The hyper-parameters remain identical across both phases, with the scale s = 40, α = 1, β = 32, and the scaling factor √t = 0.1 ln s + 1. In the first phase, the sequence length is set to 32K, and the batch size is 1920. During the second phase, the sequence length is increased to 128K, and the batch size is reduced to 480. The learning rate for both phases is set to 7.3 × 10⁻⁶, matching the final learning rate from the pre-training stage.
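A simplified sketch of a YaRN-style adjustment with these hyper-parameters is given below: low-frequency rotary dimensions are interpolated by the scale s, high-frequency ones are left unchanged, with a linear ramp between α and β rotations, and the attention scaling factor follows √t = 0.1 ln s + 1. The rotary base of 10,000 and the exact placement of the scaling factor are assumptions borrowed from common open-source implementations, not details stated here.

```python
import math

def yarn_frequencies(dim: int = 64, base: float = 10000.0, orig_len: int = 4096,
                     scale: float = 40.0, alpha: float = 1.0, beta: float = 32.0):
    """Simplified YaRN-style per-dimension frequency adjustment (assumed rotary base)."""
    freqs = [base ** (-2 * i / dim) for i in range(dim // 2)]
    adjusted = []
    for f in freqs:
        rotations = orig_len * f / (2 * math.pi)   # periods completed inside the original window
        if rotations <= alpha:                      # low-frequency dims: fully interpolate
            gamma = 0.0
        elif rotations >= beta:                     # high-frequency dims: leave unchanged
            gamma = 1.0
        else:                                       # linear ramp in between
            gamma = (rotations - alpha) / (beta - alpha)
        adjusted.append((1 - gamma) * f / scale + gamma * f)
    mscale = 0.1 * math.log(scale) + 1.0            # sqrt(t) = 0.1 ln s + 1 from the text
    return adjusted, mscale

_, m = yarn_frequencies()
print(f"attention scaling factor sqrt(t) = {m:.3f}")  # ~1.369 for s = 40
```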
Through this two-phase extension training, DeepSeek-V3 is capable of handling inputs up to 128K in length while maintaining strong performance. Figure 8 illustrates that DeepSeek-V3, following supervised fine-tuning, achieves notable performance on the Needle In A Haystack (NIAH) test, demonstrating consistent robustness across context window lengths up to 128K.

4.4. Evaluations

4.4.1. Evaluation Benchmarks

The base model of DeepSeek-V3 is pretrained on a multilingual corpus with English and Chinese constituting the majority, so we evaluate its performance on a series of benchmarks primarily in English and Chinese, as well as on a multilingual benchmark. Our evaluation is based on our internal evaluation framework integrated in our HAI-LLM framework. Considered benchmarks are categorized and listed as follows, where underlined benchmarks are in Chinese and double-underlined benchmarks are multilingual ones:

- Multi-subject multiple-choice datasets include MMLU (Hendrycks et al., 2020), MMLU-Redux (Gema et al., 2024), MMLU-Pro (Wang et al., 2024b), MMMLU (OpenAI, 2024b), C-Eval (Huang et al., 2023), and CMMLU (Li et al., 2023).
- Language understanding and reasoning datasets include HellaSwag (Zellers et al., 2019), PIQA (Bisk et al., 2020), ARC (Clark et al., 2018), and BigBench Hard (BBH) (Suzgun et al., 2022).
- Closed-book question answering datasets include TriviaQA (Joshi et al., 2017) and NaturalQuestions (Kwiatkowski et al., 2019).
- Reading comprehension datasets include RACE (Lai et al., 2017), DROP (Dua et al., 2019), C3 (Sun et al., 2019a), and CMRC (Cui et al., 2019).
- Reference disambiguation datasets include CLUEWSC (Xu et al., 2020) and WinoGrande (Sakaguchi et al., 2019).
- Language modeling datasets include Pile (Gao et al., 2020).
- Chinese understanding and culture datasets include CCPM (Li et al., 2021).
- Math datasets include GSM8K (Cobbe et al., 2021), MATH (Hendrycks et al., 2021), MGSM (Shi et al., 2023), and CMath (Wei et al., 2023).
- Code datasets include HumanEval (Chen et al., 2021), LiveCodeBench-Base (0801-1101) (Jain et al., 2024), MBPP (Austin et al., 2021), and CRUXEval (Gu et al., 2024).
- Standardized exams include AGIEval (Zhong et al., 2023). Note that AGIEval includes both English and Chinese subsets.
Following our previous work (DeepSeek-AI, 2024b,c), we adopt perplexity-based evaluation for datasets including HellaSwag, PIQA, WinoGrande, RACE-Middle, RACE-High, MMLU, MMLU-Redux, MMLU-Pro, MMMLU, ARC-Easy, ARC-Challenge, C-Eval, CMMLU, C3, and CCPM, and adopt generation-based evaluation for TriviaQA, NaturalQuestions, DROP, MATH, GSM8K, MGSM, HumanEval, MBPP, LiveCodeBench-Base, CRUXEval, BBH, AGIEval, CLUEWSC, CMRC, and CMath. In addition, we perform language-modeling-based evaluation for Pile-test and use Bits-Per-Byte (BPB) as the metric to guarantee fair comparison among models using different tokenizers.
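For reference, BPB normalizes the model's total negative log-likelihood by the byte length of the text rather than by the tokenizer-dependent token count. A minimal helper is sketched below with purely illustrative numbers.

```python
import math

def bits_per_byte(total_nll_nats: float, total_bytes: int) -> float:
    """Bits-Per-Byte: tokenizer-independent language-modeling metric.
    total_nll_nats: summed negative log-likelihood (natural log) over the corpus.
    total_bytes:    size of the corpus in UTF-8 bytes."""
    return total_nll_nats / (math.log(2) * total_bytes)

# e.g. an average NLL of 0.80 nats/token with 3.2 bytes/token (illustrative values):
print(bits_per_byte(total_nll_nats=0.80 * 1_000_000, total_bytes=3_200_000))  # ~0.361
```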
Benchmark (Metric) | # Shots | DeepSeek-V2 Base | Qwen2.5 72B Base | LLaMA-3.1 405B Base | DeepSeek-V3 Base
Architecture | - | MoE | Dense | Dense | MoE
# Activated Params | - | 21B | 72B | 405B | 37B
# Total Params | - | 236B | 72B | 405B | 671B
English:
Pile-test (BPB) | - | 0.606 | 0.638 | 0.542 | 0.548
BBH (EM) | 3-shot | 78.8 | 79.8 | 82.9 | 87.5
MMLU (EM) | 5-shot | 78.4 | 85.0 | 84.4 | 87.1
MMLU-Redux (EM) | 5-shot | 75.6 | 83.2 | 81.3 | 86.2
MMLU-Pro (EM) | 5-shot | 51.4 | 58.3 | 52.8 | 64.4
DROP (F1) | 3-shot | 80.4 | 80.6 | 86.0 | 89.0
ARC-Easy (EM) | 25-shot | 97.6 | 98.4 | 98.4 | 98.9
ARC-Challenge (EM) | 25-shot | 92.2 | 94.5 | 95.3 | 95.3
HellaSwag (EM) | 10-shot | 87.1 | 84.8 | 89.2 | 88.9
PIQA (EM) | 0-shot | 83.9 | 82.6 | 85.9 | 84.7
WinoGrande (EM) | 5-shot | 86.3 | 82.3 | 85.2 | 84.9
RACE-Middle (EM) | 5-shot | 73.1 | 68.1 | 74.2 | 67.1
RACE-High (EM) | 5-shot | 52.6 | 50.3 | 56.8 | 51.3
TriviaQA (EM) | 5-shot | 80.0 | 71.9 | 82.7 | 82.9
NaturalQuestions (EM) | 5-shot | 38.6 | 33.2 | 41.5 | 40.0
AGIEval (EM) | 0-shot | 57.5 | 75.8 | 60.6 | 79.6
Code:
HumanEval (Pass@1) | 0-shot | 43.3 | 53.0 | 54.9 | 65.2
MBPP (Pass@1) | 3-shot | 65.0 | 72.6 | 68.4 | 75.4
LiveCodeBench-Base (Pass@1) | 3-shot | 11.6 | 12.9 | 15.5 | 19.4
CRUXEval-I (EM) | 2-shot | 52.5 | 59.1 | 58.5 | 67.3
CRUXEval-O (EM) | 2-shot | 49.8 | 59.9 | 59.9 | 69.8
Math:
GSM8K (EM) | 8-shot | 81.6 | 88.3 | 83.5 | 89.3
MATH (EM) | 4-shot | 43.4 | 54.4 | 49.0 | 61.6
MGSM (EM) | 8-shot | 63.6 | 76.2 | 69.9 | 79.8
CMath (EM) | 3-shot | 78.7 | 84.5 | 77.3 | 90.7
Chinese:
CLUEWSC (EM) | 5-shot | 82.0 | 82.5 | 83.0 | 82.7
C-Eval (EM) | 5-shot | 81.4 | 89.2 | 72.5 | 90.1
CMMLU (EM) | 5-shot | 84.0 | 89.5 | 73.7 | 88.8
CMRC (EM) | 1-shot | 77.4 | 75.8 | 76.0 | 76.3
C3 (EM) | 0-shot | 77.4 | 76.7 | 79.7 | 78.6
CCPM (EM) | 0-shot | 93.0 | 88.5 | 78.6 | 92.0
Multilingual:
MMMLU-non-English (EM) | 5-shot | 64.0 | 74.8 | 73.8 | 79.4

Table 3 | Comparison among DeepSeek-V3-Base and other representative open-source base models. All models are evaluated in our internal framework and share the same evaluation setting. Scores with a gap not exceeding 0.3 are considered to be at the same level. DeepSeek-V3-Base achieves the best performance on most benchmarks, especially on math and code tasks.
4.4.2. Evaluation Results

In Table 3, we compare the base model of DeepSeek-V3 with the state-of-the-art open-source base models, including DeepSeek-V2-Base (DeepSeek-AI, 2024c) (our previous release), Qwen2.5 72B Base (Qwen, 2024b), and LLaMA-3.1 405B Base (AI@Meta, 2024b). We evaluate all these models with our internal evaluation framework, and ensure that they share the same evaluation setting. Note that due to the changes in our evaluation framework over the past months, the performance of DeepSeek-V2-Base exhibits a slight difference from our previously reported results. Overall, DeepSeek-V3-Base comprehensively outperforms DeepSeek-V2-Base and Qwen2.5 72B Base, and surpasses LLaMA-3.1 405B Base on the majority of benchmarks, essentially becoming the strongest open-source model.

From a more detailed perspective, we compare DeepSeek-V3-Base with the other open-source base models individually. (1) Compared with DeepSeek-V2-Base, due to the improvements in our model architecture, the scale-up of the model size and training tokens, and the enhancement of data quality, DeepSeek-V3-Base achieves significantly better performance as expected. (2) Compared with Qwen2.5 72B Base, the state-of-the-art Chinese open-source model, with only half of the activated parameters, DeepSeek-V3-Base also demonstrates remarkable advantages, especially on English, multilingual, code, and math benchmarks. As for Chinese benchmarks, except for CMMLU, a Chinese multi-subject multiple-choice task, DeepSeek-V3-Base also shows better performance than Qwen2.5 72B. (3) Compared with LLaMA-3.1 405B Base, the largest open-source model with 11 times the activated parameters, DeepSeek-V3-Base also exhibits much better performance on multilingual, code, and math benchmarks. As for English and Chinese language benchmarks, DeepSeek-V3-Base shows competitive or better performance, and is especially good on BBH, the MMLU series, DROP, C-Eval, CMMLU, and CCPM.

Due to our efficient architectures and comprehensive engineering optimizations, DeepSeek-V3 achieves extremely high training efficiency. Under our training framework and infrastructures, training DeepSeek-V3 on each trillion tokens requires only 180K H800 GPU hours, which is much cheaper than training 72B or 405B dense models.
Benchmark (Metric) | # Shots | Small MoE Baseline | Small MoE w/ MTP | Large MoE Baseline | Large MoE w/ MTP
# Activated Params (Inference) | - | 2.4B | 2.4B | 20.9B | 20.9B
# Total Params (Inference) | - | 15.7B | 15.7B | 228.7B | 228.7B
# Training Tokens | - | 1.33T | 1.33T | 540B | 540B
Pile-test (BPB) | - | 0.729 | 0.729 | 0.658 | 0.657
BBH (EM) | 3-shot | 39.0 | 41.4 | 70.0 | 70.7
MMLU (EM) | 5-shot | 50.0 | 53.3 | 67.5 | 66.6
DROP (F1) | 1-shot | 39.2 | 41.3 | 68.5 | 70.6
TriviaQA (EM) | 5-shot | 56.9 | 57.7 | 67.0 | 67.3
NaturalQuestions (EM) | 5-shot | 22.7 | 22.3 | 27.2 | 28.5
HumanEval (Pass@1) | 0-shot | 20.7 | 26.8 | 44.5 | 53.7
MBPP (Pass@1) | 3-shot | 35.8 | 36.8 | 61.6 | 62.2
GSM8K (EM) | 8-shot | 25.4 | 31.4 | 72.3 | 74.0
MATH (EM) | 4-shot | 10.7 | 12.6 | 38.6 | 39.8

Table 4 | Ablation results for the MTP strategy. The MTP strategy consistently enhances the model performance on most of the evaluation benchmarks.
4.5. Discussion

4.5.1. Ablation Studies for Multi-Token Prediction

In Table 4, we show the ablation results for the MTP strategy. To be specific, we validate the MTP strategy on top of two baseline models across different scales. At the small scale, we train a baseline MoE model comprising 15.7B total parameters on 1.33T tokens. At the large scale, we train a baseline MoE model comprising 228.7B total parameters on 540B tokens. On top of them, keeping the training data and the other architectures the same, we append a 1-depth MTP module onto them and train two models with the MTP strategy for comparison. Note that during inference, we directly discard the MTP module, so the inference costs of the compared models are exactly the same. From the table, we can observe that the MTP strategy consistently enhances the model performance on most of the evaluation benchmarks.
Benchmark (Metric) | # Shots | Small MoE Aux-Loss-Based | Small MoE Aux-Loss-Free | Large MoE Aux-Loss-Based | Large MoE Aux-Loss-Free
# Activated Params | - | 2.4B | 2.4B | 20.9B | 20.9B
# Total Params | - | 15.7B | 15.7B | 228.7B | 228.7B
# Training Tokens | - | 1.33T | 1.33T | 578B | 578B
Pile-test (BPB) | - | 0.727 | 0.724 | 0.656 | 0.652
BBH (EM) | 3-shot | 37.3 | 39.3 | 66.7 | 67.9
MMLU (EM) | 5-shot | 51.0 | 51.8 | 68.3 | 67.2
DROP (F1) | 1-shot | 38.1 | 39.0 | 67.1 | 67.1
TriviaQA (EM) | 5-shot | 58.3 | 58.5 | 66.7 | 67.7
NaturalQuestions (EM) | 5-shot | 23.2 | 23.4 | 27.1 | 28.1
HumanEval (Pass@1) | 0-shot | 22.0 | 22.6 | 40.2 | 46.3
MBPP (Pass@1) | 3-shot | 36.6 | 35.8 | 59.2 | 61.2
GSM8K (EM) | 8-shot | 27.1 | 29.6 | 70.7 | 74.5
MATH (EM) | 4-shot | 10.9 | 11.1 | 37.2 | 39.6

Table 5 | Ablation results for the auxiliary-loss-free balancing strategy. Compared with the purely auxiliary-loss-based method, the auxiliary-loss-free strategy consistently achieves better model performance on most of the evaluation benchmarks.
4.5.2. Ablation Studies for the Auxiliary-Loss-Free Balancing Strategy

In Table 5, we show the ablation results for the auxiliary-loss-free balancing strategy. We validate this strategy on top of two baseline models across different scales. At the small scale, we train a baseline MoE model comprising 15.7B total parameters on 1.33T tokens. At the large scale, we train a baseline MoE model comprising 228.7B total parameters on 578B tokens. Both of the baseline models purely use auxiliary losses to encourage load balance, and use the sigmoid gating function with top-K affinity normalization. Their hyper-parameters to control the strength of auxiliary losses are the same as DeepSeek-V2-Lite and DeepSeek-V2, respectively. On top of these two baseline models, keeping the training data and the other architectures the same, we remove all auxiliary losses and introduce the auxiliary-loss-free balancing strategy for comparison. From the table, we can observe that the auxiliary-loss-free strategy consistently achieves better model performance on most of the evaluation benchmarks.
4.5.3. Batch-Wise Load Balance VS. Sequence-Wise Load Balance

The key distinction between auxiliary-loss-free balancing and sequence-wise auxiliary loss lies in their balancing scope: batch-wise versus sequence-wise. Compared with the sequence-wise auxiliary loss, batch-wise balancing imposes a more flexible constraint, as it does not enforce in-domain balance on each sequence. This flexibility allows experts to better specialize in different domains. To validate this, we record and analyze the expert load of a 16B auxiliary-loss-based baseline and a 16B auxiliary-loss-free model on different domains in the Pile test set. As illustrated in Figure 9, we observe that the auxiliary-loss-free model demonstrates greater expert specialization patterns as expected.
Figure 9 | Expert load of auxiliary-loss-free and auxiliary-loss-based models on three domains in the Pile test set (Wikipedia (en), Github, and DM Mathematics; layers 9 and 18 shown). The auxiliary-loss-free model shows greater expert specialization patterns than the auxiliary-loss-based one. The relative expert load denotes the ratio between the actual expert load and the theoretically balanced expert load. Due to space constraints, we only present the results of two layers as an example, with the results of all layers provided in Appendix C.

To further investigate the correlation between this flexibility and the advantage in model performance, we additionally design and validate a batch-wise auxiliary loss that encourages load balance on each training batch instead of on each sequence. The experimental results show that, when achieving a similar level of batch-wise load balance, the batch-wise auxiliary loss can also achieve similar model performance to the auxiliary-loss-free method. To be specific, in our experiments with 1B MoE models, the validation losses are: 2.258 (using a sequence-wise auxiliary loss), 2.253 (using the auxiliary-loss-free method), and 2.253 (using a batch-wise auxiliary loss). We also observe similar results on 3B MoE models: the model using a sequence-wise auxiliary loss achieves a validation loss of 2.085, and the models using the auxiliary-loss-free method or a batch-wise auxiliary loss achieve the same validation loss of 2.080.
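The difference in balancing scope can be made concrete with a small PyTorch proxy: sequence-wise balancing penalizes non-uniform expert load inside every individual sequence, whereas batch-wise balancing only constrains the load averaged over the whole batch. The variance-based penalty below is a simplified stand-in, not the exact auxiliary-loss formula used in training.

```python
import torch

def balance_penalty(gate_probs: torch.Tensor, per_sequence: bool) -> torch.Tensor:
    """gate_probs: (batch, seq_len, n_experts) routing probabilities per token.
    A simplified variance-based proxy for the two balancing scopes."""
    if per_sequence:
        load = gate_probs.mean(dim=1)        # expert load inside each sequence: (batch, n_experts)
        return load.var(dim=-1).mean()       # penalizes imbalance within every sequence
    load = gate_probs.mean(dim=(0, 1))       # expert load over the whole batch: (n_experts,)
    return load.var()                        # only the batch-level load is constrained

probs = torch.softmax(torch.randn(8, 512, 64), dim=-1)
print(balance_penalty(probs, per_sequence=True), balance_penalty(probs, per_sequence=False))
```

Under the batch-wise scope, individual sequences may route most tokens to a few domain-specialized experts without incurring any penalty, which is the flexibility discussed above.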
253、 or small batches,and(2)domain-shift-induced load imbalance during infer-ence.The first challenge is naturally addressed by our training framework that uses large-scaleexpert parallelism and data parallelism,which guarantees a large size of each micro-batch.Forthe second challenge,we also design and
254、 implement an efficient inference framework withredundant expert deployment,as described in Section 3.4,to overcome it.5.Post-Training5.1.Supervised Fine-TuningWe curate our instruction-tuning datasets to include 1.5M instances spanning multiple domains,with each domain employing distinct data creat
5. Post-Training

5.1. Supervised Fine-Tuning

We curate our instruction-tuning datasets to include 1.5M instances spanning multiple domains, with each domain employing distinct data creation methods tailored to its specific requirements.

Reasoning Data. For reasoning-related datasets, including those focused on mathematics, code competition problems, and logic puzzles, we generate the data by leveraging an internal DeepSeek-R1 model. Specifically, while the R1-generated data demonstrates strong accuracy, it suffers from issues such as overthinking, poor formatting, and excessive length. Our objective is to balance the high accuracy of R1-generated reasoning data and the clarity and conciseness of regularly formatted reasoning data.

To establish our methodology, we begin by developing an expert model tailored to a specific domain, such as code, mathematics, or general reasoning, using a combined Supervised Fine-Tuning (SFT) and Reinforcement Learning (RL) training pipeline. This expert model serves as a data generator for the final model. The training process involves generating two distinct types of SFT samples for each instance: the first couples the problem with its original response in the format of <problem, original response>, while the second incorporates a system prompt alongside the problem and the R1 response in the format of <system prompt, problem, R1 response>. The system prompt is meticulously designed to include instructions that guide the model toward producing responses enriched with mechanisms for reflection and verification. During the RL phase, the model leverages high-temperature sampling to generate responses that integrate patterns from both the R1-generated and original data, even in the absence of explicit system prompts. After hundreds of RL steps, the intermediate RL model learns to incorporate R1 patterns, thereby enhancing overall performance strategically.

Upon completing the RL training phase, we implement rejection sampling to curate high-quality SFT data for the final model, where the expert models are used as data generation sources. This method ensures that the final training data retains the strengths of DeepSeek-R1 while producing responses that are concise and effective.
Non-Reasoning Data. For non-reasoning data, such as creative writing, role-play, and simple question answering, we utilize DeepSeek-V2.5 to generate responses and enlist human annotators to verify the accuracy and correctness of the data.

SFT Settings. We fine-tune DeepSeek-V3-Base for two epochs using the SFT dataset, using the cosine decay learning rate scheduling that starts at 5 × 10⁻⁶ and gradually decreases to 1 × 10⁻⁶. During training, each single sequence is packed from multiple samples. However, we adopt a sample masking strategy to ensure that these examples remain isolated and mutually invisible.
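The sample masking used with packing can be pictured as a block-diagonal causal attention mask, as in the sketch below. This is only an illustration of the idea; the actual implementation presumably operates on attention metadata rather than dense masks.

```python
import numpy as np

def packed_attention_mask(sample_lengths: list[int]) -> np.ndarray:
    """Block-diagonal causal mask for a sequence packed from several SFT samples,
    so tokens attend only within their own sample."""
    total = sum(sample_lengths)
    mask = np.zeros((total, total), dtype=bool)
    start = 0
    for n in sample_lengths:
        mask[start:start + n, start:start + n] = np.tril(np.ones((n, n), dtype=bool))
        start += n
    return mask

print(packed_attention_mask([3, 2]).astype(int))  # two packed samples of lengths 3 and 2
```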
5.2. Reinforcement Learning

5.2.1. Reward Model

We employ a rule-based Reward Model (RM) and a model-based RM in our RL process.

Rule-Based RM. For questions that can be validated using specific rules, we adopt a rule-based reward system to determine the feedback. For instance, certain math problems have deterministic results, and we require the model to provide the final answer within a designated format (e.g., in a box), allowing us to apply rules to verify the correctness. Similarly, for LeetCode problems, we can utilize a compiler to generate feedback based on test cases. By leveraging rule-based validation wherever possible, we ensure a higher level of reliability, as this approach is resistant to manipulation or exploitation.
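A minimal sketch of such a rule-based check for boxed math answers is shown below. The \boxed{...} convention and the string normalization are assumptions for illustration; the production system presumably handles many more answer formats and equivalences.

```python
import re

def boxed_answer_reward(response: str, ground_truth: str) -> float:
    """Rule-based reward for math questions with deterministic answers:
    extract the final boxed answer and compare it with the reference."""
    matches = re.findall(r"\\boxed\{([^{}]*)\}", response)
    if not matches:
        return 0.0                         # no answer in the required format
    predicted = matches[-1].strip().replace(" ", "")
    return 1.0 if predicted == ground_truth.strip().replace(" ", "") else 0.0

print(boxed_answer_reward(r"... so the result is \boxed{42}.", "42"))  # 1.0
```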
Model-Based RM. For questions with free-form ground-truth answers, we rely on the reward model to determine whether the response matches the expected ground-truth. Conversely, for questions without a definitive ground-truth, such as those involving creative writing, the reward model is tasked with providing feedback based on the question and the corresponding answer as inputs. The reward model is trained from the DeepSeek-V3 SFT checkpoints. To enhance its reliability, we construct preference data that not only provides the final reward but also includes the chain-of-thought leading to the reward. This approach helps mitigate the risk of reward hacking in specific tasks.
5.2.2. Group Relative Policy Optimization

Similar to DeepSeek-V2 (DeepSeek-AI, 2024c), we adopt Group Relative Policy Optimization (GRPO) (Shao et al., 2024), which foregoes the critic model that is typically of the same size as the policy model, and estimates the baseline from group scores instead. Specifically, for each question $q$, GRPO samples a group of outputs $\{o_1, o_2, \cdots, o_G\}$ from the old policy model $\pi_{\theta_{old}}$ and then optimizes the policy model $\pi_{\theta}$ by maximizing the following objective:

\[
\mathcal{J}_{GRPO}(\theta)=\mathbb{E}\big[q \sim P(Q),\,\{o_i\}_{i=1}^{G} \sim \pi_{\theta_{old}}(O\mid q)\big]\;
\frac{1}{G}\sum_{i=1}^{G}\left(\min\!\left(\frac{\pi_{\theta}(o_i\mid q)}{\pi_{\theta_{old}}(o_i\mid q)}A_i,\;
\mathrm{clip}\!\left(\frac{\pi_{\theta}(o_i\mid q)}{\pi_{\theta_{old}}(o_i\mid q)},\,1-\varepsilon,\,1+\varepsilon\right)A_i\right)
-\beta\,\mathbb{D}_{KL}\!\left(\pi_{\theta}\,\|\,\pi_{ref}\right)\right), \tag{26}
\]

\[
\mathbb{D}_{KL}\!\left(\pi_{\theta}\,\|\,\pi_{ref}\right)=\frac{\pi_{ref}(o_i\mid q)}{\pi_{\theta}(o_i\mid q)}-\log\frac{\pi_{ref}(o_i\mid q)}{\pi_{\theta}(o_i\mid q)}-1, \tag{27}
\]

where $\varepsilon$ and $\beta$ are hyper-parameters; $\pi_{ref}$ is the reference model; and $A_i$ is the advantage, derived from the rewards $\{r_1, r_2, \ldots, r_G\}$ corresponding to the outputs within each group:

\[
A_i=\frac{r_i-\mathrm{mean}(\{r_1,r_2,\cdots,r_G\})}{\mathrm{std}(\{r_1,r_2,\cdots,r_G\})}. \tag{28}
\]
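The group-relative advantage of Equation (28) reduces to a simple per-group standardization of the rewards, as sketched below. The zero-variance guard is an added assumption, and whether a population or sample standard deviation is used is not specified here.

```python
import statistics

def group_relative_advantages(rewards: list[float]) -> list[float]:
    """Group-relative advantages as in Eq. (28): normalize each sampled output's
    reward by the mean and standard deviation of its group."""
    mu = statistics.mean(rewards)
    sigma = statistics.pstdev(rewards) or 1.0   # guard against a zero-variance group (assumption)
    return [(r - mu) / sigma for r in rewards]

print(group_relative_advantages([1.0, 0.0, 0.0, 1.0, 1.0]))
```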
We incorporate prompts from diverse domains, such as coding, math, writing, role-playing, and question answering, during the RL process. This approach not only aligns the model more closely with human preferences but also enhances performance on benchmarks, especially in scenarios where available SFT data are limited.

5.3. Evaluations

5.3.1. Evaluation Settings

Evaluation Benchmarks. Apart from the benchmarks we used for base model testing, we further evaluate instructed models on IFEval (Zhou et al., 2023), FRAMES (Krishna et al., 2024), LongBench v2 (Bai et al., 2024), GPQA (Rein et al., 2023), SimpleQA (OpenAI, 2024c), C-SimpleQA (He et al., 2024), SWE-Bench Verified (OpenAI, 2024d), Aider (https://aider.chat), LiveCodeBench (Jain et al., 2024) (questions from August 2024 to November 2024), Codeforces, Chinese National High School Mathematics Olympiad (CNMO 2024), and American Invitational Mathematics Examination 2024 (AIME 2024) (MAA, 2024).

Compared Baselines. We conduct comprehensive evaluations of our chat model against several strong baselines, including DeepSeek-V2-0506, DeepSeek-V2.5-0905, Qwen2.5 72B Instruct, LLaMA-3.1 405B Instruct, Claude-Sonnet-3.5-1022, and GPT-4o-0513. For the DeepSeek-V2 model series, we select the most representative variants for comparison. For closed-source models, evaluations are performed through their respective APIs.
Evaluation Configurations. For standard benchmarks including MMLU, DROP, GPQA, and SimpleQA, we adopt the evaluation prompts from the simple-evals framework. We utilize the Zero-Eval prompt format (Lin, 2024) for MMLU-Redux in a zero-shot setting. For other datasets, we follow their original evaluation protocols with default prompts as provided by the dataset creators. For code and math benchmarks, the HumanEval-Mul dataset includes 8 mainstream programming languages (Python, Java, Cpp, C#, JavaScript, TypeScript, PHP, and Bash) in total. We use CoT and non-CoT methods to evaluate model performance on LiveCodeBench, where the data are collected from August 2024 to November 2024. The Codeforces dataset is measured using the percentage of competitors. SWE-Bench Verified is evaluated using the agentless framework (Xia et al., 2024). We use the "diff" format to evaluate the Aider-related benchmarks. For mathematical assessments, AIME and CNMO 2024 are evaluated with a temperature of 0.7, and the results are averaged over 16 runs, while MATH-500 employs greedy decoding. We allow all models to output a maximum of 8192 tokens for each benchmark.
Benchmark (Metric) | DeepSeek-V2-0506 | DeepSeek-V2.5-0905 | Qwen2.5 72B-Inst. | LLaMA-3.1 405B-Inst. | Claude-3.5-Sonnet-1022 | GPT-4o-0513 | DeepSeek-V3
Architecture | MoE | MoE | Dense | Dense | - | - | MoE
# Activated Params | 21B | 21B | 72B | 405B | - | - | 37B
# Total Params | 236B | 236B | 72B | 405B | - | - | 671B
English:
MMLU (EM) | 78.2 | 80.6 | 85.3 | 88.6 | 88.3 | 87.2 | 88.5
MMLU-Redux (EM) | 77.9 | 80.3 | 85.6 | 86.2 | 88.9 | 88.0 | 89.1
MMLU-Pro (EM) | 58.5 | 66.2 | 71.6 | 73.3 | 78.0 | 72.6 | 75.9
DROP (3-shot F1) | 83.0 | 87.8 | 76.7 | 88.7 | 88.3 | 83.7 | 91.6
IF-Eval (Prompt Strict) | 57.7 | 80.6 | 84.1 | 86.0 | 86.5 | 84.3 | 86.1
GPQA-Diamond (Pass@1) | 35.3 | 41.3 | 49.0 | 51.1 | 65.0 | 49.9 | 59.1
SimpleQA (Correct) | 9.0 | 10.2 | 9.1 | 17.1 | 28.4 | 38.2 | 24.9
FRAMES (Acc.) | 66.9 | 65.4 | 69.8 | 70.0 | 72.5 | 80.5 | 73.3
LongBench v2 (Acc.) | 31.6 | 35.4 | 39.4 | 36.1 | 41.0 | 48.1 | 48.7
Code:
HumanEval-Mul (Pass@1) | 69.3 | 77.4 | 77.3 | 77.2 | 81.7 | 80.5 | 82.6
LiveCodeBench (Pass@1-COT) | 18.8 | 29.2 | 31.1 | 28.4 | 36.3 | 33.4 | 40.5
LiveCodeBench (Pass@1) | 20.3 | 28.4 | 28.7 | 30.1 | 32.8 | 34.2 | 37.6
Codeforces (Percentile) | 17.5 | 35.6 | 24.8 | 25.3 | 20.3 | 23.6 | 51.6
SWE Verified (Resolved) | - | 22.6 | 23.8 | 24.5 | 50.8 | 38.8 | 42.0
Aider-Edit (Acc.) | 60.3 | 71.6 | 65.4 | 63.9 | 84.2 | 72.9 | 79.7
Aider-Polyglot (Acc.) | - | 18.2 | 7.6 | 5.8 | 45.3 | 16.0 | 49.6
Math:
AIME 2024 (Pass@1) | 4.6 | 16.7 | 23.3 | 23.3 | 16.0 | 9.3 | 39.2
MATH-500 (EM) | 56.3 | 74.7 | 80.0 | 73.8 | 78.3 | 74.6 | 90.2
CNMO 2024 (Pass@1) | 2.8 | 10.8 | 15.9 | 6.8 | 13.1 | 10.8 | 43.2
Chinese:
CLUEWSC (EM) | 89.9 | 90.4 | 91.4 | 84.7 | 85.4 | 87.9 | 90.9
C-Eval (EM) | 78.6 | 79.5 | 86.1 | 61.5 | 76.7 | 76.0 | 86.5
C-SimpleQA (Correct) | 48.5 | 54.1 | 48.4 | 50.4 | 51.3 | 59.3 | 64.8

Table 6 | Comparison between DeepSeek-V3 and other representative chat models. All models are evaluated in a configuration that limits the output length to 8K. Benchmarks containing fewer than 1000 samples are tested multiple times using varying temperature settings to derive robust final results. DeepSeek-V3 stands as the best-performing open-source model, and also exhibits competitive performance against frontier closed-source models.
5.3.2. Standard Evaluation

Table 6 presents the evaluation results, showcasing that DeepSeek-V3 stands as the best-performing open-source model. Additionally, it is competitive against frontier closed-source models like GPT-4o and Claude-3.5-Sonnet.

English Benchmarks. MMLU is a widely recognized benchmark designed to assess the performance of large language models across diverse knowledge domains and tasks. DeepSeek-V3 demonstrates competitive performance, standing on par with top-tier models such as LLaMA-3.1-405B, GPT-4o, and Claude-Sonnet 3.5, while significantly outperforming Qwen2.5 72B. Moreover, DeepSeek-V3 excels in MMLU-Pro, a more challenging educational knowledge benchmark, where it closely trails Claude-Sonnet 3.5. On MMLU-Redux, a refined version of MMLU with corrected labels, DeepSeek-V3 surpasses its peers. In addition, on GPQA-Diamond, a PhD-level evaluation testbed, DeepSeek-V3 achieves remarkable results, ranking just behind Claude 3.5 Sonnet and outperforming all other competitors by a substantial margin.

In long-context understanding benchmarks such as DROP, LongBench v2, and FRAMES, DeepSeek-V3 continues to demonstrate its position as a top-tier model. It achieves an impressive 91.6 F1 score in the 3-shot setting on DROP, outperforming all other models in this category. On FRAMES, a benchmark requiring question-answering over 100k token contexts, DeepSeek-V3 closely trails GPT-4o while outperforming all other models by a significant margin. This demonstrates the strong capability of DeepSeek-V3 in handling extremely long-context tasks. The long-context capability of DeepSeek-V3 is further validated by its best-in-class performance on LongBench v2, a dataset that was released just a few weeks before the launch of DeepSeek-V3.

On the factual knowledge benchmark SimpleQA, DeepSeek-V3 falls behind GPT-4o and Claude-Sonnet, primarily due to its design focus and resource allocation. DeepSeek-V3 assigns more training tokens to learn Chinese knowledge, leading to exceptional performance on C-SimpleQA. On the instruction-following benchmark, DeepSeek-V3 significantly outperforms its predecessor, the DeepSeek-V2 series, highlighting its improved ability to understand and adhere to user-defined format constraints.
Code and Math Benchmarks. Coding is a challenging and practical task for LLMs, encompassing engineering-focused tasks like SWE-Bench-Verified and Aider, as well as algorithmic tasks such as HumanEval and LiveCodeBench. In engineering tasks, DeepSeek-V3 trails behind Claude-Sonnet-3.5-1022 but significantly outperforms open-source models. The open-source DeepSeek-V3 is expected to foster advancements in coding-related engineering tasks. By providing access to its robust capabilities, DeepSeek-V3 can drive innovation and improvement in areas such as software engineering and algorithm development, empowering developers and researchers to push the boundaries of what open-source models can achieve in coding tasks. In algorithmic tasks, DeepSeek-V3 demonstrates superior performance, outperforming all baselines on benchmarks like HumanEval-Mul and LiveCodeBench. This success can be attributed to its advanced knowledge distillation technique, which effectively enhances its code generation and problem-solving capabilities in algorithm-focused tasks.

On math benchmarks, DeepSeek-V3 demonstrates exceptional performance, significantly surpassing baselines and setting a new state-of-the-art for non-o1-like models. Specifically, on AIME, MATH-500, and CNMO 2024, DeepSeek-V3 outperforms the second-best model, Qwen2.5 72B, by approximately 10% in absolute scores, which is a substantial margin for such challenging benchmarks. This remarkable capability highlights the effectiveness of the distillation technique from DeepSeek-R1, which has been proven highly beneficial for non-o1-like models.
Model | Arena-Hard | AlpacaEval 2.0
DeepSeek-V2.5-0905 | 76.2 | 50.5
Qwen2.5-72B-Instruct | 81.2 | 49.1
LLaMA-3.1 405B | 69.3 | 40.5
GPT-4o-0513 | 80.4 | 51.1
Claude-Sonnet-3.5-1022 | 85.2 | 52.0
DeepSeek-V3 | 85.5 | 70.0

Table 7 | English open-ended conversation evaluations. For AlpacaEval 2.0, we use the length-controlled win rate as the metric.

Chinese Benchmarks. Qwen and DeepSeek are two representative model series with robust support for both Chinese and English. On the factual benchmark Chinese SimpleQA, DeepSeek-V3 surpasses Qwen2.5-72B by 16.4 points, despite Qwen2.5 being trained on a larger corpus comprising 18T tokens, which are 20% more than the 14.8T tokens that DeepSeek-V3 is pre-trained on.

On C-Eval, a representative benchmark for Chinese educational knowledge evaluation, and CLUEWSC (Chinese Winograd Schema Challenge), DeepSeek-V3 and Qwen2.5-72B exhibit similar performance levels, indicating that both models are well-optimized for challenging Chinese-language reasoning and educational tasks.
5.3.3. Open-Ended Evaluation

In addition to standard benchmarks, we also evaluate our models on open-ended generation tasks using LLMs as judges, with the results shown in Table 7. Specifically, we adhere to the original configurations of AlpacaEval 2.0 (Dubois et al., 2024) and Arena-Hard (Li et al., 2024a), which leverage GPT-4-Turbo-1106 as judges for pairwise comparisons. On Arena-Hard, DeepSeek-V3 achieves an impressive win rate of over 86% against the baseline GPT-4-0314, performing on par with top-tier models like Claude-Sonnet-3.5-1022. This underscores the robust capabilities of DeepSeek-V3, especially in dealing with complex prompts, including coding and debugging tasks. Furthermore, DeepSeek-V3 achieves a groundbreaking milestone as the first open-source model to surpass 85% on the Arena-Hard benchmark. This achievement significantly bridges the performance gap between open-source and closed-source models, setting a new standard for what open-source models can accomplish in challenging domains.

Similarly, DeepSeek-V3 showcases exceptional performance on AlpacaEval 2.0, outperforming both closed-source and open-source models. This demonstrates its outstanding proficiency in writing tasks and handling straightforward question-answering scenarios. Notably, it surpasses DeepSeek-V2.5-0905 by a significant margin of 20%, highlighting substantial improvements in tackling simple tasks and showcasing the effectiveness of its advancements.
5.3.4. DeepSeek-V3 as a Generative Reward Model

We compare the judgment ability of DeepSeek-V3 with state-of-the-art models, namely GPT-4o and Claude-3.5. Table 8 presents the performance of these models on RewardBench (Lambert et al., 2024). DeepSeek-V3 achieves performance on par with the best versions of GPT-4o-0806 and Claude-3.5-Sonnet-1022, while surpassing other versions. Additionally, the judgment ability of DeepSeek-V3 can also be enhanced by the voting technique. Therefore, we employ DeepSeek-V3 along with voting to offer self-feedback on open-ended questions, thereby improving the effectiveness and robustness of the alignment process.

Model | Chat | Chat-Hard | Safety | Reasoning | Average
GPT-4o-0513 | 96.6 | 70.4 | 86.7 | 84.9 | 84.7
GPT-4o-0806 | 96.1 | 76.1 | 88.1 | 86.6 | 86.7
GPT-4o-1120 | 95.8 | 71.3 | 86.2 | 85.2 | 84.6
Claude-3.5-sonnet-0620 | 96.4 | 74.0 | 81.6 | 84.7 | 84.2
Claude-3.5-sonnet-1022 | 96.4 | 79.7 | 91.1 | 87.6 | 88.7
DeepSeek-V3 | 96.9 | 79.8 | 87.0 | 84.3 | 87.0
DeepSeek-V3 (maj@6) | 96.9 | 82.6 | 89.5 | 89.2 | 89.6

Table 8 | Performances of GPT-4o, Claude-3.5-sonnet and DeepSeek-V3 on RewardBench.

Model | LiveCodeBench-CoT Pass@1 | Length | MATH-500 Pass@1 | Length
DeepSeek-V2.5 Baseline | 31.1 | 718 | 74.6 | 769
DeepSeek-V2.5 + R1 Distill | 37.4 | 783 | 83.2 | 1510

Table 9 | The contribution of distillation from DeepSeek-R1. The evaluation settings of LiveCodeBench and MATH-500 are the same as in Table 6.
5.4. Discussion

5.4.1. Distillation from DeepSeek-R1

We ablate the contribution of distillation from DeepSeek-R1 based on DeepSeek-V2.5. The baseline is trained on short CoT data, whereas its competitor uses data generated by the expert checkpoints described above.

Table 9 demonstrates the effectiveness of the distillation data, showing significant improvements on both the LiveCodeBench and MATH-500 benchmarks. Our experiments reveal an interesting trade-off: the distillation leads to better performance but also substantially increases the average response length. To maintain a balance between model accuracy and computational efficiency, we carefully selected optimal settings for DeepSeek-V3 in distillation.

Our research suggests that knowledge distillation from reasoning models presents a promising direction for post-training optimization. While our current work focuses on distilling data from the mathematics and coding domains, this approach shows potential for broader applications across various task domains. The effectiveness demonstrated in these specific areas indicates that long-CoT distillation could be valuable for enhancing model performance in other cognitive tasks requiring complex reasoning. Further exploration of this approach across different domains remains an important direction for future research.
5.4.2. Self-Rewarding

Rewards play a pivotal role in RL, steering the optimization process. In domains where verification through external tools is straightforward, such as some coding or mathematics scenarios, RL demonstrates exceptional efficacy. However, in more general scenarios, constructing a feedback mechanism through hard coding is impractical. During the development of DeepSeek-V3, for these broader contexts, we employ the constitutional AI approach (Bai et al., 2022), leveraging the voting evaluation results of DeepSeek-V3 itself as a feedback source. This method has produced notable alignment effects, significantly enhancing the performance of DeepSeek-V3 in subjective evaluations. By integrating additional constitutional inputs, DeepSeek-V3 can optimize towards the constitutional direction. We believe that this paradigm, which combines supplementary information with LLMs as a feedback source, is of paramount importance. The LLM serves as a versatile processor capable of transforming unstructured information from diverse scenarios into rewards, ultimately facilitating the self-improvement of LLMs. Beyond self-rewarding, we are also dedicated to uncovering other general and scalable rewarding methods to consistently advance the model capabilities in general scenarios.
5.4.3. Multi-Token Prediction Evaluation

Instead of predicting just the next single token, DeepSeek-V3 predicts the next 2 tokens through the MTP technique. Combined with the framework of speculative decoding (Leviathan et al., 2023; Xia et al., 2023), it can significantly accelerate the decoding speed of the model. A natural question arises concerning the acceptance rate of the additionally predicted token. Based on our evaluation, the acceptance rate of the second token prediction ranges between 85% and 90% across various generation topics, demonstrating consistent reliability. This high acceptance rate enables DeepSeek-V3 to achieve a significantly improved decoding speed, delivering 1.8 times TPS (Tokens Per Second).
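Under the simplifying assumption that verification overhead is negligible, the acceptance rate translates into expected tokens per decoding step as sketched below, which is consistent in magnitude with the reported 1.8x TPS improvement once the extra cost of the MTP head is accounted for.

```python
def expected_tokens_per_step(acceptance_rate: float) -> float:
    """With a single MTP draft token, each decoding step emits the next token plus,
    with probability p, the accepted draft token (idealized, zero-overhead model)."""
    return 1.0 + acceptance_rate

for p in (0.85, 0.90):
    print(f"acceptance {p:.0%}: ~{expected_tokens_per_step(p):.2f}x tokens per step")
```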
6. Conclusion, Limitations, and Future Directions

In this paper, we introduce DeepSeek-V3, a large MoE language model with 671B total parameters and 37B activated parameters, trained on 14.8T tokens. In addition to the MLA and DeepSeekMoE architectures, it also pioneers an auxiliary-loss-free strategy for load balancing and sets a multi-token prediction training objective for stronger performance. The training of DeepSeek-V3 is cost-effective due to the support of FP8 training and meticulous engineering optimizations. The post-training also succeeds in distilling the reasoning capability from the DeepSeek-R1 series of models. Comprehensive evaluations demonstrate that DeepSeek-V3 has emerged as the strongest open-source model currently available, and achieves performance comparable to leading closed-source models like GPT-4o and Claude-3.5-Sonnet. Despite its strong performance, it also maintains economical training costs. It requires only 2.788M H800 GPU hours for its full training, including pre-training, context length extension, and post-training.

While acknowledging its strong performance and cost-effectiveness, we also recognize that DeepSeek-V3 has some limitations, especially regarding deployment. Firstly, to ensure efficient inference, the recommended deployment unit for DeepSeek-V3 is relatively large, which might pose a burden for small-sized teams. Secondly, although our deployment strategy for DeepSeek-V3 has achieved an end-to-end generation speed of more than two times that of DeepSeek-V2, there still remains potential for further enhancement. Fortunately, these limitations are expected to be naturally addressed with the development of more advanced hardware.

DeepSeek consistently adheres to the route of open-source models with longtermism, aiming to steadily approach the ultimate goal of AGI (Artificial General Intelligence). In the future, we plan to strategically invest in research across the following directions.
- We will consistently study and refine our model architectures, aiming to further improve both the training and inference efficiency, striving to approach efficient support for infinite context length. Additionally, we will try to break through the architectural limitations of the Transformer, thereby pushing the boundaries of its modeling capabilities.
- We will continuously iterate on the quantity and quality of our training data, and explore the incorporation of additional training signal sources, aiming to drive data scaling across a more comprehensive range of dimensions.
- We will consistently explore and iterate on the deep thinking capabilities of our models, aiming to enhance their intelligence and problem-solving abilities by expanding their reasoning length and depth.
- We will explore more comprehensive and multi-dimensional model evaluation methods to prevent the tendency towards optimizing a fixed set of benchmarks during research, which may create a misleading impression of the model capabilities and affect our foundational assessment.
References

AI@Meta. Llama 3 model card, 2024a.
AI@Meta. Llama 3.1 model card, 2024b.
Anthropic. Claude 3.5 sonnet, 2024.
J. Austin et al. Program synthesis with large language models. arXiv preprint arXiv:2108.07732, 2021.
Y. Bai, S. Kadavath, S. Kundu, A. Askell, J. Kernion, A. Jones, A. Chen, A. Goldie, A. Mirhoseini, C. McKinnon, et al. Constitutional AI: Harmlessness from AI feedback. arXiv preprint arXiv:2212.08073, 2022.
Y. Bai, S. Tu, J. Zhang, H. Peng, X. Wang, X. Lv, S. Cao, J. Xu, L. Hou, Y. Dong, J. Tang, and J. Li. LongBench v2: Towards deeper understanding and reasoning on realistic long-context multitasks. arXiv preprint arXiv:2412.15204, 2024.
M. Bauer, S. Treichler, and A. Aiken. Singe: leveraging warp specialization for high performance on GPUs. In Proceedings of the 19th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, PPoPP '14, pages 119-130, New York, NY, USA, 2014. Association for Computing Machinery. ISBN 9781450326568. doi: 10.1145/2555243.2555258. URL https://doi.org/10.1145/2555243.2555258.
Y. Bisk, R. Zellers, R. L. Bras, J. Gao, and Y. Choi. PIQA: reasoning about physical commonsense in natural language. In The Thirty-Fourth AAAI Conference on Artificial Intelligence, AAAI 2020, The Thirty-Second Innovative Applications of Artificial Intelligence Conference, IAAI 2020, The Tenth AAAI Symposium on Educational Advances in Artificial Intelligence, EAAI 2020, New York, NY, USA, February 7-12, 2020, pages 7432-7439. AAAI Press, 2020. doi: 10.1609/aaai.v34i05.6239. URL https://doi.org/10.1609/aaai.v34i05.6239.
M. Chen,