DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning

DeepSeek-AI

Abstract

We introduce our first-generation reasoning models, DeepSeek-R1-Zero and DeepSeek-R1. DeepSeek-R1-Zero, a model trained via large-scale reinforcement learning (RL) without supervised fine-tuning (SFT) as a preliminary step, demonstrates remarkable reasoning capabilities. Through RL, DeepSeek-R1-Zero naturally emerges with numerous powerful and intriguing reasoning behaviors. However, it encounters challenges such as poor readability and language mixing. To address these issues and further enhance reasoning performance, we introduce DeepSeek-R1, which incorporates multi-stage training and cold-start data before RL. DeepSeek-R1 achieves performance comparable to OpenAI-o1-1217 on reasoning tasks. To support the research community, we open-source DeepSeek-R1-Zero, DeepSeek-R1, and six dense models (1.5B, 7B, 8B, 14B, 32B, 70B) distilled from DeepSeek-R1 based on Qwen and Llama.

Figure 1 | Benchmark performance of DeepSeek-R1. Accuracy/Percentile (%):

                   AIME 2024  Codeforces  GPQA Diamond  MATH-500  MMLU     SWE-bench Verified
                   (Pass@1)   (Percentile)  (Pass@1)    (Pass@1)  (Pass@1) (Resolved)
DeepSeek-R1          79.8       96.3         71.5         97.3      90.8     49.2
OpenAI-o1-1217       79.2       96.6         75.7         96.4      91.8     48.9
DeepSeek-R1-32B      72.6       90.6         62.1         94.3      87.4     36.8
OpenAI-o1-mini       63.6       93.4         60.0         90.0      85.2     41.6
DeepSeek-V3          39.2       58.7         59.1         90.2      88.5     42.0

Contents

1 Introduction
  1.1 Contributions
  1.2 Summary of Evaluation Results
2 Approach
  2.1 Overview
  2.2 DeepSeek-R1-Zero: Reinforcement Learning on the Base Model
    2.2.1 Reinforcement Learning Algorithm
    2.2.2 Reward Modeling
    2.2.3 Training Template
    2.2.4 Performance, Self-evolution Process and Aha Moment of DeepSeek-R1-Zero
  2.3 DeepSeek-R1: Reinforcement Learning with Cold Start
    2.3.1 Cold Start
    2.3.2 Reasoning-oriented Reinforcement Learning
    2.3.3 Rejection Sampling and