DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning

DeepSeek-AI

Abstract

We introduce our first-generation reasoning models, DeepSeek-R1-Zero and DeepSeek-R1. DeepSeek-R1-Zero, a model trained via large-scale reinforcement learning (RL) without supervised fine-tuning (SFT) as a preliminary step, demonstrates remarkable reasoning capabilities. Through RL, DeepSeek-R1-Zero naturally emerges with numerous powerful and intriguing reasoning behaviors. However, it encounters challenges such as poor readability and language mixing. To address these issues and further enhance reasoning performance, we introduce DeepSeek-R1, which incorporates multi-stage training and cold-start data before RL. DeepSeek-R1 achieves performance comparable to OpenAI-o1-1217 on reasoning tasks. To support the research community, we open-source DeepSeek-R1-Zero, DeepSeek-R1, and six dense models (1.5B, 7B, 8B, 14B, 32B, 70B) distilled from DeepSeek-R1 based on Qwen and Llama.

Figure 1 | Benchmark performance of DeepSeek-R1. The figure compares DeepSeek-R1, OpenAI-o1-1217, DeepSeek-R1-32B, OpenAI-o1-mini, and DeepSeek-V3 on AIME 2024 (Pass@1), Codeforces (Percentile), GPQA Diamond (Pass@1), MATH-500 (Pass@1), MMLU (Pass@1), and SWE-bench Verified (Resolved), reporting accuracy/percentile (%).

Contents

1 Introduction
 1.1 Contributions
 1.2 Summary of Evaluation Results
2 Approach
 2.1 Overview
 2.2 DeepSeek-R1-Zero: Reinforcement Learning on the Base Model
  2.2.1 Reinforcement Learning Algorithm
  2.2.2 Reward Modeling
  2.2.3 Training Template
  2.2.4 Performance, Self-evolution Process and Aha Moment of DeepSeek-R1-Zero
 2.3 DeepSeek-R1: Reinforcement Learning with Cold Start
  2.3.1 Cold Start
  2.3.2 Reasoning-oriented Reinforcement Learning
  2.3.3 Rejection Sampling and Supervised Fine-Tuning
  2.3.4 Reinforcement Learning for all Scenarios
 2.4 Distillation: Empower Small Models with Reasoning Capability
3 Experiment
 3.1 DeepSeek-R1 Evaluation
 3.2 Distilled Model Evaluation
4 Discussion
 4.1 Distillation v.s. Reinforcement Learning
 4.2 Unsuccessful Attempts
5 Conclusion, Limitations, and Future Work
A Contributions and Acknowledgments

1. Introduction

In recent years, Large Language Models (LLMs) have been undergoing rapid iteration and evolution (Anthropic, 2024; Google, 2024; OpenAI, 2024a), progressively diminishing the gap towards Artificial General Intelligence (AGI).
Recently, post-training has emerged as an important component of the full training pipeline. It has been shown to enhance accuracy on reasoning tasks, align with social values, and adapt to user preferences, all while requiring relatively minimal computational resources compared to pre-training. In the context of reasoning capabilities, OpenAI's o1 (OpenAI, 2024b) series models were the first to introduce inference-time scaling by increasing the length of the Chain-of-Thought reasoning process. This approach has achieved significant improvements in various reasoning tasks, such as mathematics, coding, and scientific reasoning. However, the challenge of effective test-time scaling remains an open question for the research community. Several prior works have explored various approaches, including process-based reward models (Lightman et al., 2023; Uesato et al., 2022; Wang et al., 2023), reinforcement learning (Kumar et al., 2024), and search algorithms such as Monte Carlo Tree Search and Beam Search (Feng et al., 2024; Trinh et al., 2024; Xin et al., 2024). However, none of these methods has achieved general reasoning performance comparable to OpenAI's o1 series models.
In this paper, we take the first step toward improving language model reasoning capabilities using pure reinforcement learning (RL). Our goal is to explore the potential of LLMs to develop reasoning capabilities without any supervised data, focusing on their self-evolution through a pure RL process. Specifically, we use DeepSeek-V3-Base as the base model and employ GRPO (Shao et al., 2024) as the RL framework to improve model performance in reasoning. During training, DeepSeek-R1-Zero naturally emerged with numerous powerful and interesting reasoning behaviors. After thousands of RL steps, DeepSeek-R1-Zero exhibits superior performance on reasoning benchmarks. For instance, the pass@1 score on AIME 2024 increases from 15.6% to 71.0%, and with majority voting, the score further improves to 86.7%, matching the performance of OpenAI-o1-0912.

However, DeepSeek-R1-Zero encounters challenges such as poor readability and language mixing. To address these issues and further enhance reasoning performance, we introduce DeepSeek-R1, which incorporates a small amount of cold-start data and a multi-stage training pipeline. Specifically, we begin by collecting thousands of cold-start examples to fine-tune the DeepSeek-V3-Base model. Following this, we perform reasoning-oriented RL as in DeepSeek-R1-Zero. Upon nearing convergence in the RL process, we create new SFT data through rejection sampling on the RL checkpoint, combined with supervised data from DeepSeek-V3 in domains such as writing, factual QA, and self-cognition, and then retrain the DeepSeek-V3-Base model. After fine-tuning with the new data, the checkpoint undergoes an additional RL process, taking into account prompts from all scenarios. After these steps, we obtain a checkpoint referred to as DeepSeek-R1, which achieves performance on par with OpenAI-o1-1217.

We further explore distillation from DeepSeek-R1 to smaller dense models. Using Qwen2.5-32B (Qwen, 2024b) as the base model, direct distillation from DeepSeek-R1 outperforms applying RL on it. This demonstrates that the reasoning patterns discovered by larger base models are crucial for improving reasoning capabilities. We open-source the distilled Qwen and Llama (Dubey et al., 2024) series. Notably, our distilled 14B model outperforms the state-of-the-art open-source QwQ-32B-Preview (Qwen, 2024a) by a large margin, and the distilled 32B and 70B models set a new record on the reasoning benchmarks among dense models.

1.1. Contributions

Post-Training: Large-Scale Reinforcement Learning on the Base Model
• We directly apply RL to the base model without relying on supervised fine-tuning (SFT) as a preliminary step. This approach allows the model to explore chain-of-thought (CoT) for solving complex problems, resulting in the development of DeepSeek-R1-Zero. DeepSeek-R1-Zero demonstrates capabilities such as self-verification, reflection, and generating long CoTs, marking a significant milestone for the research community. Notably, it is the first open research to validate that reasoning capabilities of LLMs can be incentivized purely through RL, without the need for SFT. This breakthrough paves the way for future advancements in this area.
• We introduce our pipeline to develop DeepSeek-R1. The pipeline incorporates two RL stages aimed at discovering improved reasoning patterns and aligning with human preferences, as well as two SFT stages that serve as the seed for the model's reasoning and non-reasoning capabilities. We believe the pipeline will benefit the industry by creating better models.

Distillation: Smaller Models Can Be Powerful Too

• We demonstrate that the reasoning patterns of larger models can be distilled into smaller models, resulting in better performance compared to the reasoning patterns discovered through RL on small models. The open-source DeepSeek-R1, as well as its API, will benefit the research community in distilling better smaller models in the future.
• Using the reasoning data generated by DeepSeek-R1, we fine-tuned several dense models that are widely used in the research community. The evaluation results demonstrate that the distilled smaller dense models perform exceptionally well on benchmarks. DeepSeek-R1-Distill-Qwen-7B achieves 55.5% on AIME 2024, surpassing QwQ-32B-Preview. Additionally, DeepSeek-R1-Distill-Qwen-32B scores 72.6% on AIME 2024, 94.3% on MATH-500, and 57.2% on LiveCodeBench. These results significantly outperform previous open-source models and are comparable to o1-mini. We open-source distilled 1.5B, 7B, 8B, 14B, 32B, and 70B checkpoints based on the Qwen2.5 and Llama3 series to the community.

1.2. Summary of Evaluation Results
• Reasoning tasks: (1) DeepSeek-R1 achieves a score of 79.8% Pass@1 on AIME 2024, slightly surpassing OpenAI-o1-1217. On MATH-500, it attains an impressive score of 97.3%, performing on par with OpenAI-o1-1217 and significantly outperforming other models. (2) On coding-related tasks, DeepSeek-R1 demonstrates expert-level performance in code competition tasks, achieving a 2,029 Elo rating on Codeforces and outperforming 96.3% of human participants in the competition. For engineering-related tasks, DeepSeek-R1 performs slightly better than DeepSeek-V3, which could help developers in real-world tasks.
• Knowledge: On benchmarks such as MMLU, MMLU-Pro, and GPQA Diamond, DeepSeek-R1 achieves outstanding results, significantly outperforming DeepSeek-V3 with scores of 90.8% on MMLU, 84.0% on MMLU-Pro, and 71.5% on GPQA Diamond. While its performance is slightly below that of OpenAI-o1-1217 on these benchmarks, DeepSeek-R1 surpasses other closed-source models, demonstrating its competitive edge in educational tasks. On the factual benchmark SimpleQA, DeepSeek-R1 outperforms DeepSeek-V3, demonstrating its capability in handling fact-based queries. A similar trend is observed where OpenAI-o1 surpasses GPT-4o on this benchmark.
• Others: DeepSeek-R1 also excels in a wide range of tasks, including creative writing, general question answering, editing, summarization, and more. It achieves an impressive length-controlled win-rate of 87.6% on AlpacaEval 2.0 and a win-rate of 92.3% on ArenaHard, showcasing its strong ability to intelligently handle non-exam-oriented queries. Additionally, DeepSeek-R1 demonstrates outstanding performance on tasks requiring long-context understanding, substantially outperforming DeepSeek-V3 on long-context benchmarks.

2. Approach

2.1. Overview
Previous work has heavily relied on large amounts of supervised data to enhance model performance. In this study, we demonstrate that reasoning capabilities can be significantly improved through large-scale reinforcement learning (RL), even without using supervised fine-tuning (SFT) as a cold start. Furthermore, performance can be further enhanced with the inclusion of a small amount of cold-start data. In the following sections, we present: (1) DeepSeek-R1-Zero, which applies RL directly to the base model without any SFT data; (2) DeepSeek-R1, which applies RL starting from a checkpoint fine-tuned with thousands of long Chain-of-Thought (CoT) examples; and (3) distillation of the reasoning capability from DeepSeek-R1 to small dense models.

2.2. DeepSeek-R1-Zero: Reinforcement Learning on the Base Model

Reinforcement learning has demonstrated significant effectiveness in reasoning tasks, as evidenced by our previous works (Shao et al., 2024; Wang et al., 2023). However, these works heavily depended on supervised data, which are time-intensive to gather. In this section, we explore the potential of LLMs to develop reasoning capabilities without any supervised data, focusing on their self-evolution through a pure reinforcement learning process. We start with a brief overview of our RL algorithm, followed by the presentation of some exciting results, and hope this provides the community with valuable insights.

2.2.1. Reinforcement Learning Algorithm
Group Relative Policy Optimization. In order to save the training costs of RL, we adopt Group Relative Policy Optimization (GRPO) (Shao et al., 2024), which foregoes the critic model that is typically the same size as the policy model, and estimates the baseline from group scores instead. Specifically, for each question q, GRPO samples a group of outputs {o_1, o_2, ..., o_G} from the old policy \pi_{\theta_{old}} and then optimizes the policy model \pi_\theta by maximizing the following objective:

$$\mathcal{J}_{GRPO}(\theta) = \mathbb{E}\left[q \sim P(Q),\ \{o_i\}_{i=1}^{G} \sim \pi_{\theta_{old}}(O \mid q)\right] \frac{1}{G}\sum_{i=1}^{G}\left(\min\left(\frac{\pi_\theta(o_i \mid q)}{\pi_{\theta_{old}}(o_i \mid q)} A_i,\ \mathrm{clip}\left(\frac{\pi_\theta(o_i \mid q)}{\pi_{\theta_{old}}(o_i \mid q)},\ 1-\varepsilon,\ 1+\varepsilon\right) A_i\right) - \beta\, \mathbb{D}_{KL}\left(\pi_\theta \,\|\, \pi_{ref}\right)\right), \tag{1}$$

$$\mathbb{D}_{KL}\left(\pi_\theta \,\|\, \pi_{ref}\right) = \frac{\pi_{ref}(o_i \mid q)}{\pi_\theta(o_i \mid q)} - \log\frac{\pi_{ref}(o_i \mid q)}{\pi_\theta(o_i \mid q)} - 1, \tag{2}$$

where \varepsilon and \beta are hyper-parameters, and A_i is the advantage, computed using a group of rewards {r_1, r_2, ..., r_G} corresponding to the outputs within each group:

$$A_i = \frac{r_i - \mathrm{mean}(\{r_1, r_2, \ldots, r_G\})}{\mathrm{std}(\{r_1, r_2, \ldots, r_G\})}. \tag{3}$$
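As a rough sketch of how Equations (1)-(3) translate into code, the snippet below computes the group-relative advantages and the clipped surrogate loss for one group of sampled outputs. It assumes sequence-level log-probabilities have already been gathered (the paper's per-token treatment is simplified away), and the function name, tensor shapes, and default hyper-parameter values are illustrative rather than taken from the paper.

```python
import torch

def grpo_loss(logp_new, logp_old, logp_ref, rewards, eps=0.2, beta=0.04):
    """Minimal GRPO sketch for one question.

    logp_new, logp_old, logp_ref: (G,) summed log-probs of each sampled output
    under the current, old, and reference policies.
    rewards: (G,) scalar rewards for the G outputs in this group.
    """
    # Group-relative advantage: normalize rewards within the group (Eq. 3).
    adv = (rewards - rewards.mean()) / (rewards.std() + 1e-8)

    # Probability ratio between the current and old policy, with clipping (Eq. 1).
    ratio = torch.exp(logp_new - logp_old)
    clipped = torch.clamp(ratio, 1.0 - eps, 1.0 + eps)
    surrogate = torch.minimum(ratio * adv, clipped * adv)

    # Unbiased estimator of the KL term toward the reference policy (Eq. 2).
    log_ref_ratio = logp_ref - logp_new
    kl = torch.exp(log_ref_ratio) - log_ref_ratio - 1.0

    # The objective is maximized, so the training loss is its negation.
    return -(surrogate - beta * kl).mean()
```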
A conversation between User and Assistant. The user asks a question, and the Assistant solves it. The assistant first thinks about the reasoning process in the mind and then provides the user with the answer. The reasoning process and answer are enclosed within <think> </think> and <answer> </answer> tags, respectively, i.e., <think> reasoning process here </think> <answer> answer here </answer>. User: prompt. Assistant:

Table 1 | Template for DeepSeek-R1-Zero. prompt will be replaced with the specific reasoning question during training.
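As a concrete illustration of how the Table 1 template is filled in during training, here is a minimal sketch; the constant name and the example question are illustrative only.

```python
R1_ZERO_TEMPLATE = (
    "A conversation between User and Assistant. The user asks a question, and the "
    "Assistant solves it. The assistant first thinks about the reasoning process in "
    "the mind and then provides the user with the answer. The reasoning process and "
    "answer are enclosed within <think> </think> and <answer> </answer> tags, "
    "respectively, i.e., <think> reasoning process here </think> "
    "<answer> answer here </answer>. User: {prompt}. Assistant:"
)

def build_prompt(question: str) -> str:
    # Replace the placeholder with the specific reasoning question.
    return R1_ZERO_TEMPLATE.format(prompt=question)

print(build_prompt("What is the sum of the first 100 positive integers?"))
```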
2.2.2. Reward Modeling

The reward is the source of the training signal, which decides the optimization direction of RL. To train DeepSeek-R1-Zero, we adopt a rule-based reward system that mainly consists of two types of rewards:
• Accuracy rewards: The accuracy reward model evaluates whether the response is correct. For example, in the case of math problems with deterministic results, the model is required to provide the final answer in a specified format (e.g., within a box), enabling reliable rule-based verification of correctness. Similarly, for LeetCode problems, a compiler can be used to generate feedback based on predefined test cases.
• Format rewards: In addition to the accuracy reward model, we employ a format reward model that enforces the model to put its thinking process between <think> and </think> tags.
We do not apply the outcome or process neural reward model in developing DeepSeek-R1-Zero, because we find that the neural reward model may suffer from reward hacking in the large-scale reinforcement learning process, and retraining the reward model needs additional training resources and complicates the whole training pipeline.
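To make the rule-based scheme concrete, the sketch below combines a format check on the <think>/<answer> structure with an exact-match accuracy check against a boxed reference answer. The relative weighting and the helper names are assumptions for illustration, not values reported in the paper.

```python
import re

THINK_ANSWER = re.compile(r"^<think>.*?</think>\s*<answer>.*?</answer>\s*$", re.DOTALL)
BOXED = re.compile(r"\\boxed\{([^{}]*)\}")

def format_reward(response: str) -> float:
    # 1.0 if the response follows the <think> ... </think><answer> ... </answer> layout.
    return 1.0 if THINK_ANSWER.match(response.strip()) else 0.0

def accuracy_reward(response: str, reference: str) -> float:
    # Extract the final boxed answer and compare it with the ground truth.
    match = BOXED.search(response)
    if match is None:
        return 0.0
    return 1.0 if match.group(1).strip() == reference.strip() else 0.0

def rule_based_reward(response: str, reference: str) -> float:
    # Illustrative combination; the paper does not specify relative weights.
    return accuracy_reward(response, reference) + 0.5 * format_reward(response)
```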
2.2.3. Training Template

To train DeepSeek-R1-Zero, we begin by designing a straightforward template that guides the base model to adhere to our specified instructions. As depicted in Table 1, this template requires DeepSeek-R1-Zero to first produce a reasoning process, followed by the final answer. We intentionally limit our constraints to this structural format, avoiding any content-specific biases, such as mandating reflective reasoning or promoting particular problem-solving strategies, to ensure that we can accurately observe the model's natural progression during the RL process.

2.2.4. Performance, Self-evolution Process and Aha Moment of DeepSeek-R1-Zero

Performance of DeepSeek-R1-Zero. Figure 2 depicts the performance trajectory of DeepSeek-R1-Zero on the AIME 2024 benchmark throughout the RL training process. As illustrated, DeepSeek-R1-Zero demonstrates a steady and consistent enhancement in performance as the RL training advances. Notably, the average pass@1 score on AIME 2024 shows a significant increase, jumping from an initial 15.6% to an impressive 71.0%, reaching performance levels comparable to OpenAI-o1-0912. This significant improvement highlights the efficacy of our RL algorithm in optimizing the model's performance over time.

Table 2 provides a comparative analysis between DeepSeek-R1-Zero and OpenAI's o1-0912 models across a variety of reasoning-related benchmarks.

Model             | AIME 2024 (pass@1) | AIME 2024 (cons@64) | MATH-500 (pass@1) | GPQA Diamond (pass@1) | LiveCodeBench (pass@1) | CodeForces (rating)
OpenAI-o1-mini    | 63.6 | 80.0 | 90.0 | 60.0 | 53.8 | 1820
OpenAI-o1-0912    | 74.4 | 83.3 | 94.8 | 77.3 | 63.4 | 1843
DeepSeek-R1-Zero  | 71.0 | 86.7 | 95.9 | 73.3 | 50.0 | 1444

Table 2 | Comparison of DeepSeek-R1-Zero and OpenAI o1 models on reasoning-related benchmarks.
Figure 2 | AIME accuracy of DeepSeek-R1-Zero during training. For each question, we sample 16 responses and calculate the overall average accuracy to ensure a stable evaluation.

The findings reveal that RL empowers DeepSeek-R1-Zero to attain robust reasoning capabilities without the need for any supervised fine-tuning data. This is a noteworthy achievement, as it underscores the model's ability to learn and generalize effectively through RL alone. Additionally, the performance of DeepSeek-R1-Zero can be further augmented through the application of majority voting. For example, when majority voting is employed on the AIME benchmark, DeepSeek-R1-Zero's performance escalates from 71.0% to 86.7%, thereby exceeding the performance of OpenAI-o1-0912. The ability of DeepSeek-R1-Zero to achieve such competitive performance, both with and without majority voting, highlights its strong foundational capabilities and its potential for further advancements in reasoning tasks.
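To illustrate the majority-voting (consensus) evaluation mentioned above, the sketch below aggregates the final answers of multiple sampled responses and scores the most frequent one; the answer-extraction helper is an assumption used only for illustration.

```python
from collections import Counter

def extract_answer(response: str) -> str:
    # Illustrative: take whatever appears inside the <answer> tags.
    start, end = response.find("<answer>"), response.find("</answer>")
    if start == -1 or end == -1:
        return ""
    return response[start + len("<answer>"):end].strip()

def majority_vote(responses: list[str], reference: str) -> float:
    # cons@N: pick the most common extracted answer across N samples,
    # then check it against the ground truth (1.0 if correct, else 0.0).
    answers = [a for a in (extract_answer(r) for r in responses) if a]
    if not answers:
        return 0.0
    top_answer, _ = Counter(answers).most_common(1)[0]
    return 1.0 if top_answer == reference else 0.0
```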
Self-evolution Process of DeepSeek-R1-Zero. The self-evolution process of DeepSeek-R1-Zero is a fascinating demonstration of how RL can drive a model to improve its reasoning capabilities autonomously. By initiating RL directly from the base model, we can closely monitor the model's progression without the influence of the supervised fine-tuning stage. This approach provides a clear view of how the model evolves over time, particularly in terms of its ability to handle complex reasoning tasks.

As depicted in Figure 3, the thinking time of DeepSeek-R1-Zero shows consistent improvement throughout the training process. This improvement is not the result of external adjustments but rather an intrinsic development within the model. DeepSeek-R1-Zero naturally acquires the ability to solve increasingly complex reasoning tasks by leveraging extended test-time computation. This computation ranges from generating hundreds to thousands of reasoning tokens, allowing the model to explore and refine its thought processes in greater depth.

Figure 3 | The average response length of DeepSeek-R1-Zero on the training set during the RL process. DeepSeek-R1-Zero naturally learns to solve reasoning tasks with more thinking time.

One of the most remarkable aspects of this self-evolution is the emergence of sophisticated behaviors as the test-time computation increases. Behaviors such as reflection, where the model revisits and reevaluates its previous steps, and the exploration of alternative approaches to problem-solving arise spontaneously. These behaviors are not explicitly programmed but instead emerge as a result of the model's interaction with the reinforcement learning environment. This spontaneous development significantly enhances DeepSeek-R1-Zero's reasoning capabilities, enabling it to tackle more challenging tasks with greater efficiency and accuracy.
Aha Moment of DeepSeek-R1-Zero. A particularly intriguing phenomenon observed during the training of DeepSeek-R1-Zero is the occurrence of an "aha moment". This moment, as illustrated in Table 3, occurs in an intermediate version of the model. During this phase, DeepSeek-R1-Zero learns to allocate more thinking time to a problem by reevaluating its initial approach. This behavior is not only a testament to the model's growing reasoning abilities but also a captivating example of how reinforcement learning can lead to unexpected and sophisticated outcomes.

This moment is not only an "aha moment" for the model but also for the researchers observing its behavior. It underscores the power and beauty of reinforcement learning: rather than explicitly teaching the model how to solve a problem, we simply provide it with the right incentives, and it autonomously develops advanced problem-solving strategies. The "aha moment" serves as a powerful reminder of the potential of RL to unlock new levels of intelligence in artificial systems, paving the way for more autonomous and adaptive models in the future.
Question: If $a > 1$, then the sum of the real solutions of $\sqrt{a - \sqrt{a + x}} = x$ is equal to

Response: To solve the equation $\sqrt{a - \sqrt{a + x}} = x$, let's start by squaring both sides:
$\left(\sqrt{a - \sqrt{a + x}}\right)^2 = x^2 \implies a - \sqrt{a + x} = x^2$.
Rearrange to isolate the inner square root term:
$(a - x^2)^2 = a + x \implies a^2 - 2ax^2 + (x^2)^2 = a + x \implies x^4 - 2ax^2 - x + (a^2 - a) = 0$
...
Wait, wait. Wait. That's an aha moment I can flag here.
Let's reevaluate this step-by-step to identify if the correct sum can be ...
We started with the equation:
$\sqrt{a - \sqrt{a + x}} = x$
First, let's square both sides:
$a - \sqrt{a + x} = x^2 \implies \sqrt{a + x} = a - x^2$
Next, I could square both sides again, treating the equation:
...

Table 3 | An interesting "aha moment" of an intermediate version of DeepSeek-R1-Zero. The model learns to rethink using an anthropomorphic tone. This is also an aha moment for us, allowing us to witness the power and beauty of reinforcement learning.
Drawback of DeepSeek-R1-Zero. Although DeepSeek-R1-Zero exhibits strong reasoning capabilities and autonomously develops unexpected and powerful reasoning behaviors, it faces several issues. For instance, DeepSeek-R1-Zero struggles with challenges like poor readability and language mixing. To make reasoning processes more readable and share them with the open community, we explore DeepSeek-R1, a method that utilizes RL with human-friendly cold-start data.

2.3. DeepSeek-R1: Reinforcement Learning with Cold Start

Inspired by the promising results of DeepSeek-R1-Zero, two natural questions arise: 1) Can reasoning performance be further improved or convergence accelerated by incorporating a small amount of high-quality data as a cold start? 2) How can we train a user-friendly model that not only produces clear and coherent Chains of Thought (CoT) but also demonstrates strong general capabilities? To address these questions, we design a pipeline to train DeepSeek-R1. The pipeline consists of four stages, outlined as follows.

2.3.1. Cold Start
Unlike DeepSeek-R1-Zero, to prevent the early unstable cold start phase of RL training from the base model, for DeepSeek-R1 we construct and collect a small amount of long CoT data to fine-tune the model as the initial RL actor. To collect such data, we have explored several approaches: using few-shot prompting with a long CoT as an example, directly prompting models to generate detailed answers with reflection and verification, gathering DeepSeek-R1-Zero outputs in a readable format, and refining the results through post-processing by human annotators.

In this work, we collect thousands of cold-start examples to fine-tune DeepSeek-V3-Base as the starting point for RL. Compared to DeepSeek-R1-Zero, the advantages of cold start data include:
• Readability: A key limitation of DeepSeek-R1-Zero is that its content is often not suitable for reading. Responses may mix multiple languages or lack markdown formatting to highlight answers for users. In contrast, when creating cold-start data for DeepSeek-R1, we design a readable pattern that includes a summary at the end of each response and filters out responses that are not reader-friendly. Here, we define the output format as |special_token|<reasoning_process>|special_token|<summary>, where the reasoning process is the CoT for the query, and the summary is used to summarize the reasoning results (see the sketch after this list).
• Potential: By carefully designing the pattern for cold-start data with human priors, we observe better performance against DeepSeek-R1-Zero. We believe iterative training is a better way for reasoning models.
2.3.2. Reasoning-oriented Reinforcement Learning

After fine-tuning DeepSeek-V3-Base on the cold start data, we apply the same large-scale reinforcement learning training process as employed in DeepSeek-R1-Zero. This phase focuses on enhancing the model's reasoning capabilities, particularly in reasoning-intensive tasks such as coding, mathematics, science, and logic reasoning, which involve well-defined problems with clear solutions. During the training process, we observe that CoT often exhibits language mixing, particularly when RL prompts involve multiple languages. To mitigate the issue of language mixing, we introduce a language consistency reward during RL training, which is calculated as the proportion of target-language words in the CoT. Although ablation experiments show that such alignment results in a slight degradation in the model's performance, this reward aligns with human preferences, making the output more readable. Finally, we combine the accuracy of reasoning tasks and the reward for language consistency by directly summing them to form the final reward. We then apply RL training on the fine-tuned model until it achieves convergence on reasoning tasks.
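The language consistency reward described above can be sketched as the fraction of CoT words written in the target language, directly summed with the accuracy reward. The simple script-based language check below is an assumption for illustration; real tokenization and language identification would be more involved.

```python
def language_consistency_reward(cot_text: str, target_lang: str = "en") -> float:
    """Proportion of CoT words written in the target language (rough sketch)."""
    words = cot_text.split()
    if not words:
        return 0.0

    def is_target(word: str) -> bool:
        if target_lang == "en":
            # Count a word as English if it is plain ASCII letters (plus hyphen/apostrophe).
            return all(c.isascii() and (c.isalpha() or c in "-'") for c in word)
        if target_lang == "zh":
            # Count a word as Chinese if it contains CJK characters.
            return any("\u4e00" <= c <= "\u9fff" for c in word)
        return False

    return sum(is_target(w) for w in words) / len(words)

def final_reward(accuracy: float, cot_text: str, target_lang: str = "en") -> float:
    # The paper combines the two signals by directly summing them.
    return accuracy + language_consistency_reward(cot_text, target_lang)
```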
2.3.3. Rejection Sampling and Supervised Fine-Tuning

When reasoning-oriented RL converges, we utilize the resulting checkpoint to collect SFT (Supervised Fine-Tuning) data for the subsequent round. Unlike the initial cold-start data, which primarily focuses on reasoning, this stage incorporates data from other domains to enhance the model's capabilities in writing, role-playing, and other general-purpose tasks. Specifically, we generate the data and fine-tune the model as described below.

Reasoning data. We curate reasoning prompts and generate reasoning trajectories by performing rejection sampling from the checkpoint from the above RL training. In the previous stage, we only included data that could be evaluated using rule-based rewards. However, in this stage, we expand the dataset by incorporating additional data, some of which use a generative reward model by feeding the ground-truth and model predictions into DeepSeek-V3 for judgment. Additionally, because the model output is sometimes chaotic and difficult to read, we have filtered out chains of thought with mixed languages, long paragraphs, and code blocks. For each prompt, we sample multiple responses and retain only the correct ones. In total, we collect about 600k reasoning-related training samples.
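The rejection-sampling step above can be pictured as the loop below: sample several candidate responses per prompt, keep only those judged correct, and drop unreadable ones. Here `generate`, `is_correct`, and `is_readable` are hypothetical helpers standing in for the RL checkpoint, the rule-based or DeepSeek-V3-based judge, and the readability filters.

```python
import re
from typing import Callable

def rejection_sample_sft_data(
    prompts: list[str],
    generate: Callable[[str, int], list[str]],   # RL checkpoint: (prompt, n) -> n responses
    is_correct: Callable[[str, str], bool],      # rule-based check or generative judge
    is_readable: Callable[[str], bool],          # drops mixed-language / unreadable CoT
    n_samples: int = 8,
) -> list[dict]:
    dataset = []
    for prompt in prompts:
        for response in generate(prompt, n_samples):
            if is_correct(prompt, response) and is_readable(response):
                dataset.append({"prompt": prompt, "response": response})
    return dataset

def simple_readability_filter(response: str) -> bool:
    # Illustrative filters: no code blocks and no CJK/Latin mixing inside the CoT.
    has_code_block = "```" in response
    has_cjk = re.search(r"[\u4e00-\u9fff]", response) is not None
    has_latin = re.search(r"[A-Za-z]", response) is not None
    return not has_code_block and not (has_cjk and has_latin)
```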
Non-Reasoning data. For non-reasoning data, such as writing, factual QA, self-cognition, and translation, we adopt the DeepSeek-V3 pipeline and reuse portions of the SFT dataset of DeepSeek-V3. For certain non-reasoning tasks, we call DeepSeek-V3 to generate a potential chain-of-thought before answering the question by prompting. However, for simpler queries, such as "hello", we do not provide a CoT in response. In the end, we collected a total of approximately 200k training samples that are unrelated to reasoning.

We fine-tune DeepSeek-V3-Base for two epochs using the above curated dataset of about 800k samples.

2.3.4. Reinforcement Learning for all Scenarios

To further align the model with human preferences, we implement a secondary reinforcement learning stage aimed at improving the model's helpfulness and harmlessness while simultaneously refining its reasoning capabilities. Specifically, we train the model using a combination of reward signals and diverse prompt distributions. For reasoning data, we adhere to the methodology outlined in DeepSeek-R1-Zero, which utilizes rule-based rewards to guide the learning process in math, code, and logical reasoning domains. For general data, we resort to reward models to capture human preferences in complex and nuanced scenarios. We build upon the DeepSeek-V3 pipeline and adopt a similar distribution of preference pairs and training prompts. For helpfulness, we focus exclusively on the final summary, ensuring that the assessment emphasizes the utility and relevance of the response to the user while minimizing interference with the underlying reasoning process. For harmlessness, we evaluate the entire response of the model, including both the reasoning process and the summary, to identify and mitigate any potential risks, biases, or harmful content that may arise during the generation process. Ultimately, the integration of reward signals and diverse data distributions enables us to train a model that excels in reasoning while prioritizing helpfulness and harmlessness.
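A schematic of how the reward signals in this stage could be routed is sketched below; the prompt-type tag, `preference_reward_model`, `harmlessness_reward_model`, and `summary_of` are assumptions introduced only to make the description concrete, not components named in the paper.

```python
from typing import Callable, Optional

def all_scenario_reward(
    prompt_type: str,                      # "reasoning" or "general" (illustrative tag)
    prompt: str,
    response: str,
    reference: Optional[str],
    rule_based_reward: Callable[[str, str], float],
    preference_reward_model: Callable[[str, str], float],
    harmlessness_reward_model: Callable[[str, str], float],
    summary_of: Callable[[str], str],      # extracts the final summary from the response
) -> float:
    if prompt_type == "reasoning" and reference is not None:
        # Math / code / logic: rule-based reward, as in DeepSeek-R1-Zero.
        reward = rule_based_reward(response, reference)
    else:
        # General data: helpfulness judged on the final summary only.
        reward = preference_reward_model(prompt, summary_of(response))
    # Harmlessness is assessed on the whole response, reasoning process included.
    reward += harmlessness_reward_model(prompt, response)
    return reward
```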
2.4. Distillation: Empower Small Models with Reasoning Capability

To equip more efficient smaller models with reasoning capabilities like DeepSeek-R1, we directly fine-tuned open-source models like Qwen (Qwen, 2024b) and Llama (AI@Meta, 2024) using the 800k samples curated with DeepSeek-R1, as detailed in Section 2.3.3. Our findings indicate that this straightforward distillation method significantly enhances the reasoning abilities of smaller models. The base models we use here are Qwen2.5-Math-1.5B, Qwen2.5-Math-7B, Qwen2.5-14B, Qwen2.5-32B, Llama-3.1-8B, and Llama-3.3-70B-Instruct. We select Llama-3.3 because its reasoning capability is slightly better than that of Llama-3.1.

For distilled models, we apply only SFT and do not include an RL stage, even though incorporating RL could substantially boost model performance. Our primary goal here is to demonstrate the effectiveness of the distillation technique, leaving the exploration of the RL stage to the broader research community.
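As a sketch of the SFT-only distillation recipe (no RL stage), the loop below fine-tunes a smaller causal LM on prompt/response pairs generated by the teacher. Model and dataset handling are schematic, and the hyper-parameters are placeholders rather than the values used in the paper.

```python
import torch
from torch.utils.data import DataLoader
from transformers import AutoModelForCausalLM, AutoTokenizer

def distill_sft(base_model_name: str, distill_dataset, epochs: int = 2, lr: float = 1e-5):
    """Plain supervised fine-tuning on teacher-generated {"prompt", "response"} samples."""
    tokenizer = AutoTokenizer.from_pretrained(base_model_name)
    if tokenizer.pad_token is None:
        tokenizer.pad_token = tokenizer.eos_token
    model = AutoModelForCausalLM.from_pretrained(base_model_name)
    model.train()
    optimizer = torch.optim.AdamW(model.parameters(), lr=lr)
    # Keep each batch as a plain list of dicts.
    loader = DataLoader(distill_dataset, batch_size=4, shuffle=True,
                        collate_fn=lambda examples: examples)

    for _ in range(epochs):
        for batch in loader:
            texts = [ex["prompt"] + ex["response"] for ex in batch]
            enc = tokenizer(texts, return_tensors="pt", padding=True, truncation=True)
            # Standard next-token prediction loss over the concatenated sequence.
            out = model(**enc, labels=enc["input_ids"])
            out.loss.backward()
            optimizer.step()
            optimizer.zero_grad()
    return model
```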
3. Experiment

Benchmarks. We evaluate models on MMLU (Hendrycks et al., 2020), MMLU-Redux (Gema et al., 2024), MMLU-Pro (Wang et al., 2024), C-Eval (Huang et al., 2023), CMMLU (Li et al., 2023), IFEval (Zhou et al., 2023), FRAMES (Krishna et al., 2024), GPQA Diamond (Rein et al., 2023), SimpleQA (OpenAI, 2024c), C-SimpleQA (He et al., 2024), SWE-Bench Verified (OpenAI, 2024d), Aider¹, LiveCodeBench (Jain et al., 2024) (2024-08 to 2025-01), Codeforces², the Chinese National High School Mathematics Olympiad (CNMO 2024)³, and the American Invitational Mathematics Examination 2024 (AIME 2024) (MAA, 2024). In addition to standard benchmarks, we also evaluate our models on open-ended generation tasks using LLMs as judges. Specifically, we adhere to the original configurations of AlpacaEval 2.0 (Dubois et al., 2024) and Arena-Hard (Li et al., 2024), which leverage GPT-4-Turbo-1106 as judges for pairwise comparisons. Here, we only feed the final summary to evaluation to avoid length bias. For distilled models, we report representative results on AIME 2024, MATH-500, GPQA Diamond, Codeforces, and LiveCodeBench.

¹ https://aider.chat
² https:/
³ https:/

Evaluation Prompts. Following the setup in DeepSeek-V3, standard benchmarks such as MMLU, DROP, GPQA Diamond, and SimpleQA are evaluated using prompts from the simple-evals framework. For MMLU-Redux, we adopt the Zero-Eval prompt format (Lin, 2024) in a zero-shot setting. For MMLU-Pro, C-Eval and CLUE-WSC, since the original prompts are few-shot, we slightly modify the prompt to the zero-shot setting. The CoT in few-shot may hurt the performance of DeepSeek-R1. Other datasets follow their original evaluation protocols with default prompts provided by their creators. For code and math benchmarks, the HumanEval-Mul dataset covers eight mainstream programming languages (Python, Java, C++, C#, JavaScript, TypeScript, PHP, and Bash). Model performance on LiveCodeBench is evaluated using CoT format, with data collected between August 2024 and January 2025. The Codeforces dataset is evaluated using problems from 10 Div. 2 contests along with expert-crafted test cases, after which the expected ratings and percentages of competitors are calculated. SWE-Bench Verified results are obtained via the agentless framework (Xia et al., 2024). Aider-related benchmarks are measured using a diff format. DeepSeek-R1 outputs are capped at a maximum of 32,768 tokens for each benchmark.

Baselines. We conduct comprehensive evaluations against several strong baselines, including DeepSeek-V3, Claude-Sonnet-3.5-1022, GPT-4o-0513, OpenAI-o1-mini, and OpenAI-o1-1217. Since accessing the OpenAI-o1-1217 API is challenging in mainland China, we report its performance based on official reports. For distilled models, we also compare against the open-source model QwQ-32B-Preview (Qwen, 2024a).

Evaluation Setup. We set the maximum generation length to 32,768 tokens for the models. We found that using greedy decoding to evaluate long-output reasoning models results in higher repetition rates and significant variability across different checkpoints. Therefore, we default to pass@k evaluation (Chen et al., 2021) and report pass@1 using a non-zero temperature. Specifically, we use a sampling temperature of 0.6 and a top-p value of 0.95 to generate k responses (typically between 4 and 64, depending on the test set size) for each question. Pass@1 is then calculated as

$$\text{pass@1} = \frac{1}{k}\sum_{i=1}^{k} p_i,$$

where p_i denotes the correctness of the i-th response. This method provides more reliable performance estimates. For AIME 2024, we also report consensus (majority vote) results (Wang et al., 2022) using 64 samples, denoted as cons@64.
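A small sketch of the pass@1 estimator described above, where `judge` is a hypothetical correctness checker for a single response and the sampling settings simply restate the configuration given in the text:

```python
from typing import Callable

def pass_at_1(responses: list[str], judge: Callable[[str], bool]) -> float:
    """pass@1 = (1/k) * sum_i p_i, averaged over k sampled responses."""
    k = len(responses)
    return sum(1.0 if judge(r) else 0.0 for r in responses) / k

# Illustrative sampling configuration matching the text: temperature 0.6, top-p 0.95,
# up to 32,768 generated tokens, with k between 4 and 64 responses per question.
SAMPLING_CONFIG = {"temperature": 0.6, "top_p": 0.95, "max_new_tokens": 32768}
```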
3.1. DeepSeek-R1 Evaluation

Benchmark (Metric)           | Claude-3.5-Sonnet-1022 | GPT-4o-0513 | DeepSeek-V3 | OpenAI-o1-mini | OpenAI-o1-1217 | DeepSeek-R1
Architecture                 | -    | -    | MoE  | -    | -    | MoE
# Activated Params           | -    | -    | 37B  | -    | -    | 37B
# Total Params               | -    | -    | 671B | -    | -    | 671B
English
MMLU (Pass@1)                | 88.3 | 87.2 | 88.5 | 85.2 | 91.8 | 90.8
MMLU-Redux (EM)              | 88.9 | 88.0 | 89.1 | 86.7 | -    | 92.9
MMLU-Pro (EM)                | 78.0 | 72.6 | 75.9 | 80.3 | -    | 84.0
DROP (3-shot F1)             | 88.3 | 83.7 | 91.6 | 83.9 | 90.2 | 92.2
IF-Eval (Prompt Strict)      | 86.5 | 84.3 | 86.1 | 84.8 | -    | 83.3
GPQA Diamond (Pass@1)        | 65.0 | 49.9 | 59.1 | 60.0 | 75.7 | 71.5
SimpleQA (Correct)           | 28.4 | 38.2 | 24.9 | 7.0  | 47.0 | 30.1
FRAMES (Acc.)                | 72.5 | 80.5 | 73.3 | 76.9 | -    | 82.5
AlpacaEval2.0 (LC-winrate)   | 52.0 | 51.1 | 70.0 | 57.8 | -    | 87.6
ArenaHard (GPT-4-1106)       | 85.2 | 80.4 | 85.5 | 92.0 | -    | 92.3
Code
LiveCodeBench (Pass@1-COT)   | 38.9 | 32.9 | 36.2 | 53.8 | 63.4 | 65.9
Codeforces (Percentile)      | 20.3 | 23.6 | 58.7 | 93.4 | 96.6 | 96.3
Codeforces (Rating)          | 717  | 759  | 1134 | 1820 | 2061 | 2029
SWE Verified (Resolved)      | 50.8 | 38.8 | 42.0 | 41.6 | 48.9 | 49.2
Aider-Polyglot (Acc.)        | 45.3 | 16.0 | 49.6 | 32.9 | 61.7 | 53.3
Math
AIME 2024 (Pass@1)           | 16.0 | 9.3  | 39.2 | 63.6 | 79.2 | 79.8
MATH-500 (Pass@1)            | 78.3 | 74.6 | 90.2 | 90.0 | 96.4 | 97.3
CNMO 2024 (Pass@1)           | 13.1 | 10.8 | 43.2 | 67.6 | -    | 78.8
Chinese
CLUEWSC (EM)                 | 85.4 | 87.9 | 90.9 | 89.9 | -    | 92.8
C-Eval (EM)                  | 76.7 | 76.0 | 86.5 | 68.9 | -    | 91.8
C-SimpleQA (Correct)         | 55.4 | 58.7 | 68.0 | 40.3 | -    | 63.7

Table 4 | Comparison between DeepSeek-R1 and other representative models.
For education-oriented knowledge benchmarks such as MMLU, MMLU-Pro, and GPQA Diamond, DeepSeek-R1 demonstrates superior performance compared to DeepSeek-V3. This improvement is primarily attributed to enhanced accuracy in STEM-related questions, where significant gains are achieved through large-scale reinforcement learning. Additionally, DeepSeek-R1 excels on FRAMES, a long-context-dependent QA task, showcasing its strong document analysis capabilities. This highlights the potential of reasoning models in AI-driven search and data analysis tasks. On the factual benchmark SimpleQA, DeepSeek-R1 outperforms DeepSeek-V3, demonstrating its capability in handling fact-based queries. A similar trend is observed where OpenAI-o1 surpasses GPT-4o on this benchmark. However, DeepSeek-R1 performs worse than DeepSeek-V3 on the Chinese SimpleQA benchmark, primarily due to its tendency to refuse answering certain queries after safety RL. Without safety RL, DeepSeek-R1 could achieve an accuracy of over 70%.

DeepSeek-R1 also delivers impressive results on IF-Eval, a benchmark designed to assess a model's ability to follow format instructions. These improvements can be linked to the inclusion of instruction-following data during the final stages of supervised fine-tuning (SFT) and RL training. Furthermore, remarkable performance is observed on AlpacaEval 2.0 and ArenaHard, indicating DeepSeek-R1's strengths in writing tasks and open-domain question answering. Its significant outperformance of DeepSeek-V3 underscores the generalization benefits of large-scale RL, which not only boosts reasoning capabilities but also improves performance across diverse domains. Moreover, the summary lengths generated by DeepSeek-R1 are concise, with an average of 689 tokens on ArenaHard and 2,218 characters on AlpacaEval 2.0. This indicates that DeepSeek-R1 avoids introducing length bias during GPT-based evaluations, further solidifying its robustness across multiple tasks.

On math tasks, DeepSeek-R1 demonstrates performance on par with OpenAI-o1-1217, surpassing other models by a large margin. A similar trend is observed on coding algorithm tasks, such as LiveCodeBench and Codeforces, where reasoning-focused models dominate these benchmarks. On engineering-oriented coding tasks, OpenAI-o1-1217 outperforms DeepSeek-R1 on Aider but achieves comparable performance on SWE Verified. We believe the engineering performance of DeepSeek-R1 will improve in the next version, as the amount of related RL training data currently remains very limited.
3.2. Distilled Model Evaluation

Model                          | AIME 2024 (pass@1) | AIME 2024 (cons@64) | MATH-500 (pass@1) | GPQA Diamond (pass@1) | LiveCodeBench (pass@1) | CodeForces (rating)
GPT-4o-0513                    | 9.3  | 13.4 | 74.6 | 49.9 | 32.9 | 759
Claude-3.5-Sonnet-1022         | 16.0 | 26.7 | 78.3 | 65.0 | 38.9 | 717
OpenAI-o1-mini                 | 63.6 | 80.0 | 90.0 | 60.0 | 53.8 | 1820
QwQ-32B-Preview                | 50.0 | 60.0 | 90.6 | 54.5 | 41.9 | 1316
DeepSeek-R1-Distill-Qwen-1.5B  | 28.9 | 52.7 | 83.9 | 33.8 | 16.9 | 954
DeepSeek-R1-Distill-Qwen-7B    | 55.5 | 83.3 | 92.8 | 49.1 | 37.6 | 1189
DeepSeek-R1-Distill-Qwen-14B   | 69.7 | 80.0 | 93.9 | 59.1 | 53.1 | 1481
DeepSeek-R1-Distill-Qwen-32B   | 72.6 | 83.3 | 94.3 | 62.1 | 57.2 | 1691
DeepSeek-R1-Distill-Llama-8B   | 50.4 | 80.0 | 89.1 | 49.0 | 39.6 | 1205
DeepSeek-R1-Distill-Llama-70B  | 70.0 | 86.7 | 94.5 | 65.2 | 57.5 | 1633

Table 5 | Comparison of DeepSeek-R1 distilled models and other comparable models on reasoning-related benchmarks.

As shown in Table 5, simply distilling DeepSeek-R1's outputs enables the efficient DeepSeek-R1-7B (i.e., DeepSeek-R1-Distill-Qwen-7B, abbreviated similarly below) to outperform non-reasoning models like GPT-4o-0513 across the board. DeepSeek-R1-14B surpasses QwQ-32B-Preview on all evaluation metrics, while DeepSeek-R1-32B and DeepSeek-R1-70B significantly exceed o1-mini on most benchmarks. These results demonstrate the strong potential of distillation. Additionally, we found that applying RL to these distilled models yields significant further gains. We believe this warrants further exploration and therefore present only the results of the simple SFT-distilled models here.
4. Discussion

4.1. Distillation v.s. Reinforcement Learning

In Section 3.2, we can see that by distilling DeepSeek-R1, the small model can achieve impressive results. However, there is still one question left: can the model achieve comparable performance through the large-scale RL training discussed in the paper without distillation?

To answer this question, we conduct large-scale RL training on Qwen-32B-Base using math, code, and STEM data, training for over 10K steps, resulting in DeepSeek-R1-Zero-Qwen-32B. The experimental results, shown in Table 6, demonstrate that the 32B base model, after large-scale RL training, achieves performance on par with QwQ-32B-Preview. However, DeepSeek-R1-Distill-Qwen-32B, which is distilled from DeepSeek-R1, performs significantly better than DeepSeek-R1-Zero-Qwen-32B across all benchmarks.

Model                          | AIME 2024 (pass@1) | AIME 2024 (cons@64) | MATH-500 (pass@1) | GPQA Diamond (pass@1) | LiveCodeBench (pass@1)
QwQ-32B-Preview                | 50.0 | 60.0 | 90.6 | 54.5 | 41.9
DeepSeek-R1-Zero-Qwen-32B      | 47.0 | 60.0 | 91.6 | 55.0 | 40.2
DeepSeek-R1-Distill-Qwen-32B   | 72.6 | 83.3 | 94.3 | 62.1 | 57.2

Table 6 | Comparison of distilled and RL models on reasoning-related benchmarks.

Therefore, we can draw two conclusions: First, distilling more powerful models into smaller ones yields excellent results, whereas smaller models relying on the large-scale RL mentioned in this paper require enormous computational power and may not even achieve the performance of distillation. Second, while distillation strategies are both economical and effective, advancing beyond the boundaries of intelligence may still require more powerful base models and larger-scale reinforcement learning.
4.2. Unsuccessful Attempts

In the early stages of developing DeepSeek-R1, we also encountered failures and setbacks along the way. We share our failure experiences here to provide insights, but this does not imply that these approaches are incapable of developing effective reasoning models.

Process Reward Model (PRM). PRM is a reasonable method to guide the model toward better approaches for solving reasoning tasks (Lightman et al., 2023; Uesato et al., 2022; Wang et al., 2023). However, in practice, PRM has three main limitations that may hinder its ultimate success. First, it is challenging to explicitly define a fine-grained step in general reasoning. Second, determining whether the current intermediate step is correct is a challenging task. Automated annotation using models may not yield satisfactory results, while manual annotation is not conducive to scaling up. Third, once a model-based PRM is introduced, it inevitably leads to reward hacking (Gao et al., 2022), and retraining the reward model needs additional training resources and complicates the whole training pipeline. In conclusion, while PRM demonstrates a good ability to rerank the top-N responses generated by the model or assist in guided search (Snell et al., 2024), its advantages are limited compared to the additional computational overhead it introduces during the large-scale reinforcement learning process in our experiments.
Monte Carlo Tree Search (MCTS). Inspired by AlphaGo (Silver et al., 2017b) and AlphaZero (Silver et al., 2017a), we explored using Monte Carlo Tree Search (MCTS) to enhance test-time compute scalability. This approach involves breaking answers into smaller parts to allow the model to explore the solution space systematically. To facilitate this, we prompt the model to generate multiple tags that correspond to specific reasoning steps necessary for the search. For training, we first use collected prompts to find answers via MCTS guided by a pre-trained value model. Subsequently, we use the resulting question-answer pairs to train both the actor model and the value model, iteratively refining the process.

However, this approach encounters several challenges when scaling up the training. First, unlike chess, where the search space is relatively well-defined, token generation presents an exponentially larger search space. To address this, we set a maximum extension limit for each node, but this can lead to the model getting stuck in local optima. Second, the value model directly influences the quality of generation since it guides each step of the search process. Training a fine-grained value model is inherently difficult, which makes it challenging for the model to iteratively improve. While AlphaGo's core success relied on training a value model to progressively enhance its performance, this principle proves difficult to replicate in our setup due to the complexities of token generation.

In conclusion, while MCTS can improve performance during inference when paired with a pre-trained value model, iteratively boosting model performance through self-search remains a significant challenge.
5. Conclusion, Limitations, and Future Work

In this work, we share our journey in enhancing model reasoning abilities through reinforcement learning. DeepSeek-R1-Zero represents a pure RL approach without relying on cold-start data, achieving strong performance across various tasks. DeepSeek-R1 is more powerful, leveraging cold-start data alongside iterative RL fine-tuning. Ultimately, DeepSeek-R1 achieves performance comparable to OpenAI-o1-1217 on a range of tasks.

We further explore distillation of the reasoning capability to small dense models. We use DeepSeek-R1 as the teacher model to generate 800K training samples, and fine-tune several small dense models. The results are promising: DeepSeek-R1-Distill-Qwen-1.5B outperforms GPT-4o and Claude-3.5-Sonnet on math benchmarks with 28.9% on AIME and 83.9% on MATH. Other dense models also achieve impressive results, significantly outperforming other instruction-tuned models based on the same underlying checkpoints.

In the future, we plan to invest in research across the following directions for DeepSeek-R1.
• General Capability: Currently, the capabilities of DeepSeek-R1 fall short of DeepSeek-V3 in tasks such as function calling, multi-turn conversation, complex role-playing, and JSON output. Moving forward, we plan to explore how long CoT can be leveraged to enhance tasks in these fields.
• Language Mixing: DeepSeek-R1 is currently optimized for Chinese and English, which may result in language mixing issues when handling queries in other languages. For instance, DeepSeek-R1 might use English for reasoning and responses, even if the query is in a language other than English or Chinese. We aim to address this limitation in future updates.
• Prompt Engineering: When evaluating DeepSeek-R1, we observe that it is sensitive to prompts. Few-shot prompting consistently degrades its performance. Therefore, we recommend that users directly describe the problem and specify the output format using a zero-shot setting for optimal results.
• Software Engineering Tasks: Due to the long evaluation times, which impact the efficiency of the RL process, large-scale RL has not been applied extensively in software engineering tasks. As a result, DeepSeek-R1 has not demonstrated a huge improvement over DeepSeek-V3 on software engineering benchmarks. Future versions will address this by implementing rejection sampling on software engineering data or incorporating asynchronous evaluations during the RL process to improve efficiency.
References

AI@Meta. Llama 3.1 model card, 2024. URL https:/

Anthropic. Claude 3.5 Sonnet, 2024. URL https:/

M. Chen, J. Tworek, H. Jun, Q. Yuan, H. P. de Oliveira Pinto, J. Kaplan, H. Edwards, Y. Burda, N. Joseph, G. Brockman, A. Ray, R. Puri, G. Krueger, M. Petrov, H. Khlaaf, G. Sastry, P. Mishkin, B. Chan, S. Gray, N. Ryder, M. Pavlov, A. Power, L. Kaiser, M. Bavarian, C. Winter, P. Tillet, F. P. Such, D. Cummings, M. Plappert, F. Chantzis, E. Barnes, A. Herbert-Voss, W. H. Guss, A. Nichol, A. Paino, N. Tezak, J. Tang, I. Babuschkin, S. Balaji, S. Jain, W. Saunders, C. Hesse, A. N. Carr, J. Leike, J. Achiam, V. Misra, E. Morikawa, A. Radford, M. Knight, M. Brundage, M. Murati, K. Mayer, P. Welinder, B. McGrew, D. Amodei, S. McCandlish, I. Sutskever, and W. Zaremba. Evaluating large language models trained on code. CoRR, abs/2107.03374, 2021. URL https://arxiv.org/abs/2107.03374.

A. Dubey, A. Jauhri, A. Pandey, A. Kadian, A. Al-Dahle, A. Letman, A. Mathur, A. Schelten, A. Yang, A. Fan, et al. The Llama 3 herd of models. arXiv preprint arXiv:2407.21783, 2024.

Y. Dubois, B. Galambosi, P. Liang, and T. B. Hashimoto. Length-controlled AlpacaEval: A simple way to debias automatic evaluators. arXiv preprint arXiv:2404.04475, 2024.

X. Feng, Z. Wan, M. Wen, S. M. McAleer, Y. Wen, W. Zhang, and J. Wang. AlphaZero-like tree-search can guide large language model decoding and training, 2024. URL https://arxiv.org/abs/2309.17179.

L. Gao, J. Schulman, and J. Hilton. Scaling laws for reward model overoptimization, 2022. URL https://arxiv.org/abs/2210.10760.

A. P. Gema, J. O. J. Leang, G. Hong, A. Devoto, A. C. M. Mancino, R. Saxena, X. He, Y. Zhao, X. Du, M. R. G. Madani, C. Barale, R. McHardy, J. Harris, J. Kaddour, E. van Krieken, and P. Minervini. Are we done with MMLU? CoRR, abs/2406.04127, 2024. URL https://doi.org/10.48550/arXiv.2406.04127.

Google. Our next-generation model: Gemini 1.5, 2024. URL https://blog.google/technology/ai/google-gemini-next-generation-model-february-2024.

Y. He, S. Li, J. Liu, Y. Tan, W. Wang, H. Huang, X. Bu, H. Guo, C. Hu, B. Zheng, et al. Chinese SimpleQA: A Chinese factuality evaluation for large language models. arXiv preprint arXiv:2411.07140, 2024.

D. Hendrycks, C. Burns, S. Basart, A. Zou, M. Mazeika, D. Song, and J. Steinhardt. Measuring massive multitask language understanding. arXiv preprint arXiv:2009.03300, 2020.

Y. Huang, Y. Bai, Z. Zhu, J. Zhang, J. Zhang, T. Su, J. Liu, C. Lv, Y. Zhang, J. Lei, et al. C-Eval: A multi-level multi-discipline Chinese evaluation suite for foundation models. arXiv preprint arXiv:2305.08322, 2023.

N. Jain, K. Han, A. Gu, W. Li, F. Yan, T. Zhang, S. Wang, A. Solar-Lezama, K. Sen, and I. Stoica. LiveCodeBench: Holistic and contamination free evaluation of large language models for code. CoRR, abs/2403.07974, 2024. URL https://doi.org/10.48550/arXiv.2403.07974.

S. Krishna, K. Krishna, A. Mohananey, S. Schwarcz, A. Stambler, S. Upadhyay, and M. Faruqui. Fact, fetch, and reason: A unified evaluation of retrieval-augmented generation. CoRR, abs/2409.12941, 2024. doi: 10.48550/ARXIV.2409.12941. URL https://doi.org/10.48550/arXiv.2409.12941.

A. Kumar, V. Zhuang, R. Agarwal, Y. Su, J. D. Co-Reyes, A. Singh, K. Baumli, S. Iqbal, C. Bishop, R. Roelofs, et al. Training language models to self-correct via reinforcement learning. arXiv preprint arXiv:2409.12917, 2024.

H. Li, Y. Zhang, F. Koto, Y. Yang, H. Zhao, Y. Gong, N. Duan, and T. Baldwin. CMMLU: Measuring massive multitask language understanding in Chinese. arXiv preprint arXiv:2306.09212, 2023.

T. Li, W.-L. Chiang, E. Frick, L. Dunlap, T. Wu, B. Zhu, J. E. Gonzalez, and I. Stoica. From crowdsourced data to high-quality benchmarks: Arena-Hard and BenchBuilder pipeline. arXiv preprint arXiv:2406.11939, 2024.

H. Lightman, V. Kosaraju, Y. Burda, H. Edwards, B. Baker, T. Lee, J. Leike, J. Schulman, I. Sutskever, and K. Cobbe. Let's verify step by step. arXiv preprint arXiv:2305.20050, 2023.

B. Y. Lin. ZeroEval: A unified framework for evaluating language models, July 2024. URL https:/

MAA. American Invitational Mathematics Examination - AIME. In American Invitational Mathematics Examination - AIME 2024, February 2024. URL https://maa.org/math-competitions/american-invitational-mathematics-examination-aime.

OpenAI. Hello GPT-4o, 2024a. URL https:/

OpenAI. Learning to reason with LLMs, 2024b. URL https:/

OpenAI. Introducing SimpleQA, 2024c. URL https:/

OpenAI. Introducing SWE-bench Verified: we're releasing a human-validated subset of SWE-bench, 2024d. URL https:/

Qwen. QwQ: Reflect deeply on the boundaries of the unknown, 2024a. URL https://qwenlm.github.io/blog/qwq-32b-preview/.

Qwen. Qwen2.5: A party of foundation models, 2024b. URL https://qwenlm.github.io/blog/qwen2.5.

D. Rein, B. L. Hou, A. C. Stickland, J. Petty, R. Y. Pang, J. Dirani, J. Michael, and S. R. Bowman. GPQA: A graduate-level Google-proof Q&A benchmark. arXiv preprint arXiv:2311.12022, 2023.

Z. Shao, P. Wang, Q. Zhu, R. Xu, J. Song, M. Zhang, Y. Li, Y. Wu, and D. Guo. DeepSeekMath: Pushing the limits of mathematical reasoning in open language models. arXiv preprint arXiv:2402.03300, 2024.

D. Silver, T. Hubert, J. Schrittwieser, I. Antonoglou, M. Lai, A. Guez, M. Lanctot, L. Sifre, D. Kumaran, T. Graepel, T. P. Lillicrap, K. Simonyan, and D. Hassabis. Mastering chess and shogi by self-play with a general reinforcement learning algorithm. CoRR, abs/1712.01815, 2017a. URL http://arxiv.org/abs/1712.01815.

D. Silver, J. Schrittwieser, K. Simonyan, I. Antonoglou, A. Huang, A. Guez, T. Hubert, L. Baker, M. Lai, A. Bolton, Y. Chen, T. P. Lillicrap, F. Hui, L. Sifre, G. van den Driessche, T. Graepel, and D. Hassabis. Mastering the game of Go without human knowledge. Nature, 550(7676):354-359, 2017b. doi: 10.1038/NATURE24270. URL https://doi.org/10.1038/nature24270.

C. Snell, J. Lee, K. Xu, and A. Kumar. Scaling LLM test-time compute optimally can be more effective than scaling model parameters, 2024. URL https://arxiv.org/abs/2408.03314.

T. Trinh, Y. Wu, Q. Le, H. He, and T. Luong. Solving olympiad geometry without human demonstrations. Nature, 2024. doi: 10.1038/s41586-023-06747-5.

J. Uesato, N. Kushman, R. Kumar, F. Song, N. Siegel, L. Wang, A. Creswell, G. Irving, and I. Higgins. Solving math word problems with process- and outcome-based feedback. arXiv preprint arXiv:2211.14275, 2022.

P. Wang, L. Li, Z. Shao, R. Xu, D. Dai, Y. Li, D. Chen, Y. Wu, and Z. Sui. Math-Shepherd: A label-free step-by-step verifier for LLMs in mathematical reasoning. arXiv preprint arXiv:2312.08935, 2023.

X. Wang, J. Wei, D. Schuurmans, Q. Le, E. Chi, S. Narang, A. Chowdhery, and D. Zhou. Self-consistency improves chain of thought reasoning in language models. arXiv preprint arXiv:2203.11171, 2022.

Y. Wang, X. Ma, G. Zhang, Y. Ni, A. Chandra, S. Guo, W. Ren, A. Arulraj, X. He, Z. Jiang, T. Li, M. Ku, K. Wang, A. Zhuang, R. Fan, X. Yue, and W. Chen. MMLU-Pro: A more robust and challenging multi-task language understanding benchmark. CoRR, abs/2406.01574, 2024. URL https://doi.org/10.48550/arXiv.2406.01574.

C. S. Xia, Y. Deng, S. Dunn, and L. Zhang. Agentless: Demystifying LLM-based software engineering agents. arXiv preprint, 2024.

H. Xin, Z. Z. Ren, J. Song, Z. Shao, W. Zhao, H. Wang, B. Liu, L. Zhang, X. Lu, Q. Du, W. Gao, Q. Zhu, D. Yang, Z. Gou, Z. F. Wu, F. Luo, and C. Ruan. DeepSeek-Prover-V1.5: Harnessing proof assistant feedback for reinforcement learning and Monte-Carlo tree search, 2024. URL https://arxiv.org/abs/2408.08152.

J. Zhou, T. Lu, S. Mishra, S. Brahma, S. Basu, Y. Luan, D. Zhou, and L. Hou. Instruction-following evaluation for large language models. arXiv preprint arXiv:2311.07911, 2023.
Appendix

A. Contributions and Acknowledgments

Core Contributors: Daya Guo, Dejian Yang, Haowei Zhang, Junxiao Song, Ruoyu Zhang, Runxin Xu, Qihao Zhu, Shirong Ma, Peiyi Wang, Xiao Bi, Xiaokang Zhang, Xingkai Yu, Yu Wu, Z.F. Wu, Zhibin Gou, Zhihong Shao, Zhuoshu Li, Ziyi Gao

Contributors: Aixin Liu, Bing Xue, Bingxuan Wang, Bochao Wu, Bei Feng, Chengda Lu, Chenggang Zhao, Chengqi Deng, Chong Ruan, Damai Dai, Deli Chen, Dongjie Ji, Erhang Li, Fangyun Lin, Fucong Dai, Fuli Luo*, Guangbo Hao, Guanting Chen, Guowei Li, H. Zhang, Hanwei Xu, Honghui Ding, Huazuo Gao, Hui Qu, Hui Li, Jianzhong Guo, Jiashi Li, Jingchang Chen, Jingyang Yuan, Jinhao Tu, Junjie Qiu, Junlong Li, J.L. Cai, Jiaqi Ni, Jian Liang, Jin Chen, Kai Dong, Kai Hu*, Kaichao You, Kaige Gao, Kang Guan, Kexin Huang, Kuai Yu, Lean Wang, Lecong Zhang, Liang Zhao, Litong Wang, Liyue Zhang, Lei Xu, Leyi Xia, Mingchuan Zhang, Minghua Zhang, Minghui Tang, Mingxu Zhou, Meng Li, Miaojun Wang, Mingming Li, Ning Tian, Panpan Huang, Peng Zhang, Qiancheng Wang, Qinyu Chen, Qiushi Du, Ruiqi Ge*, Ruisong Zhang, Ruizhe Pan, Runji Wang, R.J. Chen, R.L. Jin, Ruyi Chen, Shanghao Lu, Shangyan Zhou, Shanhuang Chen, Shengfeng Ye, Shiyu Wang, Shuiping Yu, Shunfeng Zhou, Shuting Pan, S.S. Li, Shuang Zhou, Shaoqing Wu, Shengfeng Ye, Tao Yun, Tian Pei, Tianyu Sun, T. Wang, Wangding Zeng, Wen Liu, Wenfeng Liang, Wenjun Gao, Wenqin Yu*, Wentao Zhang, W.L. Xiao, Wei An, Xiaodong Liu, Xiaohan Wang, Xiaokang Chen, Xiaotao Nie, Xin Cheng, Xin Liu, Xin Xie, Xingchao Liu, Xinyu Yang, Xinyuan Li, Xuecheng Su, Xuheng Lin, X.Q. Li, Xiangyue Jin, Xiaojin Shen, Xiaosha Chen, Xiaowen Sun, Xiaoxiang Wang, Xinnan Song, Xinyi Zhou, Xianzu Wang, Xinxia Shan, Y.K. Li, Y.Q. Wang, Y.X. Wei, Yang Zhang, Yanhong Xu, Yao Li, Yao Zhao, Yaofeng Sun, Yaohui Wang, Yi Yu, Yichao Zhang, Yifan Shi, Yiliang Xiong, Ying He, Yishi Piao, Yisong Wang, Yixuan Tan, Yiyang Ma*, Yiyuan Liu, Yongqiang Guo, Yuan Ou, Yuduan Wang, Yue Gong, Yuheng Zou, Yujia He, Yunfan Xiong, Yuxiang Luo, Yuxiang You, Yuxuan Liu, Yuyang Zhou, Y.X. Zhu, Yanping Huang, Yaohui Li, Yi Zheng, Yuchen Zhu, Yunxian Ma, Ying Tang, Yukun Zha, Yuting Yan, Z.Z. Ren, Zehui Ren, Zhangli Sha, Zhe Fu, Zhean Xu, Zhenda Xie, Zhengyan Zhang, Zhewen Hao, Zhicheng Ma, Zhigang Yan, Zhiyu Wu, Zihui Gu, Zijia Zhu, Zijun Liu*, Zilin Li, Ziwei Xie, Ziyang Song, Zizheng Pan, Zhen Huang, Zhipeng Xu, Zhongyu Zhang, Zhen Zhang

Within each role, authors are listed alphabetically by the first name. Names marked with * denote individuals who have departed from our team.