KIMI K1.5: SCALING REINFORCEMENT LEARNING WITH LLMS
TECHNICAL REPORT OF KIMI K1.5

Kimi Team

ABSTRACT

Language model pretraining with next token prediction has proved effective for scaling compute but is limited by the amount of available training data. Scaling reinforcement learning (RL) unlocks a new axis for the continued improvement of artificial intelligence, with the promise that large language models (LLMs) can scale their training data by learning to explore with rewards. However, prior published work has not produced competitive results. In light of this, we report on the training practice of Kimi k1.5, our latest multi-modal LLM trained with RL, including its RL training techniques, multi-modal data recipes, and infrastructure optimization. Long context scaling and improved policy optimization methods are key ingredients of our approach, which establishes a simplistic, effective RL framework without relying on more complex techniques such as Monte Carlo tree search, value functions, and process reward models. Notably, our system achieves state-of-the-art reasoning performance across multiple benchmarks and modalities, e.g., 77.5 on AIME, 96.2 on MATH 500, 94th percentile on Codeforces, and 74.9 on MathVista, matching OpenAI's o1. Moreover, we present effective long2short methods that use long-CoT techniques to improve short-CoT models, yielding state-of-the-art short-CoT reasoning results, e.g., 60.8 on AIME, 94.6 on MATH 500, and 47.3 on LiveCodeBench, outperforming existing short-CoT models such as GPT-4o and Claude Sonnet 3.5 by a large margin (up to +550%).

Figure 1: Kimi k1.5 long-CoT results. [Radar chart comparing Kimi k1.5 long-CoT against OpenAI o1, OpenAI o1-mini, QwQ-32B Preview, and QVQ-72B-Preview on vision (MathVista, MMMU), code (Codeforces, LiveCodeBench v5), and math (MATH 500, AIME 2024) benchmarks.]

Figure 2: Kimi k1.5 short-CoT results. [Radar chart comparing Kimi k1.5 short-CoT against GPT-4o, Claude 3.5 Sonnet, DeepSeek V3, LLaMA-3.1 405B-Inst., Qwen2.5 72B-Inst., and Qwen2-VL on general (MMLU, IF-Eval, CLUEWSC, C-Eval), code (LiveCodeBench v4), math (MATH-500, AIME 2024), and vision (MMMU, MathVista) benchmarks.]
1 Introduction

Language model pretraining with next token prediction has been studied under the context of the scaling law, where proportionally scaling model parameters and data sizes leads to the continued improvement of intelligence (Kaplan et al. 2020; Hoffmann et al. 2022). However, this approach is limited by the amount of available high-quality training data (Villalobos et al. 2024; Muennighoff et al. 2023). In this report, we present the training recipe of Kimi k1.5, our latest multi-modal LLM trained with reinforcement learning (RL). The goal is to explore a possible new axis for continued scaling. Using RL with LLMs, the model learns to explore with rewards and thus is not limited to a pre-existing static dataset.

There are a few key ingredients in the design and training of k1.5.

Long context scaling. We scale the context window of RL to 128k and observe continued improvement of performance with an increased context length. A key idea behind our approach is to use partial rollouts to improve training efficiency, i.e., sampling new trajectories by reusing a large chunk of previous trajectories and avoiding the cost of re-generating new trajectories from scratch. Our observation identifies the context length as a key dimension of the continued scaling of RL with LLMs.

Improved policy optimization. We derive a formulation of RL with long-CoT and employ a variant of online mirror descent for robust policy optimization. This algorithm is further improved by our effective sampling strategy, length penalty, and optimization of the data recipe.

Simplistic framework. Long context scaling, combined with the improved policy optimization methods, establishes a simplistic RL framework for learning with LLMs. Since we are able to scale the context length, the learned CoTs exhibit the properties of planning, reflection, and correction. An increased context length has the effect of increasing the number of search steps. As a result, we show that strong performance can be achieved without relying on more complex techniques such as Monte Carlo tree search, value functions, and process reward models.

Multimodalities. Our model is jointly trained on text and vision data, which gives it the capability of reasoning jointly over the two modalities.

Moreover, we present effective long2short methods that use long-CoT techniques to improve short-CoT models. Specifically, our approaches include applying a length penalty with long-CoT activations and model merging.

Our long-CoT version achieves state-of-the-art reasoning performance across multiple benchmarks and modalities, e.g., 77.5 on AIME, 96.2 on MATH 500, 94th percentile on Codeforces, and 74.9 on MathVista, matching OpenAI's o1. Our model also achieves state-of-the-art short-CoT reasoning results, e.g., 60.8 on AIME, 94.6 on MATH 500, and 47.3 on LiveCodeBench, outperforming existing short-CoT models such as GPT-4o and Claude Sonnet 3.5 by a large margin (up to +550%). Results are shown in Figures 1 and 2.
2 Approach: Reinforcement Learning with LLMs

The development of Kimi k1.5 consists of several stages: pretraining, vanilla supervised fine-tuning (SFT), long-CoT supervised fine-tuning, and reinforcement learning (RL). This report focuses on RL, beginning with an overview of the RL prompt set curation (Section 2.1) and long-CoT supervised fine-tuning (Section 2.2), followed by an in-depth discussion of RL training strategies in Section 2.3. Additional details on pretraining and vanilla supervised fine-tuning can be found in Section 2.5.

2.1 RL Prompt Set Curation

Through our preliminary experiments, we found that the quality and diversity of the RL prompt set play a critical role in ensuring the effectiveness of reinforcement learning. A well-constructed prompt set not only guides the model toward robust reasoning but also mitigates the risk of reward hacking and overfitting to superficial patterns. Specifically, three key properties define a high-quality RL prompt set:

Diverse Coverage: Prompts should span a wide array of disciplines, such as STEM, coding, and general reasoning, to enhance the model's adaptability and ensure broad applicability across different domains.

Balanced Difficulty: The prompt set should include a well-distributed range of easy, moderate, and difficult questions to facilitate gradual learning and prevent overfitting to specific complexity levels.

Accurate Evaluability: Prompts should allow objective and reliable assessment by verifiers, ensuring that model performance is measured based on correct reasoning rather than superficial patterns or random guessing.

To achieve diverse coverage in the prompt set, we employ automatic filters to select questions that require rich reasoning and are straightforward to evaluate. Our dataset includes problems from various domains, such as STEM fields, competitions, and general reasoning tasks, incorporating both text-only and image-text question-answering data. Furthermore, we developed a tagging system to categorize prompts by domain and discipline, ensuring balanced representation across different subject areas (M. Li et al. 2023; W. Liu et al. 2023).

We adopt a model-based approach that leverages the model's own capacity to adaptively assess the difficulty of each prompt. Specifically, for every prompt, an SFT model generates answers ten times using a relatively high sampling temperature. The pass rate is then calculated and used as a proxy for the prompt's difficulty: the lower the pass rate, the higher the difficulty. This approach allows difficulty evaluation to be aligned with the model's intrinsic capabilities, making it highly effective for RL training. By leveraging this method, we can prefilter most trivial cases and easily explore different sampling strategies during RL training.

To avoid potential reward hacking (Everitt et al. 2021; Pan et al. 2022), we need to ensure that both the reasoning process and the final answer of each prompt can be accurately verified. Empirical observations reveal that some complex reasoning problems may have relatively simple and easily guessable answers, leading to false positive verification, where the model reaches the correct answer through an incorrect reasoning process. To address this issue, we exclude questions that are prone to such errors, such as multiple-choice, true/false, and proof-based questions.

Furthermore, for general question-answering tasks, we propose a simple yet effective method to identify and remove easy-to-hack prompts, sketched below. Specifically, we prompt a model to guess potential answers without any CoT reasoning steps. If the model predicts the correct answer within N attempts, the prompt is considered too easy to hack and is removed. We found that setting N = 8 can remove the majority of easy-to-hack prompts. Developing more advanced verification models remains an open direction for future research.
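The following Python sketch illustrates how the two filters described above could be combined: the pass-rate difficulty proxy (ten samples at a high temperature) and the N = 8 no-CoT guessing check. Function names such as `sample_answer` and `is_correct`, and the rule that prompts with a perfect pass rate count as trivial, are our assumptions rather than part of the released pipeline.

```python
# Minimal sketch of the prompt-set filters described in Section 2.1.
# `sample_answer` and `is_correct` are hypothetical stand-ins for the
# actual SFT sampler and answer verifier.

def pass_rate(prompt, answer, sample_answer, is_correct, k=10, temperature=1.0):
    """Fraction of k sampled answers judged correct (difficulty proxy)."""
    hits = sum(
        is_correct(sample_answer(prompt, temperature=temperature, use_cot=True), answer)
        for _ in range(k)
    )
    return hits / k

def is_easy_to_hack(prompt, answer, sample_answer, is_correct, n_attempts=8):
    """A prompt is easy to hack if the model guesses the answer without CoT."""
    return any(
        is_correct(sample_answer(prompt, temperature=1.0, use_cot=False), answer)
        for _ in range(n_attempts)
    )

def filter_prompt_set(dataset, sample_answer, is_correct):
    kept = []
    for prompt, answer in dataset:
        rate = pass_rate(prompt, answer, sample_answer, is_correct)
        if rate == 1.0:                       # drop trivial prompts (assumption)
            continue
        if is_easy_to_hack(prompt, answer, sample_answer, is_correct):
            continue                          # drop guessable prompts
        kept.append((prompt, answer, 1.0 - rate))  # keep a difficulty label
    return kept
```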
2.2 Long-CoT Supervised Fine-Tuning

With the refined RL prompt set, we employ prompt engineering to construct a small yet high-quality long-CoT warmup dataset, containing accurately verified reasoning paths for both text and image inputs. This approach resembles rejection sampling (RS) but focuses on generating long-CoT reasoning paths through prompt engineering. The resulting warmup dataset is designed to encapsulate key cognitive processes that are fundamental to human-like reasoning, such as planning, where the model systematically outlines steps before execution; evaluation, involving critical assessment of intermediate steps; reflection, enabling the model to reconsider and refine its approach; and exploration, encouraging consideration of alternative solutions. By performing a lightweight SFT on this warmup dataset, we effectively prime the model to internalize these reasoning strategies. As a result, the fine-tuned long-CoT model demonstrates improved capability in generating more detailed and logically coherent responses, which enhances its performance across diverse reasoning tasks.

2.3 Reinforcement Learning

2.3.1 Problem Setting

Given a training dataset $\mathcal{D} = \{(x_i, y_i^*)\}_{i=1}^n$ of problems $x_i$ and corresponding ground truth answers $y_i^*$, our goal is to train a policy model $\pi_\theta$ to accurately solve test problems. In the context of complex reasoning, the mapping of problem $x$ to solution $y$ is non-trivial. To tackle this challenge, the chain of thought (CoT) method proposes to use a sequence of intermediate steps $z = (z_1, z_2, \ldots, z_m)$ to bridge $x$ and $y$, where each $z_i$ is a coherent sequence of tokens that acts as a significant intermediate step toward solving the problem (J. Wei et al. 2022). When solving problem $x$, thoughts $z_t \sim \pi_\theta(\cdot \mid x, z_1, \ldots, z_{t-1})$ are auto-regressively sampled, followed by the final answer $y \sim \pi_\theta(\cdot \mid x, z_1, \ldots, z_m)$. We use $y, z \sim \pi_\theta$ to denote this sampling procedure. Note that both the thoughts and the final answer are sampled as a language sequence.

To further enhance the model's reasoning capabilities, planning algorithms are employed to explore various thought processes, generating improved CoT at inference time (Yao et al. 2024; Y. Wu et al. 2024; Snell et al. 2024). The core insight of these approaches is the explicit construction of a search tree of thoughts guided by value estimations. This allows the model to explore diverse continuations of a thought process or backtrack to investigate new directions when encountering dead ends. In more detail, let $\mathcal{T}$ be a search tree where each node represents a partial solution $s = (x, z_{1:|s|})$. Here $s$ consists of the problem $x$ and a sequence of thoughts $z_{1:|s|} = (z_1, \ldots, z_{|s|})$ leading up to that node, with $|s|$ denoting the number of thoughts in the sequence. The planning algorithm uses a critic model $v$ to provide feedback $v(x, z_{1:|s|})$, which helps evaluate the current progress towards solving the problem and identify any errors in the existing partial solution. We note that the feedback can be provided by either a discriminative score or a language sequence (L. Zhang et al. 2024). Guided by the feedback for all $s \in \mathcal{T}$, the planning algorithm selects the most promising node for expansion, thereby growing the search tree. The above process repeats iteratively until a full solution is derived.

We can also approach planning algorithms from an algorithmic perspective. Given the past search history available at the $t$-th iteration, $(s_1, v(s_1), \ldots, s_{t-1}, v(s_{t-1}))$, a planning algorithm $\mathcal{A}$ iteratively determines the next search direction $\mathcal{A}(s_t \mid s_1, v(s_1), \ldots, s_{t-1}, v(s_{t-1}))$ and provides feedback for the current search progress $\mathcal{A}(v(s_t) \mid s_1, v(s_1), \ldots, s_t)$. Since both thoughts and feedback can be viewed as intermediate reasoning steps, and these components can both be represented as sequences of language tokens, we use $z$ to replace $s$ and $v$ to simplify the notation. Accordingly, we view a planning algorithm as a mapping that directly acts on a sequence of reasoning steps, $\mathcal{A}(\cdot \mid z_1, z_2, \ldots)$. In this framework, all information stored in the search tree used by the planning algorithm is flattened into the full context provided to the algorithm. This provides an intriguing perspective on generating high-quality CoT: rather than explicitly constructing a search tree and implementing a planning algorithm, we could potentially train a model to approximate this process. Here, the number of thoughts (i.e., language tokens) serves as an analogy to the computational budget traditionally allocated to planning algorithms. Recent advancements in long context windows facilitate seamless scalability during both the training and testing phases. If feasible, this method enables the model to run an implicit search over the reasoning space directly via auto-regressive predictions. Consequently, the model not only learns to solve a set of training problems but also develops the ability to tackle individual problems effectively, leading to improved generalization to unseen test problems.

We thus consider training the model to generate CoT with reinforcement learning (RL) (OpenAI 2024). Let $r$ be a reward model that justifies the correctness of the proposed answer $y$ for the given problem $x$ based on the ground truth $y^*$, by assigning a value $r(x, y, y^*) \in \{0, 1\}$. For verifiable problems, the reward is directly determined by predefined criteria or rules. For example, in coding problems, we assess whether the answer passes the test cases. For problems with free-form ground truth, we train a reward model $r(x, y, y^*)$ that predicts whether the answer matches the ground truth. Given a problem $x$, the model $\pi_\theta$ generates a CoT and the final answer through the sampling procedure $z \sim \pi_\theta(\cdot \mid x)$, $y \sim \pi_\theta(\cdot \mid x, z)$. The quality of the generated CoT is evaluated by whether it can lead to a correct final answer. In summary, we consider the following objective to optimize the policy:

$$\max_\theta \; \mathbb{E}_{(x, y^*) \sim \mathcal{D}, \, (y, z) \sim \pi_\theta} \left[ r(x, y, y^*) \right]. \tag{1}$$

By scaling up RL training, we aim to train a model that harnesses the strengths of both simple prompt-based CoT and planning-augmented CoT. The model still auto-regressively samples language sequences during inference, thereby circumventing the need for the complex parallelization required by advanced planning algorithms during deployment. However, a key distinction from simple prompt-based methods is that the model should not merely follow a series of reasoning steps. Instead, it should also learn critical planning skills, including error identification, backtracking, and solution refinement, by leveraging the entire set of explored thoughts as contextual information.
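As a concrete illustration of a rule-based reward $r(x, y, y^*) \in \{0, 1\}$ for verifiable problems, the sketch below checks a math answer by exact match after light normalization and a coding answer by running test cases. The normalization and the `run_test_case` helper are simplifying assumptions for illustration; the actual verifiers, and the learned reward model used for free-form answers, are more involved.

```python
# Hedged sketch of rule-based rewards for verifiable problems (cf. Eq. 1).
# `run_test_case` is a hypothetical helper, not the production judge.

def math_reward(predicted_answer: str, ground_truth: str) -> int:
    """1 if the final answer matches the ground truth after normalization, else 0."""
    def normalize(s: str) -> str:
        return s.strip().replace(" ", "").lower()
    return int(normalize(predicted_answer) == normalize(ground_truth))

def code_reward(program: str, test_cases, run_test_case) -> int:
    """1 if the submitted program passes every test case, else 0."""
    return int(all(run_test_case(program, case) for case in test_cases))
```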
2.3.2 Policy Optimization

We apply a variant of online policy mirror descent as our training algorithm (Abbasi-Yadkori et al. 2019; Mei et al. 2019; Tomar et al. 2020). The algorithm performs iteratively. At the $i$-th iteration, we use the current model $\pi_{\theta_i}$ as a reference model and optimize the following relative entropy regularized policy optimization problem:

$$\max_\theta \; \mathbb{E}_{(x, y^*) \sim \mathcal{D}} \left[ \mathbb{E}_{(y, z) \sim \pi_\theta} \left[ r(x, y, y^*) \right] - \tau \, \mathrm{KL}\!\left( \pi_\theta(x) \,\|\, \pi_{\theta_i}(x) \right) \right], \tag{2}$$

where $\tau > 0$ is a parameter controlling the degree of regularization. This objective has a closed form solution

$$\pi^*(y, z \mid x) = \pi_{\theta_i}(y, z \mid x) \exp\!\left( r(x, y, y^*)/\tau \right) / Z.$$

Here $Z = \sum_{y', z'} \pi_{\theta_i}(y', z' \mid x) \exp\!\left( r(x, y', y^*)/\tau \right)$ is the normalization factor. Taking the logarithm of both sides, we have that for any $(y, z)$ the following constraint is satisfied, which allows us to leverage off-policy data during optimization:

$$r(x, y, y^*) - \tau \log Z = \tau \log \frac{\pi^*(y, z \mid x)}{\pi_{\theta_i}(y, z \mid x)}.$$

This motivates the following surrogate loss:

$$L(\theta) = \mathbb{E}_{(x, y^*) \sim \mathcal{D}} \left[ \mathbb{E}_{(y, z) \sim \pi_{\theta_i}} \left[ \left( r(x, y, y^*) - \tau \log Z - \tau \log \frac{\pi_\theta(y, z \mid x)}{\pi_{\theta_i}(y, z \mid x)} \right)^2 \right] \right].$$

To approximate $\tau \log Z$, we use samples $(y_1, z_1), \ldots, (y_k, z_k) \sim \pi_{\theta_i}$: $\tau \log Z \approx \tau \log \frac{1}{k} \sum_{j=1}^{k} \exp\!\left( r(x, y_j, y^*)/\tau \right)$. We also find that using the empirical mean of sampled rewards $\bar{r} = \mathrm{mean}\!\left( r(x, y_1, y^*), \ldots, r(x, y_k, y^*) \right)$ yields effective practical results. This is reasonable since $\tau \log Z$ approaches the expected reward under $\pi_{\theta_i}$ as $\tau \to \infty$. Finally, we conclude our learning algorithm by taking the gradient of the surrogate loss. For each problem $x$, $k$ responses are sampled using the reference policy $\pi_{\theta_i}$, and the gradient is given by

$$\frac{1}{k} \sum_{j=1}^{k} \left( \nabla_\theta \log \pi_\theta(y_j, z_j \mid x) \left( r(x, y_j, y^*) - \bar{r} \right) - \frac{\tau}{2} \nabla_\theta \left( \log \frac{\pi_\theta(y_j, z_j \mid x)}{\pi_{\theta_i}(y_j, z_j \mid x)} \right)^2 \right). \tag{3}$$

To those familiar with policy gradient methods, this gradient resembles the policy gradient of (2) using the mean of sampled rewards as the baseline (Kool et al. 2019; Ahmadian et al. 2024). The main differences are that the responses are sampled from $\pi_{\theta_i}$ rather than on-policy, and an $l_2$-regularization is applied. Thus we could see this as the natural extension of a usual on-policy regularized policy gradient algorithm to the off-policy case (Nachum et al. 2017). We sample a batch of problems from $\mathcal{D}$ and update the parameters to $\theta_{i+1}$, which subsequently serves as the reference policy for the next iteration. Since each iteration considers a different optimization problem due to the changing reference policy, we also reset the optimizer at the start of each iteration.
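To make the surrogate loss concrete, here is a minimal PyTorch-style sketch of the per-problem loss implied by (3): $k$ off-policy responses from the reference policy, the mean sampled reward used as the baseline in place of $\tau \log Z$, and a squared log-ratio term. The tensor shapes and the way log-probabilities are obtained are assumptions for illustration; the production trainer (batching, partial rollouts, optimizer resets) is not shown.

```python
import torch

def mirror_descent_loss(logp_theta, logp_ref, rewards, tau):
    """Sketch of the squared surrogate loss for one problem.

    logp_theta: (k,) summed log-probs of the k sampled responses under the
                current policy (requires grad).
    logp_ref:   (k,) summed log-probs of the same responses under the frozen
                reference policy pi_{theta_i}.
    rewards:    (k,) scalar rewards r(x, y_j, y*).
    tau:        regularization strength from Eq. (2).
    """
    baseline = rewards.mean()                   # empirical mean reward, standing in for tau*log Z
    log_ratio = logp_theta - logp_ref.detach()  # log pi_theta / pi_{theta_i}
    residual = (rewards - baseline).detach() - tau * log_ratio
    # Minimizing this squared residual yields an update direction proportional
    # to the expression in Eq. (3), up to constant factors.
    return 0.5 * (residual ** 2).mean()
```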
We exclude the value network in our training system, which has also been exploited in previous studies (Ahmadian et al. 2024). While this design choice significantly improves training efficiency, we also hypothesize that the conventional use of value functions for credit assignment in classical RL may not be suitable for our context. Consider a scenario where the model has generated a partial CoT $(z_1, z_2, \ldots, z_t)$ and there are two potential next reasoning steps: $z_{t+1}$ and $z'_{t+1}$. Assume that $z_{t+1}$ directly leads to the correct answer, while $z'_{t+1}$ contains some errors. If an oracle value function were accessible, it would indicate that $z_{t+1}$ preserves a higher value compared to $z'_{t+1}$. According to the standard credit assignment principle, selecting $z'_{t+1}$ would be penalized as it has a negative advantage relative to the current policy. However, exploring $z'_{t+1}$ is extremely valuable for training the model to generate long CoT. By using the justification of the final answer derived from a long CoT as the reward signal, the model can learn the pattern of trial and error from taking $z'_{t+1}$ as long as it successfully recovers and reaches the correct answer. The key takeaway from this example is that we should encourage the model to explore diverse reasoning paths to enhance its capability in solving complex problems. This exploratory approach generates a wealth of experience that supports the development of critical planning skills. Our primary goal is not confined to attaining high accuracy on training problems but focuses on equipping the model with effective problem-solving strategies, ultimately improving its performance on test problems.

2.3.3 Length Penalty

We observe an overthinking phenomenon in which the model's response length significantly increases during RL training. Although this leads to better performance, an excessively lengthy reasoning process is costly during training and inference, and overthinking is often not preferred by humans. To address this issue, we introduce a length reward to restrain the rapid growth of token length, thereby improving the model's token efficiency. Given $k$ sampled responses $(y_1, z_1), \ldots, (y_k, z_k)$ of problem $x$ with true answer $y^*$, let $\mathrm{len}(i)$ be the length of $(y_i, z_i)$, $\mathrm{min\_len} = \min_i \mathrm{len}(i)$ and $\mathrm{max\_len} = \max_i \mathrm{len}(i)$. If $\mathrm{max\_len} = \mathrm{min\_len}$, we set the length reward to zero for all responses, as they have the same length. Otherwise, the length reward is given by

$$\mathrm{len\_reward}(i) = \begin{cases} \lambda & \text{if } r(x, y_i, y^*) = 1 \\ \min(0, \lambda) & \text{if } r(x, y_i, y^*) = 0 \end{cases}, \qquad \text{where } \lambda = 0.5 - \frac{\mathrm{len}(i) - \mathrm{min\_len}}{\mathrm{max\_len} - \mathrm{min\_len}}.$$

In essence, we promote shorter responses and penalize longer responses among correct ones, while explicitly penalizing long responses with incorrect answers. This length-based reward is then added to the original reward with a weighting parameter.

In our preliminary experiments, the length penalty may slow down training during the initial phases. To alleviate this issue, we propose to gradually warm up the length penalty during training. Specifically, we employ standard policy optimization without length penalty, followed by a constant length penalty for the rest of training.
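A direct translation of the length reward above into Python is shown below. Only the group of $k$ sampled responses for a single problem is considered, and the weighting parameter that scales the length reward before it is added to the correctness reward is left as a comment, since its value is not specified in the report.

```python
def length_rewards(lengths, correctness):
    """Per-response length rewards for one problem, following Section 2.3.3.

    lengths:     token lengths len(i) of the k sampled responses.
    correctness: 0/1 rewards r(x, y_i, y*) for the same responses.
    """
    min_len, max_len = min(lengths), max(lengths)
    if max_len == min_len:
        return [0.0] * len(lengths)  # identical lengths: no length signal
    rewards = []
    for length, correct in zip(lengths, correctness):
        lam = 0.5 - (length - min_len) / (max_len - min_len)
        rewards.append(lam if correct == 1 else min(0.0, lam))
    return rewards

# The length reward is then added to the correctness reward with a weight, e.g.:
# total_reward_i = correctness[i] + weight * length_rewards(lengths, correctness)[i]
```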
2.3.4 Sampling Strategies

Although RL algorithms themselves have relatively good sampling properties (with more difficult problems providing larger gradients), their training efficiency is limited. Consequently, some well-defined prior sampling methods can yield potentially greater performance gains. We exploit multiple signals to further improve the sampling strategy. First, the RL training data we collect naturally comes with different difficulty labels. For example, a math competition problem is more difficult than a primary school math problem. Second, because the RL training process samples the same problem multiple times, we can also track the success rate for each individual problem as a metric of difficulty. We propose two sampling methods to utilize these priors to improve training efficiency; a sketch of the second follows this subsection.

Curriculum Sampling. We start by training on easier tasks and gradually progress to more challenging ones. Since the initial RL model has limited performance, spending a restricted computation budget on very hard problems often yields few correct samples, resulting in lower training efficiency. Meanwhile, our collected data naturally includes grade and difficulty labels, making difficulty-based sampling an intuitive and effective way to improve training efficiency.

Prioritized Sampling. In addition to curriculum sampling, we use a prioritized sampling strategy to focus on problems where the model underperforms. We track the success rates $s_i$ for each problem $i$ and sample problems proportional to $1 - s_i$, so that problems with lower success rates receive higher sampling probabilities. This directs the model's efforts toward its weakest areas, leading to faster learning and better overall performance.
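The prioritized sampling rule amounts to a categorical distribution over problems with probabilities proportional to $1 - s_i$. Below is a minimal NumPy sketch, assuming tracked success rates are stored in an array; the small epsilon floor, which keeps fully solved problems from receiving exactly zero probability, is our assumption and is not specified in the report.

```python
import numpy as np

def prioritized_sample(success_rates, batch_size, eps=1e-3, rng=None):
    """Sample problem indices with probability proportional to 1 - s_i."""
    rng = rng or np.random.default_rng()
    weights = np.maximum(1.0 - np.asarray(success_rates, dtype=float), eps)
    probs = weights / weights.sum()
    return rng.choice(len(probs), size=batch_size, replace=True, p=probs)

# Usage: success_rates[i] is the running fraction of correct rollouts for problem i.
# batch_indices = prioritized_sample(success_rates, batch_size=256)
```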
2.3.5 More Details on Training Recipe

Test Case Generation for Coding. Since test cases are not available for many coding problems from the web, we design a method to automatically generate test cases that serve as a reward to train our model with RL. Our focus is primarily on problems that do not require a special judge. We also assume that ground truth solutions are available for these problems so that we can leverage the solutions to generate higher quality test cases.

We utilize the widely recognized test case generation library CYaRon to enhance our approach. We employ our base Kimi k1.5 to generate test cases based on problem statements. The usage statement of CYaRon and the problem description are provided as the input to the generator. For each problem, we first use the generator to produce 50 test cases and also randomly sample 10 ground truth submissions for each test case. We run the test cases against the submissions. A test case is deemed valid if at least 7 out of 10 submissions yield matching results. After this round of filtering, we obtain a set of selected test cases. A problem and its associated selected test cases are added to our training set if at least 9 out of 10 submissions pass the entire set of selected test cases.

In terms of statistics, from a sample of 1,000 online contest problems, approximately 614 do not require a special judge. We developed 463 test case generators that produced at least 40 valid test cases, leading to the inclusion of 323 problems in our training set.
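The two-stage filter above can be sketched as follows: a generated test case is kept if at least 7 of the 10 ground-truth submissions agree on it, and a problem is kept if at least 9 of the 10 submissions pass every selected test case. The majority-vote reading of "matching results" and the `run_submission` helper (which executes a submission on a generated input and returns its output) are our assumptions, standing in for the actual judging sandbox.

```python
from collections import Counter

def select_test_cases(test_cases, submissions, run_submission, min_agree=7):
    """Keep a generated test case if >= min_agree submissions produce the same
    output; that majority output becomes the expected answer for the case."""
    selected = []
    for case in test_cases:
        outputs = [run_submission(sub, case) for sub in submissions]
        majority_output, count = Counter(outputs).most_common(1)[0]
        if count >= min_agree:
            selected.append((case, majority_output))
    return selected

def accept_problem(selected, submissions, run_submission, min_pass=9):
    """Keep a problem if >= min_pass submissions pass every selected test case."""
    def passes_all(sub):
        return all(run_submission(sub, case) == expected for case, expected in selected)
    return sum(passes_all(sub) for sub in submissions) >= min_pass
```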
Reward Modeling for Math. One challenge in evaluating math solutions is that different written forms can represent the same underlying answer. For instance, $a^2 - 4$ and $(a+2)(a-2)$ may both be valid solutions to the same problem. We adopted two methods to improve the reward model's scoring accuracy:

1. Classic RM: Drawing inspiration from the InstructGPT (Ouyang et al. 2022) methodology, we implemented a value-head based reward model and collected approximately 800k data points for fine-tuning. The model ultimately takes as input the "question," the "reference answer," and the "response," and outputs a single scalar that indicates whether the response is correct.

2. Chain-of-Thought RM: Recent research (Ankner et al. 2024; McAleese et al. 2024) suggests that reward models augmented with chain-of-thought (CoT) reasoning can significantly outperform classic approaches, particularly on tasks where nuanced correctness criteria matter, such as mathematics. Therefore, we collected an equally large dataset of about 800k CoT-labeled examples to fine-tune the Kimi model. Building on the same inputs as the Classic RM, the chain-of-thought approach explicitly generates a step-by-step reasoning process before providing a final correctness judgment in JSON format, enabling more robust and interpretable reward signals.

During our manual spot checks, the Classic RM achieved an accuracy of approximately 84.4%, while the Chain-of-Thought RM reached 98.5% accuracy. In the RL training process, we adopted the Chain-of-Thought RM to ensure more correct feedback.

Vision Data. To improve the model's real-world image reasoning capabilities and to achieve a more effective alignment between visual inputs and large language models (LLMs), our vision reinforcement learning (Vision RL) data is primarily sourced from three distinct categories: real-world data, synthetic visual reasoning data, and text-rendered data.

1. The real-world data encompass a range of science questions across various grade levels that require graphical comprehension and reasoning, location guessing tasks that necessitate visual perception and inference, and data analysis that involves understanding complex charts, among other types of data. These datasets improve the model's ability to perform visual reasoning in real-world scenarios.

2. Synthetic visual reasoning data is artificially generated, including procedurally created images and scenes aimed at improving specific visual reasoning skills, such as understanding spatial relationships, geometric patterns, and object interactions. These synthetic datasets offer a controlled environment for testing the model's visual reasoning capabilities and provide an endless supply of training examples.

3. Text-rendered data is created by converting textual content into a visual format, enabling the model to maintain consistency when handling text-based queries across different modalities. By transforming text documents, code snippets, and structured data into images, we ensure the model provides consistent responses regardless of whether the input is pure text or text rendered as images (like screenshots or photos). This also helps to enhance the model's capability when dealing with text-heavy images.

Each type of data is essential in building a comprehensive visual language model that can effectively manage a wide range of real-world applications while ensuring consistent performance across various input modalities.

2.4 Long2short: Context Compression for Short-CoT Models

Though long-CoT models achieve strong performance, they consume more test-time tokens compared to standard short-CoT LLMs. However, it is possible to transfer the thinking priors from long-CoT models to short-CoT models so that performance can be improved even with limited test-time token budgets. We present several approaches for this long2short problem, including model merging (Yang et al. 2024), shortest rejection sampling, DPO (Rafailov et al. 2024), and long2short RL. Detailed descriptions of these methods are provided below.

Model Merging. Model merging has been found to be useful in maintaining generalization ability. We also discovered its effectiveness in improving token efficiency when merging a long-CoT model and a short-CoT model. This approach combines a long-CoT model with a shorter model to obtain a new one without training. Specifically, we merge the two models by simply averaging their weights.
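Weight averaging of a long-CoT and a short-CoT checkpoint can be done directly on the state dicts, assuming the two models share the same architecture. The uniform 0.5/0.5 average below follows the "simply averaging their weights" description; the checkpoint paths and any non-uniform weighting are placeholders.

```python
import torch

def merge_state_dicts(long_cot_sd, short_cot_sd, alpha=0.5):
    """Return an element-wise weighted average of two compatible state dicts."""
    assert long_cot_sd.keys() == short_cot_sd.keys(), "models must share architecture"
    return {
        name: alpha * long_cot_sd[name] + (1.0 - alpha) * short_cot_sd[name]
        for name in long_cot_sd
    }

# Usage sketch (paths are placeholders):
# merged = merge_state_dicts(torch.load("long_cot.pt"), torch.load("short_cot.pt"))
# model.load_state_dict(merged)
```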
Shortest Rejection Sampling. We observed that our model generates responses with a large length variation for the same problem. Based on this, we designed the shortest rejection sampling method. This method samples the same question $n$ times (in our experiments, $n = 8$) and selects the shortest correct response for supervised fine-tuning.

DPO. Similar to shortest rejection sampling, we utilize the long-CoT model to generate multiple response samples. The shortest correct solution is selected as the positive sample, while longer responses are treated as negative samples, including both incorrect longer responses and correct longer responses that are 1.5 times longer than the chosen positive sample. These positive-negative pairs form the pairwise preference data used for DPO training.
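Both methods reduce to simple selection rules over a group of sampled responses. The sketch below picks the shortest correct response for SFT and builds DPO preference pairs following the rule above; each response is assumed to carry its text, token length, and a correctness flag, and the 1.5x threshold for correct-but-long negatives follows the description in the text.

```python
def shortest_correct(responses):
    """responses: list of dicts with keys 'text', 'length', 'correct'."""
    correct = [r for r in responses if r["correct"]]
    return min(correct, key=lambda r: r["length"]) if correct else None

def build_dpo_pairs(responses, length_ratio=1.5):
    """Pair the shortest correct response with longer negatives: wrong responses,
    or correct ones at least length_ratio times longer than the chosen one."""
    chosen = shortest_correct(responses)
    if chosen is None:
        return []
    negatives = [
        r for r in responses
        if r is not chosen and (
            not r["correct"] or r["length"] >= length_ratio * chosen["length"]
        )
    ]
    return [(chosen["text"], neg["text"]) for neg in negatives]
```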
Long2short RL. After a standard RL training phase, we select a model that offers the best balance between performance and token efficiency to serve as the base model, and conduct a separate long2short RL training phase. In this second phase, we apply the length penalty introduced in Section 2.3.3 and significantly reduce the maximum rollout length to further penalize responses that exceed the desired length even if they are possibly correct.

2.5 Other Training Details

2.5.1 Pretraining

The Kimi k1.5 base model is trained on a diverse, high-quality multimodal corpus. The language data covers five domains: English, Chinese, Code, Mathematics Reasoning, and Knowledge. Multimodal data, including Captioning, Image-text Interleaving, OCR, Knowledge, and QA datasets, enables our model to acquire vision-language capabilities. Rigorous quality control ensures relevance, diversity, and balance in the overall pretraining dataset. Our pretraining proceeds in three stages: (1) Vision-language pretraining, where a strong language foundation is established, followed by gradual multimodal integration; (2) Cooldown, which consolidates capabilities using curated and synthetic data, particularly for reasoning and knowledge-based tasks; and (3) Long-context activation, extending sequence processing to 131,072 tokens. More details regarding our pretraining efforts can be found in Appendix B.

2.5.2 Vanilla Supervised Finetuning

We create the vanilla SFT corpus covering multiple domains. For non-reasoning tasks, including question-answering, writing, and text processing, we initially construct a seed dataset through human annotation. This seed dataset is used to train a seed model. Subsequently, we collect a diverse set of prompts and employ the seed model to generate multiple responses to each prompt. Annotators then rank these responses and refine the top-ranked response to produce the final version. For reasoning tasks such as math and coding problems, where rule-based and reward-modeling-based verifications are more accurate and efficient than human judgment, we utilize rejection sampling to expand the SFT dataset.

Our vanilla SFT dataset comprises approximately 1 million text examples. Specifically, 500k examples are for general question answering, 200k for coding, 200k for math and science, 5k for creative writing, and 20k for long-context tasks such as summarization, doc-qa, translation, and writing. In addition, we construct 1 million text-vision examples encompassing various categories including chart interpretation, OCR, image-grounded conversations, visual coding, visual reasoning, and math/science problems with visual aids.

We first train the model at the sequence length of 32k tokens for 1 epoch, followed by another epoch at the sequence length of 128k tokens. In the first stage (32k), the learning rate decays from $2 \times 10^{-5}$ to $2 \times 10^{-6}$, before it re-warms up to $1 \times 10^{-5}$ in the second stage (128k) and finally decays to $1 \times 10^{-6}$. To improve training efficiency, we pack multiple training examples into each single training sequence.

2.6 RL Infrastructure

Figure 3: Large Scale Reinforcement Learning Training System for LLM. [(a) System overview: a central master coordinates rollout workers, trainer workers (policy and reference models), reward models (code, math, K-12, vision), and a replay buffer, with weight and data flows between them. (b) Partial rollout: trajectories cut by length are saved to the replay buffer for continuation, while normal stops and repeat/early stops complete within the iteration.]
2.6.1 Large Scale Reinforcement Learning Training System for LLM

In the realm of artificial intelligence, reinforcement learning (RL) has emerged as a pivotal training methodology for large language models (LLMs) (Ouyang et al. 2022; Jaech et al. 2024), drawing inspiration from its success in mastering complex games like Go, StarCraft II, and Dota 2 through systems such as AlphaGo (Silver et al. 2017), AlphaStar (Vinyals et al. 2019), and OpenAI Dota Five (Berner et al. 2019). Following in this tradition, the Kimi k1.5 system adopts an iterative synchronous RL framework, meticulously designed to bolster the model's reasoning capabilities through persistent learning and adaptation. A key innovation in this system is the introduction of a Partial Rollout technique, designed to optimize the handling of complex reasoning trajectories.

The RL training system, as illustrated in Figure 3a, operates through an iterative synchronous approach, with each iteration encompassing a rollout phase and a training phase. During the rollout phase, rollout workers, coordinated by a central master, generate rollout trajectories by interacting with the model, producing sequences of responses to various inputs. These trajectories are then stored in a replay buffer, which ensures a diverse and unbiased dataset for training by disrupting temporal correlations. In the subsequent training phase, trainer workers access these experiences to update the model's weights. This cyclical process allows the model to continuously learn from its actions, adjusting its strategies over time to enhance performance.

The central master serves as the central conductor, managing the flow of data and communication between the rollout workers, trainer workers, evaluation with reward models, and the replay buffer. It ensures that the system operates harmoniously, balancing the load and facilitating efficient data processing.

The trainer workers access these rollout trajectories, whether completed in a single iteration or divided across multiple iterations, to compute gradient updates that refine the model's parameters and enhance its performance. This process is overseen by a reward model, which evaluates the quality of the model's outputs and provides essential feedback to guide the training process. The reward model's evaluations are particularly pivotal in determining the effectiveness of the model's strategies and steering the model towards optimal performance.

Moreover, the system incorporates a code execution service, which is specifically designed to handle code-related problems and is integral to the reward model. This service evaluates the model's outputs in practical coding scenarios, ensuring that the model's learning is closely aligned with real-world programming challenges. By validating the model's solutions against actual code executions, this feedback loop becomes essential for refining the model's strategies and enhancing its performance in code-related tasks.

2.6.2 Partial Rollouts for Long CoT RL

One of the primary ideas of our work is to scale long-context RL training. Partial rollout is a key technique that effectively addresses the challenge of handling long-CoT features by managing the rollouts of both long and short trajectories. This technique establishes a fixed output token budget, capping the length of each rollout trajectory. If a trajectory exceeds the token limit during the rollout phase, the unfinished portion is saved to the replay buffer and continued in the next iteration. This ensures that no single lengthy trajectory monopolizes the system's resources. Moreover, since the rollout workers operate asynchronously, when some are engaged with long trajectories, others can independently process new, shorter rollout tasks. The asynchronous operation maximizes computational efficiency by ensuring that all rollout workers actively contribute to the training process, thereby optimizing the overall performance of the system.

As illustrated in Figure 3b, the partial rollout system works by breaking down long responses into segments across iterations (from iteration n-m to iteration n). The replay buffer acts as a central storage mechanism that maintains these response segments, where only the current iteration (iteration n) requires on-policy computation. Previous segments (iterations n-m to n-1) can be efficiently reused from the buffer, eliminating the need for repeated rollouts. This segmented approach significantly reduces the computational overhead: instead of rolling out the entire response at once, the system processes and stores segments incrementally, allowing for the generation of much longer responses while maintaining fast iteration times. During training, certain segments can be excluded from loss computation to further optimize the learning process, making the entire system both efficient and scalable.

The implementation of partial rollouts also offers repeat detection. The system identifies repeated sequences in the generated content and terminates them early, reducing unnecessary computation while maintaining output quality. Detected repetitions can be assigned additional penalties, effectively discouraging redundant content generation in the prompt set.
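The core bookkeeping of partial rollouts, i.e., a per-iteration token budget with unfinished trajectories saved for later continuation, can be sketched as follows. The `generate` callable, the replay-buffer interface, and the stop-reason handling are simplified assumptions; the production system additionally handles asynchronous workers, repeat detection, and the exclusion of stale segments from the loss.

```python
def rollout_with_budget(prompt, prefix_tokens, generate, budget, eos_id):
    """Continue a (possibly partial) trajectory under a fixed token budget.

    prefix_tokens: tokens already generated in previous iterations (may be empty).
    generate:      callable producing at most `max_new_tokens` new tokens.
    Returns (tokens, finished); finished is False when the budget was hit,
    in which case the caller saves the trajectory back to the replay buffer.
    """
    new_tokens = generate(prompt, prefix_tokens, max_new_tokens=budget)
    tokens = prefix_tokens + new_tokens
    finished = len(new_tokens) < budget or (new_tokens and new_tokens[-1] == eos_id)
    return tokens, finished

def rollout_phase(replay_buffer, prompts, generate, budget, eos_id):
    """One iteration: resume saved partial rollouts first, then start new ones."""
    pending = replay_buffer.pop_partials() + [(p, []) for p in prompts]
    for prompt, prefix in pending:
        tokens, finished = rollout_with_budget(prompt, prefix, generate, budget, eos_id)
        if finished:
            replay_buffer.add_complete(prompt, tokens)
        else:
            replay_buffer.add_partial(prompt, tokens)  # continued next iteration
```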
2.6.3 Hybrid Deployment of Training and Inference

The RL training process comprises the following phases:

Figure 4: Hybrid Deployment Framework. [Within one pod, a Megatron sidecar and a vLLM sidecar share GPUs and exchange weights through a checkpoint engine, shared memory, etcd, and RDMA with other pods; the cycle alternates between training/offload, weight update, rollout, and vLLM termination.]

Training Phase: At the outset, Megatron (Shoeybi et al. 2020) and vLLM (Kwon et al. 2023) are executed within separate containers, encapsulated by a shim process known as checkpoint-engine (Section 2.6.3). Megatron commences the training procedure. After the training is completed, Megatron offloads the GPU memory and prepares to transfer the current weights to vLLM.

Inference Phase: Following Megatron's offloading, vLLM starts with dummy model weights and updates them with the latest ones transferred from Megatron via Mooncake (Qin et al. 2024). Upon completion of the rollout, the checkpoint-engine halts all vLLM processes.

Subsequent Training Phase: Once the memory allocated to vLLM is released, Megatron onloads the memory and initiates another round of training.

We find it challenging for existing works to simultaneously support all of the following characteristics.

Complex parallelism strategy: Megatron may have a different parallelism strategy from vLLM. Training weights distributed across several nodes in Megatron can be challenging to share with vLLM.

Minimizing idle GPU resources: For on-policy RL, recent works such as SGLang (L. Zheng et al. 2024) and vLLM might reserve some GPUs during the training process, which conversely could lead to idle training GPUs. It would be more efficient to share the same devices between training and inference.

Capability of dynamic scaling: In some cases, a significant acceleration can be achieved by increasing the number of inference nodes while keeping the training process constant. Our system enables the efficient utilization of idle GPU nodes when needed.

As illustrated in Figure 4, we implement this hybrid deployment framework (Section 2.6.3) on top of Megatron and vLLM, achieving less than one minute from the training phase to the inference phase and about ten seconds conversely.

Hybrid Deployment Strategy. We propose a hybrid deployment strategy for training and inference tasks, which leverages Kubernetes sidecar containers sharing all available GPUs to collocate both workloads in one pod. The primary advantages of this strategy are:

It facilitates efficient resource sharing and management, preventing training nodes from idling while waiting for inference nodes when both are deployed on separate nodes.

Leveraging distinct deployment images, training and inference can each iterate independently for better performance.

The architecture is not limited to vLLM; other frameworks can be conveniently integrated.

Checkpoint Engine. Checkpoint Engine is responsible for managing the lifecycle of the vLLM process, exposing HTTP APIs that enable triggering various operations on vLLM. For overall consistency and reliability, we utilize a global metadata system managed by the etcd service to broadcast operations and statuses.

It can be challenging to entirely release GPU memory by vLLM offloading, primarily due to CUDA graphs, NCCL buffers, and NVIDIA drivers. To minimize modifications to vLLM, we terminate and restart it when needed for better GPU utilization and fault tolerance.

The worker in Megatron converts the owned checkpoints into the Hugging Face format in shared memory. This conversion also takes Pipeline Parallelism and Expert Parallelism into account so that only Tensor Parallelism remains in these checkpoints. Checkpoints in shared memory are subsequently divided into shards and registered in the global metadata system. We employ Mooncake to transfer checkpoints between peer nodes over RDMA. Some modifications to vLLM are needed to load weight files and perform tensor parallelism conversion.
2.6.4 Code Sandbox

We developed the sandbox as a secure environment for executing user-submitted code, optimized for code execution and code benchmark evaluation. By dynamically switching container images, the sandbox supports different use cases through MultiPL-E (Cassano, Gouwar, D. Nguyen, S. Nguyen, et al. 2023), DMOJ Judge Server, Lean, Jupyter Notebook, and other images.

For RL in coding tasks, the sandbox ensures the reliability of training data judgment by providing consistent and repeatable evaluation mechanisms. Its feedback system supports multi-stage assessments, such as code execution feedback and repo-level editing, while maintaining a uniform context to ensure fair and equitable benchmark comparisons across programming languages.

We deploy the service on Kubernetes for scalability and resilience, exposing it through HTTP endpoints for external integration. Kubernetes features like automatic restarts and rolling updates ensure availability and fault tolerance.

To optimize performance and support RL environments, we incorporate several techniques into the code execution service to enhance efficiency, speed, and reliability. These include:

Using crun: We utilize crun as the container runtime instead of Docker, significantly reducing container startup times.

Cgroup reusing: We pre-create cgroups for container use, which is crucial in scenarios with high concurrency where creating and destroying cgroups for each container can become a bottleneck.

Disk usage optimization: An overlay filesystem with an upper layer mounted as tmpfs is used to control disk writes, providing a fixed-size, high-speed storage space. This approach is beneficial for ephemeral workloads.

(a) Container startup times:
Method    Time (s)
Docker    0.12
Sandbox   0.04

(b) Maximum containers started per second on a 16-core machine:
Method    Containers/sec
Docker    27
Sandbox   120

These optimizations improve RL efficiency in code execution, providing a consistent and reliable environment for evaluating RL-generated code, which is essential for iterative training and model improvement.

3 Experiments

3.1 Evaluation

Since k1.5 is a multimodal model, we conducted a comprehensive evaluation across various benchmarks for different modalities. The detailed evaluation setup can be found in Appendix C. Our benchmarks primarily consist of the following three categories:

Text Benchmark: MMLU (Hendrycks et al. 2020), IF-Eval (J. Zhou et al. 2023), CLUEWSC (L. Xu et al. 2020), C-EVAL (Y. Huang et al. 2023)

Reasoning Benchmark: HumanEval-Mul, LiveCodeBench (Jain et al. 2024), Codeforces, AIME 2024, MATH-500 (Lightman et al. 2023)

Vision Benchmark: MMMU (Yue, Ni, et al. 2024), MATH-Vision (K. Wang et al. 2024), MathVista (Lu et al. 2023)
3.2 Main Results

K1.5 long-CoT model. The performance of the Kimi k1.5 long-CoT model is presented in Table 2. Through long-CoT supervised fine-tuning (described in Section 2.2) and vision-text joint reinforcement learning (discussed in Section 2.3), the model's long-term reasoning capabilities are enhanced significantly. Test-time computation scaling further strengthens its performance, enabling the model to achieve state-of-the-art results across a range of modalities. Our evaluation reveals marked improvements in the model's capacity to reason, comprehend, and synthesize information over extended contexts, representing an advancement in multi-modal AI capabilities.

K1.5 short-CoT model. The performance of the Kimi k1.5 short-CoT model is presented in Table 3. This model integrates several techniques, including traditional supervised fine-tuning (discussed in Section 2.5.2), reinforcement learning (explored in Section 2.3), and long-to-short distillation (outlined in Section 2.4). The results demonstrate that the k1.5 short-CoT model delivers competitive or superior performance compared to leading open-source and proprietary models across multiple tasks. These include text, vision, and reasoning challenges, with notable strengths in natural language understanding, mathematics, coding, and logical reasoning.

Benchmark (Metric)          | QwQ-32B Preview | OpenAI o1-mini | QVQ-72B-Preview | OpenAI o1 | Kimi k1.5
Reasoning
MATH-500 (EM)               | 90.6            | 90.0           | -               | 94.8      | 96.2
AIME 2024 (Pass@1)          | 50.0            | 63.6           | -               | 74.4      | 77.5
Codeforces (Percentile)     | 62              | 88             | -               | 94        | 94
LiveCodeBench (Pass@1)      | 40.6            | 53.1           | -               | 67.2      | 62.5
Vision
MathVista-Test (Pass@1)     | -               | -              | 71.4            | 71.0      | 74.9
MMMU-Val (Pass@1)           | -               | -              | 70.3            | 77.3      | 70.0
MathVision-Full (Pass@1)    | -               | -              | 35.9            | -         | 38.6

Table 2: Performance of Kimi k1.5 long-CoT and flagship open-source and proprietary models. QwQ-32B Preview and OpenAI o1-mini are language-only models; QVQ-72B-Preview, OpenAI o1, and Kimi k1.5 are vision-language models.

Benchmark (Metric)          | Qwen2.5 72B-Inst. | LLaMA-3.1 405B-Inst. | DeepSeek V3 | Qwen2-VL | Claude-3.5-Sonnet-1022 | GPT-4o-0513 | Kimi k1.5
Text
MMLU (EM)                   | 85.3 | 88.6 | 88.5 | -    | 88.3 | 87.2 | 87.4
IF-Eval (Prompt Strict)     | 84.1 | 86.0 | 86.1 | -    | 86.5 | 84.3 | 87.2
CLUEWSC (EM)                | 91.4 | 84.7 | 90.9 | -    | 85.4 | 87.9 | 91.7
C-Eval (EM)                 | 86.1 | 61.5 | 86.5 | -    | 76.7 | 76.0 | 88.3
Reasoning
MATH-500 (EM)               | 80.0 | 73.8 | 90.2 | -    | 78.3 | 74.6 | 94.6
AIME 2024 (Pass@1)          | 23.3 | 23.3 | 39.2 | -    | 16.0 | 9.3  | 60.8
HumanEval-Mul (Pass@1)      | 77.3 | 77.2 | 82.6 | -    | 81.7 | 80.5 | 81.5
LiveCodeBench (Pass@1)      | 31.1 | 28.4 | 40.5 | -    | 36.3 | 33.4 | 47.3
Vision
MathVista-Test (Pass@1)     | -    | -    | -    | 69.7 | 65.3 | 63.8 | 70.1
MMMU-Val (Pass@1)           | -    | -    | -    | 64.5 | 66.4 | 69.1 | 68.0
MathVision-Full (Pass@1)    | -    | -    | -    | 26.6 | 35.6 | 30.4 | 31.0

Table 3: Performance of Kimi k1.5 short-CoT and flagship open-source and proprietary models. Qwen2.5 72B-Inst., LLaMA-3.1 405B-Inst., and DeepSeek V3 are language-only models; Qwen2-VL, Claude-3.5-Sonnet-1022, GPT-4o-0513, and Kimi k1.5 are vision-language models. VLM model performance was obtained from the OpenCompass benchmark platform.
3.3 Long Context Scaling

We employ a mid-sized model to study the scaling properties of RL with LLMs. Figure 5 illustrates the evolution of both training accuracy and response length across training iterations for the small model variant trained on the mathematical prompt set. As training progresses, we observe a concurrent increase in both response length and performance accuracy. Notably, more challenging benchmarks exhibit a steeper increase in response length, suggesting that the model learns to generate more elaborate solutions for complex problems. Figure 6 indicates a strong correlation between the model's output context length and its problem-solving capabilities. Our final run of k1.5 scales to 128k context length and observes continued improvement on hard reasoning benchmarks.

Figure 5: The changes in training accuracy and response length as training iterations grow. [Panels show accuracy and token length over iterations on an overall set and on OMNI-MATH500, MATH500, AIMO2024, AIME2024, ChatGLMMath, GAOKAO, GPQA, Biology, Chemistry, Physics, and KAOYAN.] Note that the scores above come from an internal long-CoT model with a much smaller model size than the k1.5 long-CoT model. The shaded area represents the 95th percentile of the response length.
3.4 Long2short

We compared the proposed long2short RL algorithm with the DPO, shortest rejection sampling, and model merging methods introduced in Section 2.4, focusing on token efficiency for the long2short problem (X. Chen et al. 2024), specifically how an obtained long-CoT model can benefit a short model. In Figure 7, k1.5-long represents our long-CoT model selected for long2short training. k1.5-short w/ rl refers to the short model obtained using the long2short RL training. k1.5-short w/ dpo denotes the short model with improved token efficiency through DPO training. k1.5-short w/ merge represents the model after model merging, while k1.5-short w/ merge+rs indicates the short model obtained by applying shortest rejection sampling to the merged model. k1.5-shortest represents the shortest model we obtained during the long2short training. As shown in Figure 7, the proposed long2short RL algorithm demonstrates the highest token efficiency compared to other methods such as DPO and model merging. Notably, all models in the k1.5 series (marked in orange) demonstrate superior token efficiency compared to other models (marked in blue). For instance, k1.5-short w/ rl achieves a Pass@1 score of 60.8 on AIME 2024 (averaged over 8 runs) while utilizing only 3,272 tokens on average. Similarly, k1.5-shortest attains a Pass@1 score of 88.2 on MATH 500 while consuming approximately the same number of tokens as other short models.

Figure 6: Model performance increases with response length. [Panels plot accuracy against mean token length with fitted trend lines on an overall set and on OMNI-MATH500, MATH500, AIMO2024, AIME2024, ChatGLMMath, GAOKAO, and GPQA.]

Figure 7: Long2short performance. All the k1.5 series demonstrate better token efficiency compared to other models. [Panels plot accuracy against token length on MATH500 and AIME2024 for k1.5-long, the k1.5-short variants, k1.5-shortest, Claude 3.5, DeepSeek-V3, GPT-4-0513, and Qwen2.5-72B-Inst.]
175、d to other models.3.5Ablation StudiesScaling of model size and context lengthOur main contribution is the application of RL to enhance the modelscapacity for generating extended CoT,thereby improving its reasoning ability.A natural question arises:how doesthis compare to simply increasing the model
176、size?To demonstrate the effectiveness of our approach,we trained twomodels of different sizes using the same dataset and recorded the evaluation results and average inference lengthsfrom all checkpoints during RL training.These results are shown in Figure 8.Notably,although the larger modelinitially
177、 outperforms the smaller one,the smaller model can achieve comparable performance by utilizing longer CoTsoptimized through RL.However,the larger model generally shows better token effi ciency than the smaller model.Thisalso indicates that if one targets the best possible performance,scaling the con
178、text length of a larger model has a higherupper bound and is more token effi cient.However,if test-time compute has a budget,training smaller models with alarger context length may be viable solutions.Effects of using negative gradientsWe investigate the effectiveness of using ReST(Gulcehre et al.20
179、23)as the policyoptimization algorithm in our setting.The primary distinction between ReST and other RL-based methods including14Kimi k1.5TECHNICALREPORTours is that ReST iteratively refi nes the model by fi tting the best response sampled from the current model,withoutapplying negative gradients to
180、 penalize incorrect responses.As illustrated in Figure 10,our method exhibits superiorsample complexity compared to ReST,indicating that the incorporation of negative gradients markedly enhances themodels effi ciency in generating long CoT.Our method not only elevates the quality of reasoning but al
181、so optimizesthe training process,achieving robust performance with fewer training samples.This fi nding suggests that the choiceof policy optimization algorithm is crucial in our setting,as the performance gap between ReST and other RL-basedmethods is not as pronounced in other domains(Gulcehre et a
182、l.2023).Therefore,our results highlight the importanceof selecting an appropriate optimization strategy to maximize effectiveness in generating long CoT.Sampling strategiesWe further demonstrate the effectiveness of our curriculum sampling strategy,as introducedin Section 2.3.4.Our training datasetD
183、comprises a diverse mix of problems with varying levels of diffi culty.Withour curriculum sampling method,we initially useDfor a warm-up phase and then focus solely on hard questionsto train the model.This approach is compared to a baseline method that employs a uniform sampling strategywithout any
184、curriculum adjustments.As illustrated in Figure 9,our results clearly show that the proposed curriculumsampling method signifi cantly enhances the performance.This improvement can be attributed to the methods ability toprogressively challenge the model,allowing it to develop a more robust understand
185、ing and competency in handlingcomplex problems.By focusing training efforts on more diffi cult questions after an initial general introduction,themodel can better strengthen its reasoning and problem solving capabilities.2000300040005000Mean Response Length(tokens)0.300.350.400.450.500.550.60Accurac
Figure 8: Model performance vs. response length of different model sizes (accuracy against mean response length, in tokens, for the small and large models on OMNI-MATH500, AIME 2024, and MATH500, each truncated at 60, and on AIMO 2024, with linear trend lines).
Figure 9: Analysis of curriculum learning approaches on model performance (accuracy over training iterations for the baseline with uniform sampling of mixed easy/hard problems versus the curriculum schedule that samples uniformly first and then only hard problems, transitioning at iteration 24).

4 Conclusions
We present the training recipe and system design of k1.5, our latest multi-modal LLM trained with RL. One of the key insights we extract from our practice is that the scaling of context length is crucial to the continued improvement of LLMs. We employ optimized learning algorithms and infrastructure optimizations such as partial rollouts to achieve efficient long-context RL training. How to further improve the efficiency and scalability of long-context RL training remains an important question moving forward.
Figure 10: Comparison with using ReST for policy optimization (accuracy versus training step for ReST and our method on OMNI-MATH500, MATH500, AIMO 2024, AIME 2024, ChatGLM-Math, the GAOKAO benchmark, GPQA, k12-biology, k12-chemistry, k12-physics, KAOYAN, and the overall total).

Another contribution we made is a combination of techniques that enable improved policy optimization. Specifically, we formulate long-CoT RL with LLMs and derive a variant of online mirror descent for robust optimization. We also experiment with sampling strategies, length penalty, and optimizing the data recipe to achieve strong RL performance.

We show that strong performance can be achieved by long context scaling and improved policy optimization, even without using more complex techniques such as Monte Carlo tree search, value functions, and process reward models. In the future, it will also be intriguing to study improving credit assignment and reducing overthinking without hurting the model's exploration abilities.

We have also observed the potential of long2short methods. These methods largely improve the performance of short-CoT models. Moreover, it is possible to combine long2short methods with long-CoT RL in an iterative way to further increase token efficiency and extract the best performance out of a given context length budget.

References
Abbasi-Yadkori, Yasin et al. "Politex: Regret bounds for policy iteration using expert prediction". In: International Conference on Machine Learning. PMLR. 2019, pp. 3692-3702.
Ahmadian, Arash et al. "Back to basics: Revisiting reinforce style optimization for learning from human feedback in llms". In: arXiv preprint arXiv:2402.14740 (2024).
Ankner, Zachary et al. Critique-out-Loud Reward Models. 2024. arXiv: 2408.11791 [cs.LG]. URL: https://arxiv.org/abs/2408.11791.
Berner, Christopher et al. "Dota 2 with large scale deep reinforcement learning". In: arXiv preprint arXiv:1912.06680 (2019).
Cassano, Federico, John Gouwar, Daniel Nguyen, Sy Duy Nguyen, et al. "MultiPL-E: A Scalable and Extensible Approach to Benchmarking Neural Code Generation". In: ArXiv (2022). URL: https://arxiv.org/abs/2208.08227.
Cassano, Federico, John Gouwar, Daniel Nguyen, Sydney Nguyen, et al. "MultiPL-E: A Scalable and Polyglot Approach to Benchmarking Neural Code Generation". In: IEEE Transactions on Software Engineering 49.7 (2023), pp. 3675-3691. DOI: 10.1109/TSE.2023.3267446.
Chen, Jianlv et al. "Bge m3-embedding: Multi-lingual, multi-functionality, multi-granularity text embeddings through self-knowledge distillation". In: arXiv preprint arXiv:2402.03216 (2024).
Chen, Xingyu et al. "Do NOT Think That Much for 2+3=? On the Overthinking of o1-Like LLMs". In: arXiv preprint arXiv:2412.21187 (2024).
Everitt, Tom et al. Reward Tampering Problems and Solutions in Reinforcement Learning: A Causal Influence Diagram Perspective. 2021. arXiv: 1908.04734 [cs.AI]. URL: https://arxiv.org/abs/1908.04734.
Gadre, Samir Yitzhak et al. "Datacomp: In search of the next generation of multimodal datasets". In: Advances in Neural Information Processing Systems 36 (2024).
Grattafiori, Aaron et al. The Llama 3 Herd of Models. 2024. arXiv: 2407.21783 [cs.AI]. URL: https://arxiv.org/abs/2407.21783.
Gulcehre, Caglar et al. "Reinforced self-training (rest) for language modeling". In: arXiv preprint arXiv:2308.08998 (2023).
Hendrycks, Dan et al. "Measuring Massive Multitask Language Understanding". In: ArXiv abs/2009.03300 (2020). URL: https://arxiv.org/abs/2009.03300.
Hoffmann, Jordan et al. Training Compute-Optimal Large Language Models. 2022. arXiv: 2203.15556 [cs.CL]. URL: https://arxiv.org/abs/2203.15556.
Huang, Yuzhen et al. "C-Eval: A Multi-Level Multi-Discipline Chinese Evaluation Suite for Foundation Models". In: ArXiv abs/2305.08322 (2023). URL: https://arxiv.org/abs/2305.08322.
Jaech, Aaron et al. "Openai o1 system card". In: arXiv preprint arXiv:2412.16720 (2024).
Jain, Naman et al. "LiveCodeBench: Holistic and Contamination Free Evaluation of Large Language Models for Code". In: ArXiv abs/2403.07974 (2024). URL: https://arxiv.org/abs/2403.07974.
Joulin, Armand et al. "Bag of tricks for efficient text classification". In: arXiv preprint arXiv:1607.01759 (2016).
Kaplan, Jared et al. Scaling Laws for Neural Language Models. 2020. arXiv: 2001.08361 [cs.LG]. URL: https://arxiv.org/abs/2001.08361.
Kool, Wouter, Herke van Hoof, and Max Welling. "Buy 4 reinforce samples, get a baseline for free!" In: (2019).
Kwon, Woosuk et al. "Efficient Memory Management for Large Language Model Serving with PagedAttention". In: Proceedings of the ACM SIGOPS 29th Symposium on Operating Systems Principles. 2023.
Laurençon, Hugo et al. "Obelics: An open web-scale filtered dataset of interleaved image-text documents". In: Advances in Neural Information Processing Systems 36 (2024).
Li, Jeffrey et al. "Datacomp-lm: In search of the next generation of training sets for language models". In: arXiv preprint arXiv:2406.11794 (2024).
Li, Ming et al. "From quantity to quality: Boosting llm performance with self-guided data selection for instruction tuning". In: arXiv preprint arXiv:2308.12032 (2023).
Li, Raymond et al. StarCoder: may the source be with you! 2023. arXiv: 2305.06161 [cs.CL]. URL: https://arxiv.org/abs/2305.06161.
Lightman, Hunter et al. "Let's Verify Step by Step". In: arXiv preprint arXiv:2305.20050 (2023).
Liu, Wei et al. "What makes good data for alignment? a comprehensive study of automatic data selection in instruction tuning". In: arXiv preprint arXiv:2312.15685 (2023).
Lozhkov, Anton et al. StarCoder 2 and The Stack v2: The Next Generation. 2024. arXiv: 2402.19173 [cs.SE]. URL: https://arxiv.org/abs/2402.19173.
Lu, Pan et al. "Mathvista: Evaluating mathematical reasoning of foundation models in visual contexts". In: arXiv preprint arXiv:2310.02255 (2023).
McAleese, Nat et al. LLM Critics Help Catch LLM Bugs. 2024. arXiv: 2407.00215 [cs.SE]. URL: https://arxiv.org/abs/2407.00215.
Mei, Jincheng et al. "On principled entropy exploration in policy optimization". In: Proceedings of the 28th International Joint Conference on Artificial Intelligence. 2019, pp. 3130-3136.
Muennighoff, Niklas et al. Scaling Data-Constrained Language Models. 2023. arXiv: 2305.16264 [cs.CL]. URL: https://arxiv.org/abs/2305.16264.
Nachum, Ofir et al. "Bridging the gap between value and policy based reinforcement learning". In: Advances in Neural Information Processing Systems 30 (2017).
OpenAI. "Learning to reason with LLMs". 2024. URL: https://openai.com/index/learning-to-reason-with-llms/.
Ouyang, Long et al. "Training language models to follow instructions with human feedback". In: Advances in Neural Information Processing Systems 35 (2022), pp. 27730-27744.
Pan, Alexander, Kush Bhatia, and Jacob Steinhardt. "The Effects of Reward Misspecification: Mapping and Mitigating Misaligned Models". In: International Conference on Learning Representations. 2022.
Paster, Keiran et al. "Openwebmath: An open dataset of high-quality mathematical web text". In: arXiv preprint arXiv:2310.06786 (2023).
Penedo, Guilherme et al. "The fineweb datasets: Decanting the web for the finest text data at scale". In: arXiv preprint arXiv:2406.17557 (2024).
Qin, Ruoyu et al. Mooncake: A KVCache-centric Disaggregated Architecture for LLM Serving. 2024. arXiv: 2407.00079 [cs.DC]. URL: https://arxiv.org/abs/2407.00079.
Rafailov, Rafael et al. "Direct preference optimization: Your language model is secretly a reward model". In: Advances in Neural Information Processing Systems 36 (2024).
Schuhmann, Christoph et al. "Laion-5b: An open large-scale dataset for training next generation image-text models". In: Advances in Neural Information Processing Systems 35 (2022), pp. 25278-25294.
Shoeybi, Mohammad et al. Megatron-LM: Training Multi-Billion Parameter Language Models Using Model Parallelism. 2020. arXiv: 1909.08053 [cs.CL]. URL: https://arxiv.org/abs/1909.08053.
Silver, David et al. "Mastering the game of go without human knowledge". In: Nature 550.7676 (2017), pp. 354-359.
Snell, Charlie et al. "Scaling llm test-time compute optimally can be more effective than scaling model parameters". In: arXiv preprint arXiv:2408.03314 (2024).
Su, Dan et al. "Nemotron-CC: Transforming Common Crawl into a Refined Long-Horizon Pretraining Dataset". In: arXiv preprint arXiv:2412.02595 (2024).
Su, Jianlin et al. "Roformer: Enhanced transformer with rotary position embedding". In: Neurocomputing 568 (2024), p. 127063.
Team, Gemini et al. Gemini: A Family of Highly Capable Multimodal Models. 2024. arXiv: 2312.11805 [cs.CL]. URL: https://arxiv.org/abs/2312.11805.
Tomar, Manan et al. "Mirror descent policy optimization". In: arXiv preprint arXiv:2005.09814 (2020).
Vaswani, Ashish et al. "Attention is All you Need". In: Advances in Neural Information Processing Systems. Ed. by I. Guyon et al. Vol. 30. Curran Associates, Inc., 2017. URL: https://proceedings.neurips.cc/paper_files/paper/2017/file/3f5ee243547dee91fbd053c1c4a845aa-Paper.pdf.
Villalobos, Pablo et al. Will we run out of data? Limits of LLM scaling based on human-generated data. 2024. arXiv: 2211.04325 [cs.LG]. URL: https://arxiv.org/abs/2211.04325.
Vinyals, Oriol et al. "Grandmaster level in StarCraft II using multi-agent reinforcement learning". In: Nature 575.7782 (2019), pp. 350-354.
Wang, Ke et al. "Measuring multimodal mathematical reasoning with math-vision dataset". In: arXiv preprint arXiv:2402.14804 (2024).
Wei, Haoran et al. "General OCR Theory: Towards OCR-2.0 via a Unified End-to-end Model". In: arXiv preprint arXiv:2409.01704 (2024).
Wei, Jason et al. "Chain-of-thought prompting elicits reasoning in large language models". In: Advances in Neural Information Processing Systems 35 (2022), pp. 24824-24837.
Wu, Yangzhen et al. "Inference scaling laws: An empirical analysis of compute-optimal inference for problem-solving with language models". In: arXiv preprint arXiv:2408.00724 (2024).
Xu, Liang et al. "CLUE: A Chinese Language Understanding Evaluation Benchmark". In: International Conference on Computational Linguistics. 2020. URL: https://arxiv.org/abs/2004.05986.
Yang, Enneng et al. "Model merging in llms, mllms, and beyond: Methods, theories, applications and opportunities". In: arXiv preprint arXiv:2408.07666 (2024).
Yao, Shunyu et al. "Tree of thoughts: Deliberate problem solving with large language models". In: Advances in Neural Information Processing Systems 36 (2024).
Yue, Xiang, Yuansheng Ni, et al. "Mmmu: A massive multi-discipline multimodal understanding and reasoning benchmark for expert agi". In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2024, pp. 9556-9567.
Yue, Xiang, Xingwei Qu, et al. "Mammoth: Building math generalist models through hybrid instruction tuning". In: arXiv preprint arXiv:2309.05653 (2023).
Zhang, Lunjun et al. "Generative verifiers: Reward modeling as next-token prediction". 2024. URL: https://arxiv.org/abs/2408.15240.
Zheng, Lianmin et al. SGLang: Efficient Execution of Structured Language Model Programs. 2024. arXiv: 2312.07104 [cs.AI]. URL: https://arxiv.org/abs/2312.07104.
Zhou, Jeffrey et al. "Instruction-Following Evaluation for Large Language Models". In: ArXiv abs/2311.07911 (2023). URL: https://arxiv.org/abs/2311.07911.
Zhu, Wanrong et al. "Multimodal c4: An open, billion-scale corpus of images interleaved with text". In: Advances in Neural Information Processing Systems 36 (2024).
Appendix

A Contributions

Research & Development
Angang Du, Bofei Gao, Bowei Xing, Changjiu Jiang, Cheng Chen, Cheng Li, Chenjun Xiao, Chenzhuang Du, Chonghua Liao*, Dehao Zhang, Enming Yuan, Enzhe Lu, Flood Sung, Guokun Lai, Haiqing Guo, Han Zhu, Hao Ding, Hao Hu, Hao Yang, Hao Zhang, Haotian Yao, Haotian Zhao, Haoyu Lu, Hongcheng Gao, Huan Yuan, Huabin Zheng, Jingyuan Liu, Jianlin Su, Jianzhou Wang, Jin Zhang, Junjie Yan, Lidong Shi, Longhui Yu, Mengnan Dong, Neo Zhang, Ningchen Ma*, Qiwei Pan, Qucheng Gong, Shupeng Wei, Shaowei Liu, Tao Jiang, Weimin Xiong, Weiran He, Weihao Gao*, Weixiao Huang, Wenhao Wu, Wenyang He, Xianqing Jia, Xingzhe Wu, Xinran Xu, Xinyu Zhou, Xinxing Zu, Xuehai Pan, Yang Li, Yangyang Hu, Yangyang Liu, Yanru Chen, Yejie Wang, Yidao Qin, Yibo Liu, Yiping Bao, Yifeng Liu*, Yulun Du, Yuzhi Wang, Yuxin Wu, Y. Charles, Zaida Zhou, Zhaoji Wang, Zhaowei Li, Zheng Zhang, Zhexu Wang, Zhiqi Huang, Zhilin Yang, Ziyao Xu, Zonghan Yang

Data Annotation
Chuning Tang, Congcong Wang, Fengxiang Tang, Guangda Wei, Haoze Li, Haozhen Yu, Jia Chen, Jianhang Guo, Jie Zhao, Junyan Wu, Ling Ye, Shengling Ma, Sihan Cao, Siying Huang, Xianghui Wei, Yangyang Liu, Ying Yang, Zhen Zhu, Zihao Huang

The listing of authors is in alphabetical order based on their first names. Names marked with an asterisk (*) indicate people who are no longer part of our team.
B Pretraining
Reinforcement learning (RL) efficiency is closely tied to the performance of the underlying base model. Frontier models such as Gemini (Team et al. 2024) and Llama (Grattafiori et al. 2024) highlight the importance of pretraining data quality in achieving high performance. However, many recent open-source models lack full transparency regarding their data processing pipelines and recipes, creating challenges for broader community understanding. While we are not open-sourcing our proprietary model at this time, we are committed to providing a comprehensive disclosure of our data pipeline and methodologies. In this section, we focus primarily on the multimodal pretraining data recipe, followed by a brief discussion of the model architecture and training stages.

B.1 Language Data
Our pretrain corpus is designed to provide comprehensive and high-quality data for training large language models (LLMs). It encompasses five domains: English, Chinese, Code, Mathematics & Reasoning, and Knowledge. We employ sophisticated filtering and quality control mechanisms for each domain to ensure the highest quality training data. For all pretrain data, we conducted rigorous individual validation for each data source to assess its specific contribution to the overall training recipe. This systematic evaluation ensures the quality and effectiveness of our diverse data composition.

English and Chinese textual data   We developed a multi-dimensional quality filtering framework that combines multiple scoring methods to reduce individual biases and ensure comprehensive quality assessment. Our framework incorporates:
1. Rule-based filtering: We implement domain-specific heuristics to remove problematic content, including duplicate content, machine-translated text, and low-quality web scrapes. We also filter out documents with excessive special characters, unusual formatting, or spam patterns.
2. FastText-based classification: We trained specialized FastText (Joulin et al. 2016; J. Li et al. 2024) models to identify content quality based on linguistic features and semantic coherence. This helps identify documents with natural language flow and proper grammatical structure.
3. Embedding-based similarity analysis: Using document embeddings (Jianlv Chen et al. 2024), we compute document-level similarity scores to identify and remove near-duplicates while preserving semantically valuable variations. This approach helps maintain diversity in our training corpus.
4. LLM-based quality assessment: Following (Penedo et al. 2024), we leverage LLMs to score documents based on coherence, informativeness, and potential educational value. This method is particularly effective at identifying nuanced quality indicators that simpler methods might miss.

The final quality score for each document is calculated as a combination of these individual scores. Based on extensive empirical analysis, we implement dynamic sampling rates, where high-quality documents are upsampled, while low-quality documents are downsampled during training.
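As an illustration of how such signals might be combined (the weights, score names, and sampling thresholds below are assumptions for the sketch, not the actual recipe), one could write:

    def document_quality(scores, weights=None):
        # Combine the individual quality signals (rule-based, FastText, embedding
        # similarity, LLM-based) into a single score via a weighted average.
        weights = weights or {k: 1.0 for k in scores}
        total = sum(weights[k] * scores[k] for k in scores)
        return total / sum(weights[k] for k in scores)

    def sampling_rate(quality, low=0.3, high=0.8):
        # Dynamic sampling: downsample low-quality documents, keep mid-quality
        # documents as-is, and upsample high-quality documents.
        if quality < low:
            return 0.2      # seen with probability 0.2
        if quality > high:
            return 2.0      # repeated roughly twice on average
        return 1.0

    doc_scores = {"rule": 0.9, "fasttext": 0.7, "embedding": 0.8, "llm": 0.95}
    rate = sampling_rate(document_quality(doc_scores))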
Code data   The code data primarily consists of two categories. For the pure code data derived from code files, we adhered to the methodology of BigCode (R. Li et al. 2023; Lozhkov et al. 2024) and conducted a comprehensive preprocessing of the dataset. Initially, we eliminated miscellaneous languages and applied a rule-based cleaning procedure to enhance data quality. Subsequently, we addressed language imbalance through strategic sampling techniques. Specifically, markup languages such as JSON, YAML, and YACC were down-sampled, while 32 major programming languages, including Python, C, C++, Java, and Go, were up-sampled to ensure a balanced representation. Regarding the text-code interleaved data sourced from various data sources, we use an embedding-based method to recall high-quality data. This approach ensures the diversity of the data and maintains its high quality.
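A minimal sketch of this kind of language rebalancing (the language lists and sampling multipliers here are illustrative assumptions, not the actual ratios) might look like:

    import random

    # Hypothetical per-language sampling multipliers: markup/config languages are
    # down-sampled, major programming languages are up-sampled.
    LANGUAGE_WEIGHTS = {"json": 0.2, "yaml": 0.2, "python": 1.5, "cpp": 1.5, "java": 1.5}

    def rebalance(files, default_weight=1.0):
        # Keep or repeat each file according to its language weight: a weight of 0.2
        # keeps roughly 20% of files, while 1.5 keeps every file and duplicates ~50%.
        out = []
        for f in files:
            w = LANGUAGE_WEIGHTS.get(f["lang"], default_weight)
            copies = int(w) + (1 if random.random() < w - int(w) else 0)
            out.extend([f] * copies)
        return out

    corpus = [{"path": "a.py", "lang": "python"}, {"path": "b.yaml", "lang": "yaml"}]
    balanced = rebalance(corpus)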
Math & Reasoning data   The mathematics and reasoning component of our dataset is crucial for developing strong analytical and problem-solving capabilities. The mathematical pre-training data are mainly retrieved from web text and PDF documents collected from publicly available internet sources (Paster et al. 2023). Initially, we discovered that our general-domain text extraction, data cleaning process, and OCR models exhibited high false negative rates in the mathematical domain. Therefore, we first developed specialized data cleaning procedures and OCR models specifically for mathematical content, aiming to maximize the recall rate of mathematical data. Subsequently, we implemented a two-stage data cleaning process (sketched below):
1. Using a FastText model for initial cleaning to remove most irrelevant data.
2. Utilizing a fine-tuned language model to further clean the remaining data, resulting in high-quality mathematical data.
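As a schematic only (both scoring functions below are stubs standing in for the FastText classifier and the fine-tuned language model; the thresholds are arbitrary), the two-stage cleaning can be expressed as:

    def stage1_fasttext_score(text):
        # Placeholder for a FastText relevance classifier; in practice this would be
        # a trained model's predicted probability that the document is mathematical.
        return 1.0 if "theorem" in text or "=" in text else 0.0

    def stage2_lm_score(text):
        # Placeholder for the fine-tuned language model used in the second stage,
        # returning a quality score in [0, 1].
        return min(1.0, len(text) / 200.0)

    def two_stage_clean(documents, p1=0.5, p2=0.7):
        # Stage 1 removes most irrelevant data; stage 2 keeps only high-quality docs.
        survivors = [d for d in documents if stage1_fasttext_score(d) >= p1]
        return [d for d in survivors if stage2_lm_score(d) >= p2]

    docs = ["Let x + y = 2. Then ...", "breaking celebrity news", "A theorem on primes. " * 20]
    cleaned = two_stage_clean(docs)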
Knowledge data   The knowledge corpus is meticulously curated to ensure comprehensive coverage of academic disciplines. Our knowledge base primarily consists of academic exercises, textbooks, research papers, and other general educational literature. A significant portion of these materials is digitized through OCR processing, for which we have developed proprietary models optimized for academic content, particularly for handling mathematical formulas and special symbols. We employ internal language models to annotate documents with multi-dimensional labels, including:
1. OCR quality metrics to assess recognition accuracy
2. Educational value indicators measuring pedagogical relevance
3. Document type classification (e.g., exercises, theoretical materials)

Based on these multi-dimensional annotations, we implement a sophisticated filtering and sampling pipeline. First and foremost, documents are filtered through OCR quality thresholds. Our OCR quality assessment framework places special attention on detecting and filtering out common OCR artifacts, particularly repetitive text patterns that often indicate recognition failures.

Beyond basic quality control, we carefully evaluate the educational value of each document through our scoring system. Documents with high pedagogical relevance and knowledge depth are prioritized, while maintaining a balance between theoretical depth and instructional clarity. This helps ensure that our training corpus contains high-quality educational content that can effectively contribute to the model's knowledge acquisition.

Finally, to optimize the overall composition of our training corpus, the sampling strategy for different document types is empirically determined through extensive experimentation. We conduct isolated evaluations to identify document subsets that contribute most significantly to the model's knowledge acquisition capabilities. These high-value subsets are upsampled in the final training corpus. However, to maintain data diversity and ensure model generalization, we carefully preserve a balanced representation of other document types at appropriate ratios. This data-driven approach helps us optimize the trade-off between focused knowledge acquisition and broad generalization capabilities.
B.2 Multimodal Data
Our multi-modal pretraining corpus is designed to provide high-quality data that enables models to process and understand information from multiple modalities, including text, images, and videos. To this end, we have curated high-quality data from five categories, namely captioning, interleaving, OCR (Optical Character Recognition), knowledge, and general question answering, to form the corpus.

When constructing our training corpus, we developed several multi-modal data processing pipelines to ensure data quality, encompassing filtering, synthesis, and deduplication. Establishing an effective multi-modal data strategy is crucial during the joint training of vision and language, as it both preserves the capabilities of the language model and facilitates alignment of knowledge across diverse modalities. We provide a detailed description of these sources in this section, organized into the following categories.

Caption data   Our caption data provides the model with fundamental modality alignment and a broad range of world knowledge. By incorporating caption data, the multi-modal LLM gains wider world knowledge with high learning efficiency. We have integrated various open-source Chinese and English caption datasets (Schuhmann et al. 2022; S. Y. Gadre et al. 2024) and also collected substantial in-house caption data from multiple sources. However, throughout the training process, we strictly limit the proportion of synthetic caption data to mitigate the risk of hallucination stemming from insufficient real-world knowledge.

For general caption data, we follow a rigorous quality control pipeline that avoids duplication and maintains high image-text correlation. We also vary image resolution during pretraining to ensure that the vision tower remains effective when processing images of both high and low resolution.
Image-text interleaving data   During the pretraining phase, the model benefits from interleaving data in many respects: multi-image comprehension can be boosted by interleaving data; interleaving data often provides detailed knowledge for a given image; and the ability to learn from longer multi-modal contexts can also be gained from interleaving data. Moreover, we find that interleaving data contributes positively to maintaining the model's language abilities. Image-text interleaving data is therefore an important part of our training corpus. Our multi-modal corpus includes open-sourced interleaved datasets such as (Zhu et al. 2024; Laurençon et al. 2024), and we also constructed large-scale in-house data using resources such as textbooks, webpages, and tutorials. Further, we find that synthesizing interleaving data benefits the performance of the multi-modal LLM in retaining text knowledge. To ensure that the knowledge associated with each image is sufficiently studied, for all interleaving data, beyond the standard filtering, deduplication, and other quality control pipelines, we also integrated a data reordering procedure to keep all images and text in the correct order.
OCR data   Optical Character Recognition (OCR) is a widely adopted technique that converts text from images into an editable format. In k1.5, a robust OCR capability is deemed essential for better aligning the model with human values. Accordingly, our OCR data sources are diverse, ranging from open-source to in-house datasets, and encompassing both clean and augmented images.

In addition to the publicly available data, we have developed a substantial volume of in-house OCR datasets, covering multilingual text, dense text layouts, web-based content, and handwritten samples. Furthermore, following the principles outlined in OCR 2.0 (H. Wei et al. 2024), our model is also equipped to handle a variety of optical image types, including figures, tables, geometry diagrams, mermaid plots, and natural scene text. We apply extensive data augmentation techniques, such as rotation, distortion, color adjustments, and noise addition, to enhance the model's robustness. As a result, our model achieves a high level of proficiency in OCR tasks.
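For concreteness, an augmentation pipeline of this kind could be assembled with standard image libraries; the specific transforms and parameter values below are illustrative choices, not the ones used for k1.5.

    import torch
    from torchvision import transforms

    # Illustrative OCR augmentation pipeline: rotation, perspective distortion,
    # color adjustments, blur, and additive Gaussian noise.
    ocr_augment = transforms.Compose([
        transforms.RandomRotation(degrees=5),
        transforms.RandomPerspective(distortion_scale=0.2, p=0.5),
        transforms.ColorJitter(brightness=0.2, contrast=0.2, saturation=0.1),
        transforms.GaussianBlur(kernel_size=3),
        transforms.ToTensor(),
        transforms.Lambda(lambda x: (x + 0.02 * torch.randn_like(x)).clamp(0.0, 1.0)),
    ])

    # Usage: augmented = ocr_augment(pil_image) for a PIL image of a document page.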
Knowledge data   The concept of multi-modal knowledge data is analogous to the previously mentioned text pretraining data, except that here we focus on assembling a comprehensive repository of human knowledge from diverse sources to further enhance the model's capabilities. For example, carefully curated geometry data in our dataset is vital for developing visual reasoning skills, ensuring the model can interpret the abstract diagrams created by humans.

Our knowledge corpus adheres to a standardized taxonomy to balance content across various categories, ensuring diversity in data sources. Similar to text-only corpora, which gather knowledge from textbooks, research papers, and other academic materials, multi-modal knowledge data employs both a layout parser and an OCR model to process content from these sources, and we also include filtered data from internet-based and other external resources.

Because a significant portion of our knowledge corpus is sourced from internet-based materials, infographics can cause the model to focus solely on OCR-based information. In such cases, relying exclusively on a basic OCR pipeline may limit training effectiveness. To address this, we have developed an additional pipeline that better captures the purely textual information embedded within images.

General QA Data   During the training process, we observed that incorporating a substantial volume of high-quality QA datasets into pretraining offers significant benefits. Specifically, we included rigorous academic datasets addressing tasks such as grounding, table/chart question answering, web agents, and general QA. In addition, we compiled a large amount of in-house QA data to further enhance the model's capabilities. To maintain balanced difficulty and diversity, we applied scoring models and meticulous manual categorization to our general question answering dataset, resulting in overall performance improvements.
B.3 Model Architecture
Kimi k-series models employ a variant of the Transformer decoder (Vaswani et al. 2017) that integrates multimodal capabilities alongside improvements in architecture and optimization strategies, as illustrated in Figure 11. These advancements collectively support stable large-scale training and efficient inference, tailored specifically to large-scale reinforcement learning and the operational requirements of Kimi users.

Extensive scaling experiments indicate that most of the base model performance comes from improvements in the quality and diversity of the pretraining data. Specific details regarding model architecture scaling experiments lie beyond the scope of this report and will be addressed in future publications.

Figure 11: Kimi k1.5 supports interleaved images and text as input, leveraging large-scale reinforcement learning to enhance the model's reasoning capabilities (diagram: a Transformer consuming text sequences and interleaved image-text sequences, trained with large-scale reinforcement learning).

B.4 Training Stages
The Kimi k1.5 model is trained in three stages: the vision-language pretraining stage, the vision-language cooldown stage, and the long-context activation stage. Each stage of the Kimi k1.5 model's training focuses on a particular capability enhancement.

Vision-language pretraining stage   In this stage, the model is first trained solely on language data, establishing a robust language model foundation. Then the model is gradually introduced to interleaved vision-language data, acquiring multimodal capabilities. The visual tower is initially trained in isolation without updating the language model parameters; we then unfreeze the language model layers and ultimately increase the proportion of vision-text data to 30%. The final data mixtures and their respective weights were determined through ablation studies conducted on smaller models.
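A minimal sketch of this staged unfreezing, assuming a generic model with separate vision_tower and language_model submodules (the class, attribute names, and stage logic are illustrative, not the k1.5 training code), is:

    import torch.nn as nn

    class ToyVLM(nn.Module):
        # Stand-in for a vision-language model with the two submodules discussed.
        def __init__(self):
            super().__init__()
            self.vision_tower = nn.Linear(16, 8)
            self.language_model = nn.Linear(8, 8)

    def configure_stage(model, train_language_model: bool):
        # First train the visual tower in isolation (language model frozen),
        # then unfreeze the language model layers for joint training.
        for p in model.vision_tower.parameters():
            p.requires_grad = True
        for p in model.language_model.parameters():
            p.requires_grad = train_language_model

    vlm = ToyVLM()
    configure_stage(vlm, train_language_model=False)  # visual tower only
    configure_stage(vlm, train_language_model=True)   # later: joint training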
Vision-language cooldown stage   The second stage serves as a cooldown phase, where the model continues training on high-quality language and vision-language datasets to ensure superior performance. Through empirical investigation, we observed that the incorporation of synthetic data during the cooldown phase yields significant performance improvements, particularly in mathematical reasoning, knowledge-based tasks, and code generation. The English and Chinese components of the cooldown dataset are curated from high-fidelity subsets of the pre-training corpus. For the math, knowledge, and code domains, we employ a hybrid approach: utilizing selected pre-training subsets while augmenting them with synthetically generated content. Specifically, we leverage existing mathematical, knowledge, and code corpora as source material to generate question-answer pairs through a proprietary language model, implementing rejection sampling techniques to maintain quality standards (Yue, Qu, et al. 2023; D. Su et al. 2024). These synthesized QA pairs undergo comprehensive validation before being integrated into the cooldown dataset.
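As a generic illustration of rejection sampling for synthetic QA (the generator, scorer, candidate count, and threshold below are placeholders; the proprietary models and validation criteria are not specified in this report), one might write:

    def rejection_sample_qa(source_text, generate, score, n_candidates=8, threshold=0.8):
        # Draw several candidate QA pairs from a generator model conditioned on the
        # source document and keep only those whose quality score clears a threshold.
        kept = []
        for _ in range(n_candidates):
            qa = generate(source_text)
            if score(qa) >= threshold:
                kept.append(qa)
        return kept

    # Toy usage with stub generator/scorer functions.
    gen = lambda text: {"question": "What does the passage state?", "answer": text[:40]}
    scorer = lambda qa: 0.9 if qa["answer"] else 0.0
    pairs = rejection_sample_qa("Sample source passage about quadratic equations.", gen, scorer)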
Long-context activation stage   Finally, in the third stage, k1.5 is trained with upsampled long-context cooldown data, enabling it to process extended sequences and support tasks that demand longer context. To ensure excellent long-text capabilities of the base model, we upsampled long-context data and used 40% full attention data and 60% partial attention data during long-context training. The full attention data came partly from high-quality natural data and partly from synthetic long-context Q&A and summary data. The partial attention data came from uniform sampling of cooldown data. The RoPE frequency (J. Su et al. 2024) was set to 1,000,000. During this stage, we gradually extended length activation training by increasing the maximum sequence length from 4,096 to 32,768, and ultimately to 131,072.
C Evaluation Details

C.1 Text Benchmark
MMLU (Hendrycks et al. 2020) covers 57 subjects in STEM, the humanities, social sciences, and more. It ranges in difficulty from an elementary level to an advanced professional level, and it tests both world knowledge and problem-solving ability.

IF-Eval (J. Zhou et al. 2023) is a benchmark for evaluating large language models' ability to follow verifiable instructions. There are 500+ prompts with instructions such as "write an article with more than 800 words", etc. Due to a version shift, the IF-Eval number reported in Table 3 was derived from an intermediate model. We will update the scores based on the final model.

CLUEWSC (L. Xu et al. 2020) is a coreference resolution task in the CLUE benchmark, requiring models to determine whether a pronoun and a noun phrase in a sentence co-refer, with data drawn from Chinese fiction books.

C-EVAL (Y. Huang et al. 2023) is a comprehensive Chinese evaluation suite for assessing advanced knowledge and reasoning abilities of foundation models. It includes 13,948 multiple-choice questions across 52 disciplines and four difficulty levels.
293、k and MBPP benchmark to 18 languages that encompass a range of programming24Kimi k1.5TECHNICALREPORTparadigms and popularity.We choose HumanEval translations in 8 mainstream programming languages(Python,Java,Cpp,C#,JavaScript,TypeScript,PHP,and Bash).LiveCodeBench(Jain et al.2024)serves as a compreh
294、ensive and contamination-free benchmark for assessing largelanguage models(LLMs)in coding tasks.It features live updates to prevent data contamination,holistic evaluationacross multiple coding scenarios,high-quality problems and tests,and balanced problem diffi culty.We test short-CoTmodel with ques
295、tions from 2408-2411(release v4),and long-CoT model with questions from 2412-2502(release v5).AIME 2024 comprises the competition questions for the AIME in 2024.The AIME is a prestigious,invitation-onlymath contest for top high school students,assessing advanced math skills and requiring solid found
296、ation and highlogical thinking.MATH-500(Lightman et al.2023)is a comprehensive mathematics benchmark that contains 500 problems on variousmathematicstopicsincludingalgebra,calculus,probability,andmore.Testsbothcomputationalabilityandmathematicalreasoning.Higher scores indicate stronger mathematical
297、problem-solving capabilities.Codeforces is a well-known online judge platform and serves as a popular testbed for evaluating long-CoT codingmodels.To achieve higher rankings in the Div2 and Div3 competitions,we utilize majority voting on the code snippetsgenerated by the k1.5 long-CoT model,employin
298、g test cases that are also generated by the same model.The percentileof the codeforce ELO rating was extracted from OpenAI Day12 talk3c.C.3Image BenchmarkMMMU(Yue,Ni,et al.2024)encompasses a carefully curated collection of 11.5K multimodal questions sourcedfrom college exams,quizzes,and textbooks.Th
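As a simplified sketch of majority voting over generated solutions using model-generated tests (the run_solution helper and the voting granularity are assumptions; the actual submission pipeline is not described further in this report):

    from collections import Counter

    def majority_vote(solutions, test_inputs, run_solution):
        # Execute every candidate solution on the model-generated test inputs and
        # return the candidate whose outputs agree with the most other candidates.
        signatures = [tuple(run_solution(sol, t) for t in test_inputs) for sol in solutions]
        most_common_signature, _ = Counter(signatures).most_common(1)[0]
        return solutions[signatures.index(most_common_signature)]

    # Toy usage: "solutions" are callables standing in for generated programs.
    sols = [lambda x: x * 2, lambda x: x * 2, lambda x: x + 2]
    tests = [1, 2, 3]
    best = majority_vote(sols, tests, run_solution=lambda sol, t: sol(t))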
C.3 Image Benchmark
MMMU (Yue, Ni, et al. 2024) encompasses a carefully curated collection of 11.5K multimodal questions sourced from college exams, quizzes, and textbooks. These questions span six major academic fields: Art & Design, Business, Science, Health & Medicine, Humanities & Social Science, and Tech & Engineering.

MATH-Vision (MATH-V) (K. Wang et al. 2024) is a carefully curated collection of 3,040 high-quality mathematical problems with visual contexts that are sourced from real math competitions. It covers 16 distinct mathematical disciplines and is graded across 5 levels of difficulty. This dataset offers a comprehensive and diverse set of challenges, making it ideal for evaluating the mathematical reasoning abilities of LMMs.

MathVista (Lu et al. 2023) is a benchmark that integrates challenges from a variety of mathematical and visual tasks, demanding participants to exhibit fine-grained, deep visual understanding along with compositional reasoning to successfully complete the tasks.

³ https://