KIMI K1.5: SCALING REINFORCEMENT LEARNING WITH LLMS

TECHNICAL REPORT OF KIMI K1.5

Kimi Team

ABSTRACT

Language model pretraining with next token prediction has proved effective for scaling compute but is limited by the amount of available training data. Scaling reinforcement learning (RL) unlocks a new axis for the continued improvement of artificial intelligence, with the promise that large language models (LLMs) can scale their training data by learning to explore with rewards. However, prior published work has not produced competitive results. In light of this, we report on the training practice of Kimi k1.5, our latest multi-modal LLM trained with RL, including its RL training techniques, multi-modal data recipes, and infrastructure optimization. Long context scaling and improved policy optimization methods are key ingredients of our approach, which establishes a simplistic, effective RL framework without relying on more complex techniques such as Monte Carlo tree search, value functions, and process reward models. Notably, our system achieves state-of-the-art reasoning performance across multiple benchmarks and modalities (e.g., 77.5 on AIME, 96.2 on MATH 500, 94th percentile on Codeforces, 74.9 on MathVista), matching OpenAI's o1. Moreover, we present effective long2short methods that use long-CoT techniques to improve short-CoT models, yielding state-of-the-art short-CoT reasoning results (e.g., 60.8 on AIME, 94.6 on MATH 500, 47.3 on LiveCodeBench), outperforming existing short-CoT models such as GPT-4o and Claude Sonnet 3.5 by a large margin (up to +550%).

[Figure: Kimi k1.5 long-CoT performance compared with OpenAI o1, OpenAI o1-mini, QwQ-32B-Preview, and QVQ-72B-Preview across Vision (MathVista, MMMU), Code (Codeforces, LiveCodeBench v5 24.12-25.2), and Math (MATH 500, AIME) benchmarks.]