ML-Summit  www.cpp-  www.ml-summit.org  www.gosim.org  www.pm-summit.org

Ganqu Cui (崔淦渠)
Young Scientist, Shanghai AI Laboratory

Ganqu Cui is a Young Scientist at Shanghai AI Laboratory. He received his PhD from the Department of Computer Science at Tsinghua University, advised by Associate Professor Zhiyuan Liu. His research focuses on alignment and reinforcement learning for large language models. He has published more than ten papers at top international AI conferences and journals, including ICML, NeurIPS, ICLR, ACL, and KDD, with over 8,000 citations on Google Scholar.

Talk title: PRIME: Reinforcement Learning for Large Models with Implicit Process Rewards

2025 Global Machine Learning Technology Conference

Reinforcement Learning and Implicit Process Rewards
Starting from DeepSeek R1
Ganqu Cui, Shanghai AI Laboratory

CONTENTS
Why RL?
DeepSeek-R1
Challenge of Process Reward
Implicit PRM & PRIME

01 Why RL?

Ilya Sutskever at NeurIPS 2024
Go beyond imitation
Pre-training will End?

The next Scaling Law?
Why Reinforcement Learning
One thing that should be learned from the bitter lesson is the great power of general purpose methods, of methods that continue to scale with increased computation even as the available computation becomes very great.
The two methods that seem to scale arbitrarily in this way are search and learning.
Richard Sutton (ACM Turing Award), The Bitter Lesson
Pretraining and finetuning
Reinforcement learning

Some of the AI breakthroughs in the past 10 years
Why Reinforcement Learning
AlphaGo
AlphaStar
AlphaProof
AlphaTensor

Some of the AI breakthroughs in the past year
Why Reinforcement Learning
OpenAI o1
DeepSeek R1

Recap: Reinforcement Learning
https://lilianweng.github.io/posts/2018-02-19-rl-overview/
The agent takes actions in an environment to maximize cumulative rewards
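
The recap above (an agent acting in an environment to maximize cumulative reward) can be illustrated with a minimal sketch. Everything here is an assumption made for illustration and does not come from the talk: the `ToyCorridor` environment, the always-move-right policy, and the discount factor `gamma` are hypothetical. With `gamma=1.0` the return is the plain sum of rewards.

```python
from typing import Callable, List, Tuple


def discounted_return(rewards: List[float], gamma: float = 1.0) -> float:
    """Cumulative reward G = sum_t gamma**t * r_t for one episode."""
    g = 0.0
    # Accumulate from the last step backwards: G_t = r_t + gamma * G_{t+1}.
    for r in reversed(rewards):
        g = r + gamma * g
    return g


class ToyCorridor:
    """Hypothetical environment: start at position 0, reward 1.0 on reaching `goal`."""

    def __init__(self, goal: int = 3) -> None:
        self.goal = goal
        self.pos = 0

    def reset(self) -> int:
        self.pos = 0
        return self.pos

    def step(self, action: int) -> Tuple[int, float, bool]:
        # action is +1 (right) or -1 (left); the position cannot go below 0.
        self.pos = max(0, self.pos + action)
        done = self.pos == self.goal
        reward = 1.0 if done else 0.0
        return self.pos, reward, done


def run_episode(env: ToyCorridor, policy: Callable[[int], int],
                max_steps: int = 20) -> List[float]:
    """Agent-environment loop: observe state, act, collect the reward."""
    state = env.reset()
    rewards: List[float] = []
    for _ in range(max_steps):
        action = policy(state)
        state, reward, done = env.step(action)
        rewards.append(reward)
        if done:
            break
    return rewards


# Illustrative policy that always moves right.
rewards = run_episode(ToyCorridor(goal=3), policy=lambda state: 1)
total = discounted_return(rewards, gamma=0.9)
print(rewards, total)
```

In an LLM setting the "actions" would be generated tokens and the reward would typically arrive at the end of a response, but the loop has the same shape.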
02 DeepSeek-R1

DeepSeek-R1
Key factors in scalable RL for LLMs
A strong base policy
DeepSeek-V3 671B
Unhackable, accurate rewards
Simple policy gradient works well
GRPO REINFORCE + Avg. as baseline
Guo et al. DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcemen