Baichuan-M2: Scaling Medical Capability with Large Verifier System

Baichuan-M2 Team

Abstract

As large language models (LLMs) advance in conversational and reasoning capabilities, their practical application in healthcare has become a critical research focus. However, there is a notable gap between the performance of medical LLMs on static benchmarks such as USMLE and their utility in real-world clinical decision-making. This discrepancy arises because traditional exams fail to capture the dynamic, interactive nature of medical consultations. To address this challenge, we introduce a novel dynamic verification framework that moves beyond static answer verifiers, establishing a large-scale, high-fidelity interactive reinforcement learning system. Our framework comprises two key components: a Patient Simulator that creates realistic clinical environments using de-identified medical records, and a Clinical Rubrics Generator that dynamically produces multi-dimensional evaluation metrics. Building on this foundation, we develop Baichuan-M2, a 32B-parameter medical augmented reasoning model trained through a multi-stage reinforcement learning strategy with an improved Group Relative Policy Optimization (GRPO) algorithm. Evaluated on HealthBench, Baichuan-M2 outperforms all other open-source models and most advanced closed-source counterparts, achieving a score above 32 on the challenging HealthBench Hard benchmark, a level previously exceeded only by GPT-5. Our work demonstrates that a robust dynamic verifier system is essential for aligning LLM capabilities with practical clinical applications, establishing a new Pareto front in the performance-parameter trade-off for medical AI deployment.

1 Introduction

As the conversational and reasoning capabilities of large language models (LLMs) continue to advance, there is increasing