2025-08-19
Ovis2.5 Technical Report
Ovis Team, Alibaba Group
https://huggingface.co/AIDC-AI/Ovis2.5-9B
https:/

We present Ovis2.5, a successor to Ovis2 designed for native-resolution visual perception and strong multimodal reasoning. Ovis2.5 integrates a native-resolution vision transformer that processes images at their native, variable resolutions. This avoids the degradation caused by fixed-resolution tiling and preserves both the fine detail and the global layout that are crucial for visually dense content such as complex charts. To strengthen reasoning, we train the model to move beyond linear chain-of-thought and perform reflection, including self-checking and revision. This capability is exposed as an optional “thinking mode” at inference time, allowing users to trade latency for higher accuracy on difficult inputs. The model is trained via a comprehensive five-phase curriculum that progressively builds its skills: it begins with foundational visual and multimodal pretraining, advances through large-scale instruction tuning, and culminates in alignment and reasoning enhancement using DPO and GRPO. To scale these upgrades efficiently, we employ multimodal data packing and hybrid parallelism, yielding a significant end-to-end speedup. We release two open-source models: Ovis2.5-9B and Ovis2.5-2B. The latter continues the “small model, big performance” philosophy of Ovis2, making it well suited to resource-constrained, on-device scenarios. On the OpenCompass multimodal leaderboard, Ovis2.5-9B averages 78.3. This marks a substantial improvement over its predecessor, Ovis2-8B, and achieves state-of-the-art results among open-source MLLMs in the sub-40B parameter range. Ovis2.5-2B scores 73.9, establishing SOTA for its size. Beyond aggregate scores, Ovis2.5 achieves leading results on STEM benchmarks, exhibits strong capabilities on grounding and