November 27, 2025

Qwen3-VL Technical Report

Qwen Team
https://chat.qwen.ai
https://huggingface.co/Qwen

We introduce Qwen3-VL, the most capable vision-language model in the Qwen series to date, achieving superior performance across a broad range of multimodal benchmarks. It natively supports interleaved contexts of up to 256K tokens, seamlessly integrating text, images, and video. The model family includes both dense (2B/4B/8B/32B) and mixture-of-experts (30B-A3B/235B-A22B) variants to accommodate diverse latency-quality trade-offs. Qwen3-VL delivers three core pillars: (i) markedly stronger pure-text understanding, surpassing comparable text-only backbones in several cases; (ii) robust long-context comprehension with a native 256K-token window for both text and interleaved multimodal inputs, enabling faithful retention, retrieval, and cross-referencing across long documents and videos; and (iii) advanced multimodal reasoning across single-image, multi-image, and video tasks, demonstrating leading performance on comprehensive evaluations such as MMMU and visual-math benchmarks (e.g., MathVista and MathVision). Architecturally, we introduce three key upgrades: (i) an enhanced interleaved-MRoPE for stronger spatial-temporal modeling across images and video; (ii) DeepStack integration, which effectively leverages multi-level ViT features to tighten vision-language alignment; and (iii) text-based time alignment for video, evolving from T-RoPE to explicit textual timestamp alignment for more precise temporal grounding. To balance text-only and multimodal learning objectives, we apply square-root reweighting, which boosts multimodal performance without compromising text capabilities. We extend pretraining to a context length of 256K tokens and bifurcate post-training into non-thinking and thinking variants to address distinct applications.
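The square-root reweighting mentioned above can be illustrated with a small sketch. One common reading of such schemes is that each data source's sampling weight is made proportional to the square root of its size rather than to its raw size, so a very large text-only corpus no longer dominates smaller multimodal sources. The source names and corpus sizes below are hypothetical, and the report body should be consulted for the exact formulation:

```python
import math

def sqrt_reweight(source_sizes):
    """Normalized sampling weights proportional to sqrt(source size).

    Compared with proportional sampling, large sources are down-weighted
    and small sources up-weighted, which is one way to keep multimodal
    data visible next to a much larger text-only corpus.
    """
    raw = {name: math.sqrt(n) for name, n in source_sizes.items()}
    total = sum(raw.values())
    return {name: w / total for name, w in raw.items()}

# Hypothetical corpus sizes (in tokens); names are illustrative only.
weights = sqrt_reweight({
    "text": 1_000_000,
    "image_text": 250_000,
    "video_text": 40_000,
})
```

With these sizes, the text source's share drops from roughly 78% under proportional sampling to about 59% under square-root reweighting, while the two multimodal sources gain correspondingly.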
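The interleaved-MRoPE upgrade can likewise be sketched. In chunked multimodal RoPE variants, the temporal, height, and width axes each own a contiguous block of rotary frequencies; an interleaved layout instead distributes the three axes across the frequency spectrum. The assignment rule below (axis chosen by frequency index modulo 3) is an illustrative guess, not the report's exact scheme, and real implementations fold these angles into paired-dimension rotations inside attention:

```python
import numpy as np

def interleaved_mrope_angles(t, h, w, dim=64, base=10000.0):
    """Rotary angles for one (t, h, w) position, with the three axes
    interleaved over frequency indices instead of split into blocks.

    Sketch only: dim/base and the modulo-3 interleaving are assumptions
    for illustration; each returned angle corresponds to one rotary pair.
    """
    idx = np.arange(dim // 2)
    inv_freq = base ** (-2.0 * idx / dim)   # standard RoPE frequency ladder
    pos = np.array([t, h, w])[idx % 3]      # interleave t/h/w across frequencies
    return pos * inv_freq

# A single position at frame 2, row 3, column 5 (all values hypothetical).
angles = interleaved_mrope_angles(t=2, h=3, w=5)
```

The point of interleaving is that every axis receives both high- and low-frequency components, rather than one axis being confined to a single band.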