Seed1.5-VL Technical Report

ByteDance Seed

See Contributions and Acknowledgments section for a full author list.

Abstract

We present Seed1.5-VL, a vision-language foundation model designed to advance general-purpose multimodal understanding and reasoning. Seed1.5-VL is composed of a 532M-parameter vision encoder and a Mixture-of-Experts (MoE) LLM with 20B active parameters. Despite its relatively compact architecture, it delivers strong performance across a wide spectrum of public VLM benchmarks and internal evaluation suites, achieving state-of-the-art performance on 38 out of 60 public benchmarks. Moreover, in agent-centric tasks such as GUI control and gameplay, Seed1.5-VL outperforms leading multimodal systems, including OpenAI CUA and Claude 3.7. Beyond visual and video understanding, it also demonstrates strong reasoning abilities, making it particularly effective for multimodal reasoning challenges such as visual puzzles. We believe these capabilities will empower broader applications across diverse tasks. In this report, we mainly provide a comprehensive review of our experiences in building Seed1.5-VL across model design, data construction, and training at various stages, hoping that this report can inspire further research. Seed1.5-VL is now accessible on Volcano Engine (Model ID: doubao-1-5-thinking-vision-pro-250428).

Date: June 13, 2025

Contents

1 Introduction
2 Architecture
  2.1 Vision Encoder
    2.1.1 Architecture
    2.1.2 ViT Pre-training Stage
  2.2 Video Encoding
3 Pre-training
  3.1 Pre-training Data
    3.1.1 Generic Image-Text Pairs & Knowledge Data
    3.1.2 Optical Character Recognition (OCR)
    3.1.3 Visual Grounding & Counting
    3.1.4 3D Spatial Understanding
    3.1.5 Video
    3.1.6 Science, Technology, Engineering, and Mathematics (STEM)
    3.1.7 Graphical User Interface (GUI)
  3.2 Training Recipe
  3.3 S