DeepSeek-VL: Towards Real-World Vision-Language Understanding

Haoyu Lu*, Wen Liu*, Bo Zhang*, Bingxuan Wang, Kai Dong, Bo Liu, Jingxiang Sun, Tongzheng Ren, Zhuoshu Li, Hao Yang, Yaofeng Sun, Chengqi Deng, Hanwei Xu, Zhenda Xie, Chong Ruan

DeepSeek-AI

We present DeepSeek-VL, an open-source Vision-Language (VL) Model designed for real-world vision and language understanding applications. Our approach is structured around three key dimensions:

Data Construction: We strive to ensure our data is diverse, scalable, and extensively covers real-world scenarios including web screenshots, PDFs, OCR, charts, and knowledge-based content (expert knowledge, textbooks), aiming for a comprehensive representation of practical contexts. Further, we create a use case taxonomy from real user scenarios and construct an instruction-tuning dataset accordingly. Fine-tuning with this dataset substantially improves the model's user experience in practical applications.

Model Architecture: Considering efficiency and the demands of most real-world scenarios, DeepSeek-VL incorporates a hybrid vision encoder that efficiently processes high-resolution images (1024 x 1024) within a fixed token budget, while maintaining a relatively low computational overhead. This design choice ensures the model's ability to capture critical semantic and detailed information across various visual tasks.
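To make the fixed-token-budget idea concrete, below is a minimal PyTorch sketch of one plausible hybrid design: a low-resolution branch for global semantics plus a high-resolution branch for fine detail, fused into a constant number of visual tokens. The branch structures, layer sizes, and the 576-token budget are illustrative assumptions for exposition, not details taken from this abstract.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class HybridVisionEncoder(nn.Module):
    """Sketch: two branches, one fixed token budget (24x24 = 576 tokens)."""

    def __init__(self, dim: int = 1024, grid: int = 24):
        super().__init__()
        self.grid = grid
        # Semantic branch: patch embedding over a downsampled 384x384 view.
        self.semantic_patch = nn.Conv2d(3, dim, kernel_size=16, stride=16)
        # Detail branch: strided conv over the full 1024x1024 image.
        self.detail_patch = nn.Conv2d(3, dim, kernel_size=32, stride=32)
        # Fuse the two feature maps per spatial location.
        self.fuse = nn.Linear(2 * dim, dim)

    def forward(self, image: torch.Tensor) -> torch.Tensor:
        # image: (B, 3, 1024, 1024)
        lo = F.interpolate(image, size=(384, 384), mode="bilinear")
        sem = self.semantic_patch(lo)                # (B, dim, 24, 24)
        det = self.detail_patch(image)               # (B, dim, 32, 32)
        det = F.adaptive_avg_pool2d(det, self.grid)  # align grids to 24x24
        tokens = torch.cat([sem, det], dim=1)        # (B, 2*dim, 24, 24)
        tokens = tokens.flatten(2).transpose(1, 2)   # (B, 576, 2*dim)
        return self.fuse(tokens)                     # fixed 576 visual tokens

enc = HybridVisionEncoder()
print(enc(torch.randn(1, 3, 1024, 1024)).shape)      # torch.Size([1, 576, 1024])
```

The point of the sketch is the invariant: however much detail the high-resolution branch extracts, the language model always receives the same number of visual tokens, which keeps the computational overhead predictable.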
Training Strategy: We posit that a proficient Vision-Language Model should, foremost, possess strong language abilities. To ensure the preservation of LLM capabilities during pretraining, we investigate an effective VL pretraining strategy by integrating LLM training from the beginning and carefully managing the competitive dynamics observed between vision and language modalities. Starting with a focus on text, we gradually adjust the ratio to facilitate a balanced integration of the two modalities.
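One way to read "starting with a focus on text, gradually adjust the ratio" is as a scheduled data mixture during pretraining. The sketch below shows that idea; the linear schedule, the final 70/30 split, and the sampling helper are illustrative assumptions, not values given in this abstract.

```python
import random

def language_fraction(step: int, total_steps: int,
                      start: float = 1.0, end: float = 0.7) -> float:
    """Fraction of each batch drawn from text-only data.

    Begins at `start` (pure language modeling, preserving LLM capabilities)
    and decays linearly to `end` as vision-language data is phased in.
    The 1.0 -> 0.7 endpoints are assumptions for illustration.
    """
    t = min(step / total_steps, 1.0)
    return start + t * (end - start)

def sample_batch(step, total_steps, text_pool, vl_pool, batch_size=8):
    frac = language_fraction(step, total_steps)
    n_text = round(frac * batch_size)
    batch = random.sample(text_pool, n_text)            # text-only examples
    batch += random.sample(vl_pool, batch_size - n_text)  # image-text examples
    return batch

text_pool = [f"text_{i}" for i in range(100)]
vl_pool = [f"image_text_{i}" for i in range(100)]
for step in (0, 500, 1000):
    batch = sample_batch(step, total_steps=1000,
                         text_pool=text_pool, vl_pool=vl_pool)
    print(step, sum(x.startswith("text_") for x in batch), "text items of 8")
```

Under this reading, early batches are entirely text, so language ability is established first; vision-language data then enters gradually rather than competing with language modeling from step zero.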