DataFunSummit #2023

MiniGPT-4: Enhancing Vision-Language Understanding with Advanced Large Language Models
Deyao Zhu, PhD student, King Abdullah University of Science and Technology (KAUST)

Motivation
GPT-4 achieves next-level vision-language abilities, such as:
- Explaining the funny part of a meme
- Creating a website from a hand-drawn draft

Such abilities were never shown in previous SOTA methods like DeepMind's Flamingo [1] or Salesforce's BLIP-2 [2]. Nobody knows how they do it!

What is the secret of GPT-4's vision-language abilities?
- Fancy large datasets, with data like draft-to-website image pairs?
- Secret model architectures?
- Or just an advanced large language model?

Before MiniGPT-4: ChatCaptioner [3]
A conversation system toward enriched image descriptions:
- BLIP-2 cannot describe images in great detail, but it can provide image details when asked the right questions.
- Prompt ChatGPT to keep asking questions about image details; BLIP-2 answers each one.
- Use ChatGPT to summarize the conversation into the final description.

What we learned from ChatCaptioner
- The vision part of BLIP-2 can provide rich information.
- But the language part is not strong enough to follow users' instructions.

Before MiniGPT-4
It might be possible to simply align BLIP-2's vision component with a better language model to achieve much stronger vision-language instruction-following ability. Is this the secret of GPT-4's vision-language abilities?

References:
[1] Alayrac J-B, Donahue J, Luc P, et al. Flamingo: a visual language model for few-shot learning. NeurIPS 2022.
[2] Li J, Li D, Savarese S, et al. BLIP-2: Bootstrapping language-image pre-training with frozen image encoders and large language models. arXiv preprint, 2023.
[3] Zhu D, Chen J, Haydarov K, et al. ChatGPT Asks, BLIP-2 Answers: Automatic Questioning Towards Enriched Visual Descriptions. arXiv preprint, 2023.
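The alignment idea above can be sketched in a few lines. This is a minimal NumPy illustration, not MiniGPT-4's actual implementation: all dimensions, names, and the random features are hypothetical. The point is that the only new, trainable piece is a linear projection mapping the frozen vision component's output tokens into the language model's input-embedding space, after which visual and text tokens are consumed as one sequence.

```python
import numpy as np

VISION_DIM = 768   # hypothetical frozen vision-encoder feature size
LLM_DIM = 4096     # hypothetical LLM input-embedding size

rng = np.random.default_rng(0)

# The only trainable piece in this sketch: a linear projection from the
# frozen vision component's output space into the LLM's embedding space.
W = rng.standard_normal((VISION_DIM, LLM_DIM)) * 0.02
b = np.zeros(LLM_DIM)

def project_visual_tokens(visual_features: np.ndarray) -> np.ndarray:
    """Map frozen visual features (n_tokens, VISION_DIM) to pseudo word
    embeddings (n_tokens, LLM_DIM) that the LLM can consume."""
    return visual_features @ W + b

# Stand-in for 32 output tokens from the frozen vision side.
visual_features = rng.standard_normal((32, VISION_DIM))
visual_embeds = project_visual_tokens(visual_features)

# Stand-in for the embedded text prompt, e.g. "Describe this image."
text_embeds = rng.standard_normal((5, LLM_DIM))

# The (frozen) LLM then sees [visual tokens ; text tokens] as one sequence.
llm_input = np.concatenate([visual_embeds, text_embeds], axis=0)
print(llm_input.shape)  # (37, 4096)
```

In this view, both the vision encoder and the language model stay frozen; aligning them reduces to training the projection so that visual tokens land where the LLM expects meaningful word embeddings.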