3-1 An Image-to-Text Generative Model and Its Applications in the Multimodal Domain

Uploaded by: 云闲 | ID: 102334 | 2021-01-01 | PDF | 19 pages | 2.73 MB

Part of the collection: DataFunSummit: 2022 NLP Summit speaker slides


GIT: A Generative Image-to-text Transformer for Vision and Language
Jianfeng Wang, Principal Researcher, Microsoft Cloud & AI

Example captions:
A cartoon illustration of a pikachu and a mouse talking to each other.
a cell phone screen shows the time of 1:44 wednesday, november 4.
A rollpack sign that says $14.88 on it
A white background with the numbers 49a785 and c392z6.
A text that says marlow processes are stochastic processes, traditionally in discrete or continuous time, that have the marlow property.
A gold and brown sign that says university of colorado 1876.

GIT: A Generative Image-to-text Transformer
[Figure: GIT architecture. An image encoder feeds a text decoder built from tokenize & embed, multi-head self-attention, and feed-forward blocks. Panels: (a) Pre-training/captioning, with the example caption "a tecsun radio with the time of 12:54."; (b) VQA, with "Q: what time is it? A: 12:54", where the question tokens follow BOS and the decoder produces the answer and EOS; (c) Video, where Frame 1 through Frame 6 each pass through the image encoder and receive temporal embedding 1 through temporal embedding 6.]

Jianfeng Wang, Zhengyuan Yang, Xiaowei Hu, Linjie Li, Kevin Lin, Zhe Gan, Zicheng Liu, Ce Liu, Lijuan Wang; GIT: A Generative Image-to-text Transformer for Vision and Language; arxiv

one image encoder (Florence/CoSwin) + one text decoder
pretrain on 0.8 billion image-text pairs

Relation with existing approaches

Novel object captioning (nocaps)
  Existing approaches: tags as extra input (object detector/classifier/CLIP)
  Ours: no such dependency

Scene-text related tasks
  Existing approaches: OCR text as extra input
  Ours: no such dependency

vs Flamingo/Coca
  GIT (ours): smaller model size / fewer data, better performance

                      Model   Data              COCO    nocaps   TextVQA   VizWiz-QA   VATEX   YouCook2
  Flamingo (Deepmind)  80B     2.3B+27M/video    138.1   -        54.1      65.4        84.2    118.6
  Coca (Google)        2.1B    4.8B              143.6   120.6    -         -           -       -
  GIT (ours)           0.7B    0.8B              144.8   123.4    59.8      67.5        93.8    129.8

Data & model scaling
Data 4M
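The three panels of the architecture figure (captioning, VQA, video) differ mainly in how the decoder's text sequence and the visual tokens are assembled. The following is a minimal sketch of that assembly in plain Python, not the authors' implementation. The token strings `[BOS]` and `[EOS]`, all function names, and the toy feature vectors are assumptions made for illustration. The loss mask that covers only the answer and end token is likewise an assumption about how the VQA panel is trained, since the slides do not spell it out.

```python
from typing import List, Tuple

Vector = List[float]

# Placeholder token strings; the real tokenizer's special tokens are not given in the slides.
BOS, EOS = "[BOS]", "[EOS]"


def captioning_sequence(caption: List[str]) -> List[str]:
    """Panel (a): the decoder sequence is BOS, the caption tokens, then EOS."""
    return [BOS] + caption + [EOS]


def vqa_sequence(question: List[str], answer: List[str]) -> Tuple[List[str], List[bool]]:
    """Panel (b): the question acts as a text prefix and the answer is generated after it.

    Returns the token sequence and a per-token mask marking which positions
    would contribute to the training loss (assumed: answer tokens and EOS only).
    """
    tokens = [BOS] + question + answer + [EOS]
    prefix_len = 1 + len(question)  # BOS plus the question tokens
    loss_mask = [False] * prefix_len + [True] * (len(answer) + 1)
    return tokens, loss_mask


def video_tokens(frame_features: List[List[Vector]],
                 temporal_embeddings: List[Vector]) -> List[Vector]:
    """Panel (c): add each frame's temporal embedding to that frame's
    image-encoder tokens, then concatenate the frames in order."""
    if len(frame_features) != len(temporal_embeddings):
        raise ValueError("need exactly one temporal embedding per frame")
    tokens: List[Vector] = []
    for features, embedding in zip(frame_features, temporal_embeddings):
        for vec in features:
            tokens.append([v + e for v, e in zip(vec, embedding)])
    return tokens


if __name__ == "__main__":
    seq, mask = vqa_sequence(["what", "time", "is", "it", "?"], ["12:54"])
    print(seq)
    print(mask)
```

In this sketch the same decoder interface serves all three panels: only the text prefix and the number of visual tokens change, which matches the single image encoder plus single text decoder design stated on the slides.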

Report summary

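This talk introduces GIT, a Generative Image-to-text Transformer. The model uses multi-head self-attention and feed-forward networks. It sets a new state of the art on 12 image/video captioning and question-answering tasks and also performs well on scene-text recognition. Compared with the existing Flamingo and Coca models, GIT has a smaller model size and uses less training data, yet performs better on multiple tasks. The captions GIT predicts cover diverse entities and concepts, and the model supports open-vocabulary visual question answering.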
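The "smaller model, less data" claim can be made concrete with a little arithmetic on the figures from the slide's comparison table. The sketch below only divides those reported numbers. The dictionary layout and the helper name `ratio_vs_git` are illustrative assumptions, and the Flamingo data figure leaves out the extra 27M videos listed on the slide.

```python
# Model size (billions of parameters) and pre-training data (billions of
# image-text pairs) as listed in the slide's comparison table.
MODELS = {
    "Flamingo": {"params_b": 80.0, "pairs_b": 2.3},  # plus 27M videos, not counted here
    "Coca": {"params_b": 2.1, "pairs_b": 4.8},
    "GIT": {"params_b": 0.7, "pairs_b": 0.8},
}


def ratio_vs_git(name: str, field: str) -> float:
    """How many times larger `name` is than GIT on the given field, to one decimal."""
    return round(MODELS[name][field] / MODELS["GIT"][field], 1)


if __name__ == "__main__":
    for name in ("Flamingo", "Coca"):
        print(
            f"{name}: {ratio_vs_git(name, 'params_b')}x the parameters, "
            f"{ratio_vs_git(name, 'pairs_b')}x the image-text pairs of GIT"
        )
```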

