GUI Agents Based on Multimodal Large Language Models
Dr. Zhang Chi (张驰)
Tencent QQ Guangying Lab (腾讯QQ光影实验室) / Westlake University AGI Lab

Speaker: Zhang Chi
Named to the 2023 World's Top 2% Scientists list released by Stanford University.
Dr. Zhang Chi is currently a research scientist at Tencent. He will join Westlake University in fall 2024 as an assistant professor and doctoral supervisor, and will found the AGI Lab there. He received his Ph.D. in 2021 from the School of Computer Science at Nanyang Technological University, Singapore. His doctoral research focused on efficient, low-cost machine learning and computer vision algorithms; his current research focuses on multimodal models and AIGC models in the era of large models. He has published multiple papers in top international conferences and journals, including CVPR, ICCV, NeurIPS, and TPAMI.

Contents
1. Background: Large Language Models and Agents
2. Difficulties and Challenges
3. User Interfaces and Human-Computer Interaction
4. GUI-Based Agent: AppAgent

Large language models as agent
Prompt: Imagine that you are a home robot, and is asked to go to the kitchen to fetch something from the fridge. The image below shows your current position. Please plan your next action.
GPT-4V: My next action would be turn right and move forward towards the hallway, as I can see that the kitchen is likely located in that direction. I would then continue to navigate through the hallway until I reach the kitchen and locate the fridge.

Large language models as agent
Prompt: Now you are at the kitchen, this is what you can see from there. Can you see the fridge from here? What is your next action?
GPT-4V: From this position, I can see the fridge on the right side of the kitchen. My next action would be to move forward and turn slightly to the right to approach the fridge.

Large language models as agent
Prompt: Imagine that you are a robot operating a computer. You are asked to find a detailed recipe of Mapo Tofu. Below is what you see on the computer screen, predict your next move.
GPT-4V: Next move: I would move the mouse to the Google Chrome icon located on the left side of the screen, approximately 1/3 of the way down from the top. I would then click on the icon to open the web browser.

[Diagram labels: Decision, Perception, Action]

Challenges
Limitation of text-only input in current agent design
Control alone is a big discipline
Insufficient knowledge in LLM & high learning cost

A Simple Agent Prototype

Significance of Interface
Advantages
Actions have clear semantic meaning, p