Gemini 1.5: Unlocking multimodal understanding across millions of tokens of context

Gemini Team, Google

In this report, we introduce the Gemini 1.5 family of models, representing the next generation of highly compute-efficient multimodal models capable of recalling and reasoning over fine-grained information from millions of tokens of context, including multiple long documents and hours of video and audio. The family includes two new models: (1) an updated Gemini 1.5 Pro, which exceeds the February version on the great majority of capabilities and benchmarks; (2) Gemini 1.5 Flash, a more lightweight variant designed for efficiency with minimal regression in quality. Gemini 1.5 models achieve near-perfect recall on long-context retrieval tasks across modalities, improve the state-of-the-art in long-document QA, long-video QA, and long-context ASR, and match or surpass Gemini 1.0 Ultra's state-of-the-art performance across a broad set of benchmarks. Studying the limits of Gemini 1.5's long-context ability, we find continued improvement in next-token prediction and near-perfect retrieval (>99%) up to at least 10M tokens, a generational leap over existing models such as Claude 3.0 (200k) and GPT-4 Turbo (128k). Finally, we highlight real-world use cases, such as Gemini 1.5 collaborating with professionals on completing their tasks, achieving 26 to 75% time savings across 10 different job categories, as well as surprising new capabilities of large language models at the frontier; when given a grammar manual for Kalamang, a language with fewer than 200 speakers worldwide, the model learns to translate English to Kalamang at a similar level to a person who learned from the same content.

1. Introduction

We present our latest multimodal models from the Gemini line: Gemini 1.5 Pro and Gemini 1.5 Flash. They are members of Gemini 1.5, a new