Unleashing the Potential of Open Large Models at Exceptional Price-Performance: A Complete Guide to Inference Optimization on TPU
Yang Guoqiang, Google Cloud / AI Infra Architect

Agenda
01 TPU overview and version evolution
02 Key techniques for deploying open models on TPU
03 TPU hardware characteristics and how they differ from GPU

[Inference architecture diagram (frameworks: vLLM / Max* / JetStream / Pathways): the user application sends an HTTP/gRPC request; inside the Inference Server, queries pass through a queue to the scheduler, which also exports metrics; the scheduler hands batched queries to the Inference Engine (batching + model) running on the TPU hardware; the response returns to the application over HTTP/gRPC.]
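As a reading aid, the diagram's request path can be condensed into a toy serving loop. Every name below (InferenceServer, Query, max_batch_size) is illustrative and not taken from vLLM, JetStream, or Pathways:

```python
import queue
from dataclasses import dataclass

@dataclass
class Query:
    request_id: str
    prompt: str

class InferenceServer:
    """Toy mirror of the diagram: queue -> scheduler -> batched engine call."""

    def __init__(self, engine, max_batch_size: int = 8):
        self.request_queue: "queue.Queue[Query]" = queue.Queue()
        self.engine = engine                # wraps the model on the TPU
        self.max_batch_size = max_batch_size

    def submit(self, q: Query) -> None:
        # The HTTP/gRPC handler enqueues each incoming query.
        self.request_queue.put(q)

    def step(self) -> dict[str, str]:
        # Scheduler: pull up to max_batch_size queued queries into one batch.
        batch = []
        while len(batch) < self.max_batch_size and not self.request_queue.empty():
            batch.append(self.request_queue.get())
        if not batch:
            return {}
        # Inference engine: one batched forward pass on the hardware.
        outputs = self.engine.generate([q.prompt for q in batch])
        return {q.request_id: text for q, text in zip(batch, outputs)}
```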
Model-to-framework mapping:
- PyTorch-based LLMs (Llama, DeepSeek): vLLM
- JAX-based LLMs (Gemma): JetStream (MaxText)
- Diffusion models: MaxDiffusion

vLLM overview
Goal: optimize KV-cache memory utilization to serve models at high throughput.
Insight: manage the KV cache with virtual-memory-style paging to reduce memory waste.
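The paging insight can be shown with a toy block table: the KV cache is divided into fixed-size blocks, and each sequence maps to physical blocks on demand, just as virtual memory maps pages. This is a minimal sketch of the idea, not vLLM's actual data structure; BLOCK_SIZE and the class name are assumptions:

```python
# Toy PagedAttention-style block table: the KV cache is split into
# fixed-size blocks, allocated on demand so a sequence only consumes
# the memory it actually uses.
BLOCK_SIZE = 16  # tokens per KV block (assumed)

class PagedKVCache:
    def __init__(self, num_blocks: int):
        self.free_blocks = list(range(num_blocks))
        self.block_tables: dict[str, list[int]] = {}  # seq_id -> physical blocks

    def ensure_capacity(self, seq_id: str, num_tokens: int) -> None:
        """Grow seq_id's block table until it can hold num_tokens."""
        table = self.block_tables.setdefault(seq_id, [])
        needed = -(-num_tokens // BLOCK_SIZE)  # ceiling division
        while len(table) < needed:
            if not self.free_blocks:
                raise MemoryError("KV cache exhausted; preempt or swap a sequence")
            table.append(self.free_blocks.pop())

    def free(self, seq_id: str) -> None:
        # Finished sequences return their blocks immediately, so no memory
        # is stranded by over-provisioned contiguous allocations.
        self.free_blocks.extend(self.block_tables.pop(seq_id, []))
```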
vLLM on TPU timeline — new features (all now available): Multi-query Attention (TPU-friendly), Ragged Paged Attention (TPU-friendly), prefix caching, chunked prefill, Multi-LoRA + Prometheus metrics, multimodal model support, quantized models, and the V1 refactor (executor performance).

Bottleneck                          Optimization                                  Benefit
Prefill token redundancy            Prefix caching                                Faster prefill
No parallelism in prefill/decode    Chunked prefill                               Overlapped prefill and decode
Slow KV cache writes                SparseCore offload + cache layout redesign    Faster prefill/decode
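Two of the optimizations above, chunked prefill and continuous batching (both revisited below), compose naturally in a single scheduler step: rather than stalling decode while a long prompt prefills, each iteration packs one decode token per running sequence plus a bounded chunk of prefill tokens into the same batch. A minimal sketch under assumed knobs (TOKEN_BUDGET and CHUNK are illustrative, not any framework's real parameters):

```python
# Illustrative scheduler step combining continuous batching with chunked
# prefill: each iteration mixes decode work for in-flight sequences with a
# bounded slice of prefill work from waiting prompts, keeping the TPU busy.
TOKEN_BUDGET = 512   # max tokens processed per step (assumed knob)
CHUNK = 128          # prefill chunk size (assumed knob)

def schedule_step(running: list[dict], waiting: list[dict]) -> list[tuple]:
    work, budget = [], TOKEN_BUDGET
    # Decode: every running sequence contributes exactly one token.
    for seq in running:
        if budget == 0:
            break
        work.append((seq["id"], "decode", 1))
        budget -= 1
    # Chunked prefill: spend the leftover budget on waiting prompts.
    for seq in waiting:
        if budget == 0:
            break
        remaining = seq["prompt_len"] - seq["prefilled"]
        if remaining <= 0:
            continue  # fully prefilled; joins `running` next step
        take = min(CHUNK, remaining, budget)
        work.append((seq["id"], "prefill", take))
        seq["prefilled"] += take
        budget -= take
    return work  # one mixed prefill/decode batch per step
```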
APC (Automatic Prefix Caching)
Reuses previously computed results, eliminating redundant processing; cached entries live in tiered memory (HBM and host memory). Typical scenarios:
- multi-turn conversations;
- repeated Q&A over a technical document, book, or video;
- repeated requests sharing a common prompt prefix.
APC reduces TTFT and raises throughput, ultimately cutting cost while improving efficiency; in cache-hit-heavy workloads, system throughput improves markedly.
https:/
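In vLLM, automatic prefix caching is switched on with the engine's enable_prefix_caching flag. A sketch of the document-Q&A scenario above (the model name and file are placeholders):

```python
from vllm import LLM, SamplingParams

# Enable automatic prefix caching so a repeated prompt prefix (here, a
# document that several questions are asked about) is computed once and
# its KV blocks reused. Model name is a placeholder.
llm = LLM(model="meta-llama/Llama-3.1-8B-Instruct",
          enable_prefix_caching=True)

doc = open("manual.txt").read()          # shared prefix across requests
params = SamplingParams(temperature=0.0, max_tokens=128)

# After the first request, subsequent calls hit the cached KV blocks for
# `doc`, which is what cuts TTFT in cache-hit-heavy workloads.
for question in ["How do I install it?", "How do I uninstall it?"]:
    out = llm.generate([doc + "\n\nQ: " + question + "\nA:"], params)
    print(out[0].outputs[0].text)
```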
Chunked Prefill
Continuous Batching

03 TPU hardware characteristics: how do they differ from GPU?
A v6e pod supports models at the trillion-parameter scale. Google's high-speed interconnect makes Cloud TPU Pods an AI supercomputer that is as easy to use as a single computer, with all-reduce performed by the hardware itself (e.g., across the 16 chips of a slice).
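"All-reduce by hardware" is what lets a pod be programmed like one machine: in JAX, for example, a cross-chip reduction is a single collective that is lowered onto the inter-chip interconnect. A minimal sketch (the axis name "i" and the toy data are illustrative):

```python
import jax
import jax.numpy as jnp

n = jax.device_count()  # e.g. 16 chips in one slice

# One program replica per chip; lax.psum is the all-reduce, executed in
# hardware over the inter-chip interconnect rather than via host transfers.
all_reduce = jax.pmap(lambda x: jax.lax.psum(x, axis_name="i"), axis_name="i")

shards = jnp.ones((n, 4))   # leading axis = one shard per chip
print(all_reduce(shards))   # every chip now holds the elementwise sum (= n)
```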
Versatile and high-performance model serving: a range of slice shapes to match model size and complexity.

*Google internal data, August 2023. Batch size = 1; multi-head-attention-based decoder-only language models with prefix length = 2048, decode steps = 256, and beam size = 32 for sampling.

VM Shape    1    2x2    2x4    4x4    4x8    8x8    8x16    16x16
Chips/S