Long-Context LLM Inference in Practice: A KVCache-Centric Disaggregated Inference Architecture
Speaker: 唐飞虎, R&D Engineer and Head of Developer Relations, Moonshot AI (月之暗面)

Contents
01 Bottlenecks of long-context inference
02 Optimizing long-context inference
03 Applications of context caching
04 Mooncake in practice

Bottlenecks of Long-Context Inference

RAG
Pros:
- No additional training required
- Fast
- Low cost
- Mature engineering practice
- Multi-stage retrieval pipelines are easy to design
Cons:
- Answer quality depends directly on embedding recall quality
- Cannot handle complex logic
- Limited multimodal support

Long-Context
Pros:
- No additional training required
- Takes the full context into account more comprehensively
- Can handle complex logic and dependencies
Cons:
- Expensive and slow
- Context length is limited
Long context: somewhat expensive
Long context: somewhat slow

Long-Context performance bottlenecks
- Concurrency drops roughly in inverse proportion to context length.
- Prefill latency grows quadratically with context length.
- Decode latency and context-switching overhead grow linearly with context length.
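A rough back-of-the-envelope sketch of why these curves look the way they do: with full self-attention, prefill touches every pair of prompt tokens, while each decode step only attends to the tokens already in the KV cache. The model dimensions below are illustrative assumptions, not Moonshot or Mooncake measurements.

```python
# Minimal sketch (not Mooncake code): prefill attention cost grows ~quadratically with
# context length L, while one decode step with a KV cache grows ~linearly with L.
# Hidden size and layer count are illustrative assumptions.

def attention_flops_prefill(seq_len: int, hidden: int, layers: int) -> int:
    # Prefill: every prompt token attends to every other token,
    # so the score/value matmuls cost roughly 2 * L^2 * d per layer (MLP and constants ignored).
    return layers * 2 * seq_len * seq_len * hidden

def attention_flops_decode_step(seq_len: int, hidden: int, layers: int) -> int:
    # Decode: one new token attends only to the L cached tokens, roughly 2 * L * d per layer.
    return layers * 2 * seq_len * hidden

if __name__ == "__main__":
    hidden, layers = 8192, 80  # illustrative model size
    for L in (8_000, 32_000, 128_000):
        print(f"L={L:>7}: prefill ~{attention_flops_prefill(L, hidden, layers):.2e} FLOPs, "
              f"one decode step ~{attention_flops_decode_step(L, hidden, layers):.2e} FLOPs")
```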
Optimizing Long-Context Inference

Long-Context inference optimization
- Hardware: A100 memory hierarchy
- ML systems engineering: FlashAttention, vLLM
- Model architecture: MoE, speculative decoding

Long-Context inference optimization: Layer
- Confident Adaptive Language Modeling, 2022
- CoLT5: Faster Long-Range Transformers with Conditional Computation, 2023
- LayerSkip: Enabling Early Exit Inference and Self-Speculative Decoding, 2024
- You Only Cache Once: Decoder-Decoder Architectures for Language Models, 2024
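The layer-axis papers above share the idea of not running every decoder layer for every token. As a rough illustration of that idea only (not the algorithm from any specific paper), a confidence-based early exit can be sketched as follows; `layers`, `lm_head`, and the 0.9 threshold are hypothetical.

```python
# Sketch of confidence-based early exit across decoder layers (illustrative, batch size 1).
import torch

def forward_with_early_exit(hidden, layers, lm_head, threshold=0.9):
    """Run decoder layers one by one; stop as soon as the intermediate
    prediction is confident enough, skipping the remaining layers."""
    token = None
    for i, layer in enumerate(layers):
        hidden = layer(hidden)
        # Predict the next token from the current intermediate hidden state.
        probs = torch.softmax(lm_head(hidden[:, -1, :]), dim=-1)
        confidence, token = probs.max(dim=-1)
        if confidence.item() >= threshold:   # confident enough: exit early
            return token, i + 1              # token id and number of layers actually run
    return token, len(layers)                # fell through: all layers were run
```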
Long-Context inference optimization: Head
- GQA: Training Generalized Multi-Query Transformer Models from Multi-Head Checkpoints, 2023
- Retrieval Head Mechanistically Explains Long-Context Factuality, 2024
- DeepSeek-V2: A Strong, Economical, and Efficient Mixture-of-Experts Language Model, 2024
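A minimal sketch of the head-sharing idea behind GQA: several query heads attend against one shared K/V head, so the KV cache shrinks by a factor of n_heads / n_kv_heads. Shapes and head counts below are illustrative assumptions, and the causal mask is omitted for brevity.

```python
# Sketch of grouped-query attention (GQA): query heads share K/V heads to shrink the KV cache.
import torch

def gqa_attention(q, k, v, n_kv_heads):
    """q: [batch, n_heads, seq, d]; k, v: [batch, n_kv_heads, seq, d]."""
    b, n_heads, s, d = q.shape
    group = n_heads // n_kv_heads
    # Expand each shared KV head to its group of query heads; the cache itself stays small.
    k = k.repeat_interleave(group, dim=1)
    v = v.repeat_interleave(group, dim=1)
    scores = q @ k.transpose(-2, -1) / d ** 0.5   # causal mask omitted for brevity
    return torch.softmax(scores, dim=-1) @ v

# With 32 query heads and 8 KV heads, the KV cache is 4x smaller than full multi-head attention.
out = gqa_attention(torch.randn(1, 32, 16, 128), torch.randn(1, 8, 16, 128),
                    torch.randn(1, 8, 16, 128), n_kv_heads=8)
```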
Long-Context inference optimization: Hidden
- KIVI: A Tuning-Free Asymmetric 2bit Quantization for KV Cache, 2024
- WKVQuant: Quantizing Weight and Key/Value Cache for Large Language Models Gains More, 2024
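The hidden-axis work compresses the KV cache by storing K/V activations in low-bit form. The sketch below shows generic asymmetric low-bit quantization of a cached K tensor to illustrate the idea; it is not the exact algorithm of KIVI or WKVQuant, and the bit width and tensor shape are illustrative assumptions.

```python
# Sketch of asymmetric low-bit quantization for a KV-cache tensor (illustrative only).
import torch

def quantize_asym(x: torch.Tensor, bits: int = 2):
    """Per-row asymmetric quantization: int codes plus a (scale, zero-point) pair per row."""
    qmax = (1 << bits) - 1
    x_min = x.min(dim=-1, keepdim=True).values
    x_max = x.max(dim=-1, keepdim=True).values
    scale = (x_max - x_min).clamp(min=1e-8) / qmax
    codes = ((x - x_min) / scale).round().clamp(0, qmax).to(torch.uint8)
    return codes, scale, x_min

def dequantize(codes, scale, zero):
    return codes.to(scale.dtype) * scale + zero

k = torch.randn(1024, 128)  # a cached K tensor: [tokens, head_dim]
codes, scale, zero = quantize_asym(k, bits=2)
print("mean reconstruction error:", (dequantize(codes, scale, zero) - k).abs().mean().item())
```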
Long-Context inference optimization: Token
- H2O: Heavy-Hitter Oracle for Efficient Generative Inference of Large Language Models, 2023
- Model Tells You What to Discard: Adaptive KV Cache Compression for LLMs, 2023
- Dynamic Memory Compression: Retrofitting LLMs for Accelerated Inference, 2024
- SnapKV: LLM Knows What You are Looking for Before Generation, 2024
- TriForce: Lossless Acceleration of Long Sequence Generation with Hierarchical Speculative Decoding, 2024

Mooncake in Practice

Some basic concepts
Prefill: in the prefill stage, each new token ...