SGLang: An Efficient Open-Source Deployment Solution for the Inference Engine (SGLang 推理引擎高效的开源部署方案)
Speaker: Yin Liangsheng (尹良升)

Agenda:
01 SGLang Milestones and Features Overview
02 Speculative Decoding and Constrained Decoding in SGLang
03 Efficient Design and Implementation of PD Disaggregation
04 Large-scale EP Support for DeepSeek Blog Reproduction
05 Hierarchical Caching Design in SGLang
06 The Ecosystem of SGLang

SGLang is a fast serving engine for LLMs and VLMs. Among fully open-source LLM inference engines, SGLang currently achieves state-of-the-art (SOTA) performance, and it is the first open-source implementation to nearly match the throughput reported in the official DeepSeek blog at large scale. Meanwhile, its elegant, lightweight, and customizable design has attracted wide adoption by academics, big tech companies, and startups (xAI, NVIDIA, AMD, Baseten, Microsoft, LinkedIn, etc.). In on-policy RLHF, inference engines are crucial for efficient policy-model execution, and SGLang excels as a high-performance solution.

01 SGLang Milestones and Features Overview

SGLang Milestones and Features
- 2023/12 - 2024/02: Initial motivation, structured LM programming, prefix caching, and constrained decoding
- 2024/07: Leading performance among inference engines on Llama 3
- 2024/09: v0.3 release: 7x faster DeepSeek MLA, 1.5x faster torch.compile, multi-image/video LLaVA-OneVision
- 2024/12: v0.4 release: zero-overhead batch scheduler, cache-aware DP router, XGrammar integration, the first to serve DeepSeek V3
- 2025/01: SGLang provides day-one support for DeepSeek V3/R1 models on NVIDIA and AMD GPUs with DeepSeek-specific optimizations (adopted by 10+ companies)
- 2025/05: First open-source implementation of DeepSeek V3/R1 expert parallelism with prefill-decode disaggregation; achieves 52.3K input tokens/s and 22.3K output tokens/s on 96 GPUs, 5x faster than vanilla TP

SGLang has seen extensive adoption and serves as the dominant inference engine for AMD and the default inference engine for xAI.
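The milestones above list prefix caching as one of SGLang's earliest features. The idea is that requests sharing a common prompt prefix (e.g. the same system prompt) can reuse the KV state already computed for that prefix instead of prefilling it again. The sketch below is a deliberately simplified, toy illustration of this idea using a radix-style token tree; all class and method names are hypothetical and do not reflect SGLang's actual implementation (which manages real GPU KV-cache pages):

```python
class PrefixCacheNode:
    """A node in a toy radix-style prefix tree. Each path from the root
    represents a token sequence whose (simulated) KV state may be cached."""
    def __init__(self):
        self.children = {}   # token_id -> PrefixCacheNode
        self.cached = False  # whether KV for the path ending here is cached


class PrefixCache:
    """Toy prefix cache: records which token prefixes have cached KV state."""
    def __init__(self):
        self.root = PrefixCacheNode()

    def match_prefix(self, tokens):
        """Return how many leading tokens already have cached KV state."""
        node, matched = self.root, 0
        for i, t in enumerate(tokens):
            child = node.children.get(t)
            if child is None or not child.cached:
                break
            node, matched = child, i + 1
        return matched

    def insert(self, tokens):
        """Record that KV state for this token sequence is now cached."""
        node = self.root
        for t in tokens:
            node = node.children.setdefault(t, PrefixCacheNode())
            node.cached = True


cache = PrefixCache()
system_prompt = [1, 2, 3, 4]            # shared system-prompt token IDs
cache.insert(system_prompt + [10, 11])  # first request populates the cache

# A second request sharing the system prompt reuses the 4 cached prefix
# tokens and only needs to prefill its 2 new tokens.
request = system_prompt + [20, 21]
print(cache.match_prefix(request))  # -> 4
```

In a real engine the payoff is that prefill compute scales with the *uncached* suffix length only, which is why cache-aware routing (sending requests with shared prefixes to the same worker) appears alongside prefix caching in the v0.4 milestone.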