how-fast-can-your-model-composition-run-in-serverless-inference-zha-nfyuan-nano-jlia-ai-chan-dyags-fog-dong-bentoml-wenbo-qi-ant-group.pptx

Uploaded by: 山海

ID: 627380

2025-03-21

PPTX, 26 pages, 16.92 MB


How Fast Can Your Model Composition Run in Serverless Inference?
Fog Dong, BentoML & Wenbo Qi, Ant Group

Why Model Composition?
Typical AI apps need more than one model: RAG example
AI Models in a RAG System

Why Serverless?
Cost Efficiency
- Pay only for actual compute time used
- No idle resource costs
Auto-scaling
- Seamless handling of varying workloads
- Instant scale-up/down based on demand
Flexibility and Portability
- Easy to deploy across different environments
- Less infrastructure management

Build & Scale Compound AI System
Application Code

Requirements for Compound AI Apps:
- Standardized Interfaces and Communication Protocols: Unified APIs and data formats for seamless inter-model communication, regardless of individual scaling. Ensures compatibility and effective collaboration across different scaling levels.
- Independently Scalable Model Architecture: Each model can scale horizontally or vertically independently. Includes decoupled definitions, deployments, and resource allocations. Allows dynamic resource adjustment based on individual model needs without affecting others.
- Distributed State Management and Data Flow Orchestration: Mechanisms for managing and synchronizing states across differently scaled instances. Defines efficient data flows between models at various scales, maintaining consistency and integrity during scaling operations.

Application Code, Model & Code, Bento, Build, https:/, source
Build models like microservices

Model Inference: Benchmarking
- Best time-to-first-token (TTFT) among all inference backends, concurrent user levels, and model parameter sizes.
- Extensive model architecture and hardware support.
- Easy to integrate with under 50 lines of Python code.
- Comprehensive documentation and examples.
- Robust open source community.

Inference Backends: vLLM, LMDeploy, MLC-LLM, TensorRT-LLM, and TGIM
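The benchmarking slide compares backends by time-to-first-token (TTFT). As a rough illustration of what that metric captures, here is a minimal standard-library sketch that times a streamed response. It is not the methodology used in the slides: `measure_ttft` and `fake_backend` are hypothetical names, and `fake_backend` only simulates a streaming backend with `time.sleep` rather than calling any of the listed inference engines.

```python
import time
from typing import Callable, Iterable, Iterator, Optional, Sequence, Tuple


def measure_ttft(
    stream: Iterable[str],
    clock: Callable[[], float] = time.perf_counter,
) -> Tuple[Optional[float], float, int]:
    """Return (ttft_seconds, total_seconds, token_count) for one streamed response.

    ttft_seconds is None when the stream yields no tokens at all.
    """
    start = clock()
    ttft: Optional[float] = None
    count = 0
    for _ in stream:
        if ttft is None:
            # The first token has arrived: this delay is the TTFT.
            ttft = clock() - start
        count += 1
    total = clock() - start
    return ttft, total, count


def fake_backend(
    tokens: Sequence[str],
    first_delay: float = 0.05,
    step_delay: float = 0.01,
) -> Iterator[str]:
    """Hypothetical stand-in for a streaming inference backend (assumption)."""
    for i, token in enumerate(tokens):
        # A longer first delay loosely mimics prompt processing before decoding.
        time.sleep(first_delay if i == 0 else step_delay)
        yield token


if __name__ == "__main__":
    ttft, total, n = measure_ttft(fake_backend(["Hello", ",", " world"]))
    print(f"tokens={n} ttft={ttft:.3f}s total={total:.3f}s")
```

In a real comparison the generator would be replaced by a streaming client for the backend under test, and the measurement repeated across the concurrency levels and model sizes mentioned in the slide. The `clock` parameter is injectable only so the function can be checked deterministically.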

