CNS419
Optimize GenAI inference and model performance on Amazon EKS

Elamaran Shanmugam (Ela) (he/him)
Sr. Specialist Solutions Architect, Containers GFS
AWS

Eng-Hwa Tan (he/him)
Pr. GTM SSA, Containers ASEAN
AWS

© 2025, Amazon Web Services, Inc. or its affiliates. All rights reserved.

Agenda
- Inference on EKS
- Key Challenges in LLM Serving
- Distributed Inference Architecture
- The optimization journey
- Q/A

Your Inferencing today
- How long does it take to load your first token?
- Is your GPU utilization under 40% while inferencing?
- 20+ mins
- Today

What we want your inferencing to be?
- Reduce Time To First Token
- Best use of your GPUs!
- Better Model Load Times
- Better GPU Resource Management
- Advanced techniques
- Getting best out of Inference frameworks
- Today

Journey to Inference Optimization
- Model Loading optimization
- GPU Resource Management
- Advanced Components
- Inference framework considerations

Key Challenges Inferencing at Production Scale

Stack:
- App: invoking the model
- Model: weights, configuration file, tokenizer
- HuggingFace Transformers: Inference library
- PyTorch: Deep learning framework
- Python Runtime: Base execution environment
- Container Image
- Node

Challenges:
- Long startup time: 20+ minutes
- Less than 40% GPU utilization
- Unpredictable scaling behavior
- High costs
- Slow Token Generation: no optimized kernels
- Not memory efficient
- Missing operational features (scaling, monitoring)

Inferenc