© 2025, Amazon Web Services, Inc. or its affiliates. All rights reserved.

Any model can learn. Only deployed models can earn.
Inference is the AI Endgame

Mastering LLM inference on Amazon SageMaker AI
AIM3321
Ioan Catana, Sr Specialist Solutions Architect, AI/ML
Christian Kamwangala, Specialist Solutions Architect, AI/ML

Agenda
Challenges with hosting LLMs at scale
SageMaker AI inference recap
Inference optimization techniques on Amazon SageMaker AI
Actions and resources

Why inference matters
Spend: 90% prediction, 10% training
Predictions drive complexity and cost in production

How LLM inference works
[Diagram: KV attention matrices]
PROMPT: “Who created Alexa devices in 2014?”
PREFILL (compute bound): the prompt tokens (Who, created, Alexa, devices, in, 2014, ?) pass through attention. UPDATE KEY VALUE CACHE.
DECODE (memory bandwidth bound): READ K/V CACHE. Output: Amazon

Transformers are slow!
Autoregressive decoding, long input/output sequences, memory IO intensive

Large memory footprint
Hundreds of billions of model parameters, often exceeding a single accelerator chip's memory.

Performance tuning
Performance tuning requires subject matter ex