© 2025, Amazon Web Services, Inc. or its affiliates. All rights reserved.

Scaling Observability for the AI Era: from GPUs to LLMs
Ryan Peirce
He/Him
Sales Engineer
Chronosphere
AIM121-S

What we're going to talk about:
  1. Intro to AI observability
  2. How to monitor non-deterministic agents and products
  3. Model training and GPU bottlenecks
  4. Inference hosting and AI outages

We hope you leave with:
  An understanding of what AI observability is
  A head start on your AI observability strategy
  The knowledge of how to avoid common AI risks and problems
  20 mins well spent

What's the market doing with AI?
  Feature Builders: Adding AI features to existing products and workflows
  AI-Natives: Built entirely around AI capabilities; their core product IS the AI
  Model Builders: Build, train, and host the large language models that everyone else uses
  GPU Providers: Provide GPU compute infrastructure that powers all AI workloads

AI Chatbot
Observability was already hard and AI just adds complexity
  A whole new set of AI-specific challenges
  All the challenges of large-scale Cloud Native workloads

Observability Challenges for AI Workloads

Existing CN Challenges
  Massive scale: Billions of requests, petabyte data volumes
  Mission-critical reliability: Zero-downtime expectations
  High performance: Sub-second response requirements
  System and troubleshooting complexity: Microservices, distributed architectures, correlation
  Observability costs and data volume: Tool sprawl, data retention, license fees, data growth
  High cardinality: Infinite label combinations, dimension explosion

New AI-Specific Challenges
  Model behavior issues: Drift, bias, hallucinations, toxicity
  Token economics: Usage tracking, cost optimization, budget overruns
  Complex dependencies: Multi-step workflows, RAG pipelines, agent chains
  GPU infrastructure: Utilization, queuing, resource contention
  Model performance: La