Detecting and overcoming GPU failures in ML training

Who we are
- Sarah Belghiti, ML Platform Engineer
  Wayve: Pioneering a new way to solve self-driving with Embodied AI. Wayve's AI technology empowers vehicles to perceive, predict and progress through dynamic environments.
- Ganeshkumar Ashokavardhanan, Software Engineer
  Azure Kubernetes Service: Deploy and scale containers on managed Kubernetes. Simplify Kubernetes operations, build cloud-native apps, and innovate with AI and open-source technology.

Agenda
- Introduction
- GPU failures
- Demo
- Detection
- Resolution
- Impact on ML workloads

Introduction: Why K8s for ML training?
- Distributed systems: enables multi-node training
- Job scheduling tools (Volcano, Kubeflow, Kueue): gang scheduling, plugin for AI frameworks
- GPU device plugin/operator: easy to manage access to relevant hardware components

GPU failures: What are GPU failures?
- GPUs are complex pieces of hardware, with a much higher likelihood of failure than CPUs.
- The Llama 3 Herd of Models (Meta, July 2024): GPU issues caused 58.7% of unexpected interruptions.
- Commonly affected components: GPU, GPU memory, networking, GPU driver

GPU failures: Impact on ML workloads
- Gradient synchronisation
- Failures, hangs, slowdowns
- EXPENSIVE ($)
- Example of an H100 node (8 GPUs):
  Public price: $2,000/day
  For a job that normally runs for 3 days, 6x slower means 18 days instead of 3: 15 extra days, +$30,000

Detection: Goals
- Prevent workloads from being scheduled on faulty nodes
- Remove jobs from faulty nodes during training
- "Repair" or replace faulty nodes

Detection: Solution 1: node readiness (detecting failure before the job starts)
- Runs as an init container in the training job pods
- Example checks: GPU bandwidth, NCCL tests, GPU errors
- Health checks succeed: the training container can start.
- Health checks fail: the node is tainted and the pod terminates.
- Benefit: workloads do