Resilient AI: Building Fault-Tolerant AI Systems
Dan Rabinovitsj, VP, Engineering, Meta

Artificial intelligence (AI) is having quite a moment: AI-enabled creation tools, text-to-image generation ("A hedgehog playing chess"), and Large Language Models (LLMs) such as Llama 3.1. (Source: Meta for Business, Culture Rising: 2023 Trends Report.)

Llama 3.1 pushed our model training to new heights, leveraging a significantly optimized full training stack.

Trained at unprecedented scale:
16K H100 GPUs used to train Llama 3.1 405B
15T tokens

The Challenge of Scale: Llama 3's Infrastructure
2022: 6K-GPU clusters, job size 128-512 GPUs
2023: 16-24K-GPU clusters, job size 16K GPUs
AI jobs at scale: massive change in 2023.
AI jobs at scale today (2024): Llama, software infra, physical infrastructure.

Training Scale is Not Linear!
[Chart: throughput vs. number of GPUs, not scaling linearly]

One Small Step
Interruptions
[Chart: interruptions per hour (0-30) vs. number of GPUs (25k to 1 million)]
More GPUs equals more failures.
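The shape of that interruption curve can be sketched with a simple independence assumption (my illustration, not a figure from the talk): if each GPU fails independently at a roughly constant rate, the rate of job-level interruptions grows about linearly with the number of GPUs in the job. The per-GPU MTBF below is a placeholder, not a Meta measurement.

```python
# Back-of-the-envelope interruption model. Assumptions (illustrative, not from the talk):
# every GPU fails independently at a constant rate, and any single GPU failure
# interrupts the whole job.

GPU_MTBF_HOURS = 50_000  # placeholder per-GPU mean time between failures

def job_interruptions_per_hour(num_gpus: int,
                               gpu_mtbf_hours: float = GPU_MTBF_HOURS) -> float:
    """Expected job-level interruptions per hour for a job spanning num_gpus GPUs."""
    return num_gpus / gpu_mtbf_hours

for n in (16_000, 100_000, 1_000_000):
    rate = job_interruptions_per_hour(n)
    print(f"{n:>9,} GPUs: {rate:5.2f} interruptions/hour, "
          f"~{1.0 / rate:.1f} hours between interruptions")
```

Under these placeholder numbers, a 16K-GPU job sees an interruption every few hours, while a million-GPU job would be interrupted many times per hour, which is the non-linear pain the chart conveys.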
Roadmap to Resilient AI: Metrics-Driven Outcomes

Effective Training Time = 1 - (E2E Job Restart Overhead + Lost Training Progress) / Total Expected Training Time
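To make the metric concrete, here is a small worked example; the function name and the hour figures are hypothetical, chosen only to illustrate the formula above.

```python
# Effective Training Time
#   = 1 - (E2E job restart overhead + lost training progress) / total expected training time
# All numbers below are hypothetical, chosen only to show how the metric reads.

def effective_training_time(restart_overhead_hours: float,
                            lost_progress_hours: float,
                            total_training_hours: float) -> float:
    """Fraction of the planned training time that actually advances the model."""
    wasted = restart_overhead_hours + lost_progress_hours
    return 1.0 - wasted / total_training_hours

# A 1,000-hour run that spends 30 hours restarting and loses 20 hours of progress
# rolled back to the last good checkpoint keeps 95% of its effective training time.
print(f"{effective_training_time(30.0, 20.0, 1000.0):.1%}")  # 95.0%
```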
Resilient AI: Building Fault-Tolerant AI Systems
AI Infrastructure is fundamentally different (Traditional Infrastructure vs. AI Training Infrastructure):

Time to Repair: Days vs. Hours
Degradation Signature: Linear/Graceful vs. Non-Linear/Sudden
Traffic Flows: Mice vs. Elephant
Distributed Comms Orchestrator: N/A vs. Critically Necessary
Workloads: Millions vs. One
Impact of Failures: Minimal vs. Severe

The primary causes of these interruptions include:
Hardware Reliability: GPUs are inherently less reliable than CPUs, and a single GPU failure can halt the entire training process.
Network Complexity: The network infrastructure is complex, and debugging network issues is time-consuming.
Software Configuration: GPU-related software requires extensive configuration, which can be error-prone.

Hardware fail