Resilient AI: Building Fault-Tolerant AI Systems
Dan Rabinovitsj, VP, Engineering, Meta

Artificial intelligence (AI) is having quite a moment: AI-enabled creation tools, text-to-image generation ("A hedgehog playing chess"), and Large Language Models (LLMs) such as Llama 3.1. (Source: Meta for Business, Culture Rising: 2023 Trends Report.)

Llama 3.1 pushed our model training to new heights, leveraging a significantly optimized full training stack.

Trained at unprecedented scale:
16K H100 GPUs used to train Llama 3.1 405B
15T tokens

The Challenge of Scale: Llama 3's Infrastructure
2022: 6K-GPU clusters, job size 128-512 GPUs
2023: 16-24K-GPU clusters, job size 16K GPUs
AI jobs at scale: massive change in 2023.
AI jobs at scale today (2024): Llama, software infra, physical infrastructure.

Training Scale is Not Linear!
[Chart: throughput vs. number of GPUs, not scaling linearly]

One Small Step
Interruptions
[Chart: interruptions per hour (0-30) vs. number of GPUs (25k to 1 million)]
More GPUs equals more failures.
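The shape of that interruption curve can be sketched with a simple independence assumption (my illustration, not a figure from the talk): if each GPU fails independently at a roughly constant rate, the rate of job-level interruptions grows about linearly with the number of GPUs in the job. The per-GPU MTBF below is a placeholder, not a Meta measurement.

```python
# Back-of-the-envelope interruption model. Assumptions (illustrative, not from the talk):
# every GPU fails independently at a constant rate, and any single GPU failure
# interrupts the whole job.

GPU_MTBF_HOURS = 50_000  # placeholder per-GPU mean time between failures

def job_interruptions_per_hour(num_gpus: int,
                               gpu_mtbf_hours: float = GPU_MTBF_HOURS) -> float:
    """Expected job-level interruptions per hour for a job spanning num_gpus GPUs."""
    return num_gpus / gpu_mtbf_hours

for n in (16_000, 100_000, 1_000_000):
    rate = job_interruptions_per_hour(n)
    print(f"{n:>9,} GPUs: {rate:5.2f} interruptions/hour, "
          f"~{1.0 / rate:.1f} hours between interruptions")
```

Under these placeholder numbers, a 16K-GPU job sees an interruption every few hours, while a million-GPU job would be interrupted many times per hour, which is the non-linear pain the chart conveys.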
Roadmap to Resilient AI: Metrics-Driven Outcomes

Effective Training Time = 1 - (E2E Job Restart Overhead + Lost Training Progress) / Total Expected Training Time
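To make the metric concrete, here is a small worked example; the function name and the hour figures are hypothetical, chosen only to illustrate the formula above.

```python
# Effective Training Time
#   = 1 - (E2E job restart overhead + lost training progress) / total expected training time
# All numbers below are hypothetical, chosen only to show how the metric reads.

def effective_training_time(restart_overhead_hours: float,
                            lost_progress_hours: float,
                            total_training_hours: float) -> float:
    """Fraction of the planned training time that actually advances the model."""
    wasted = restart_overhead_hours + lost_progress_hours
    return 1.0 - wasted / total_training_hours

# A 1,000-hour run that spends 30 hours restarting and loses 20 hours of progress
# rolled back to the last good checkpoint keeps 95% of its effective training time.
print(f"{effective_training_time(30.0, 20.0, 1000.0):.1%}")  # 95.0%
```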
Resilient AI: Building Fault-Tolerant AI Systems
AI Infrastructure is fundamentally different (Traditional Infrastructure vs. AI Training Infrastructure):

Time to Repair: Days vs. Hours
Degradation Signature: Linear/Graceful vs. Non-Linear/Sudden
Traffic Flows: Mice vs. Elephant
Distributed Comms Orchestrator: N/A vs. Critically Necessary
Workloads: Millions vs. One
Impact of Failures: Minimal vs. Severe

The primary causes of these interruptions include:
Hardware Reliability: GPUs are inherently less reliable than CPUs, and a single GPU failure can halt the entire training process.
Network Complexity: The network infrastructure is complex, and debugging network issues is time-consuming.
Software Configuration: GPU-related software requires extensive configuration, which can be error-prone.

Hardware fail