Resilient AI: Building Fault-Tolerant AI Systems.pdf

No. 464892 | PDF | 35 pages | 2.17 MB


Dan Rabinovitsj, VP, Engineering, Meta

Resilient AI: Building Fault-Tolerant AI Systems

Artificial intelligence (AI) is having quite a moment
- AI-enabled creation tools
- Text-to-image generation: "A hedgehog playing chess"
- Large Language Models (LLMs): Llama 3.1
Source: Meta for Business. Culture Rising: 2023 Trends Report. 2023.

Trained at unprecedented scale
"... pushed our model training to new heights, leveraging a significantly optimized full training stack"
- 16K H100 GPUs used to train Llama 3.1 405B
- 15T tokens

The Challenge of Scale: Llama 3's Infrastructure
- 2022: 6K clusters, job size 128-512 GPUs
- 2023: 16-24K clusters, job size 16K GPUs

AI jobs at scale: massive change in 2023

AI jobs at scale: Today
[Diagram: Llama on top of Software Infra and Physical Infrastructure, 2024]

Training Scale is Not Linear!
[Chart: Throughput vs. # of GPUs, not scaling linearly]

One Small Step
[Chart: Interruptions per Hour (0 to 30) vs. Number of GPUs (25k to 1 million)]
More GPUs equals more failures.

Roadmap to Resilient AI: Metrics Driven Outcomes

Effective Training Time = 1 - (E2E Job Restart Overhead + Lost Training Progress) / Total Expected Training Time

AI Infrastructure is fundamentally different

                               Traditional Infrastructure   AI Training Infrastructure
Time to Repair                 Days                         Hours
Degradation Signature          Linear/Graceful              Non-Linear/Sudden
Traffic Flows                  Mice                         Elephant
Distributed Comms Orchestrator N/A                          Critically Necessary
Workloads                      Millions                     One
Impact of Failures             Minimal                      Severe

The primary causes of these interruptions include:
- Hardware Reliability: GPUs are inherently less reliable than CPUs, and a single GPU failure can halt the entire training process.
- Network Complexity: The network infrastructure is complex, and debugging network issues is time-consuming.
- Software Configuration: GPU-related software requires extensive configuration, which can be error-prone.

Hardware fail...
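As a rough illustration of the Effective Training Time metric from the roadmap slide, the sketch below computes 1 minus the share of total expected training time that goes to end-to-end job restart overhead and lost training progress. The function names, the assumption that each interruption loses on average half a checkpoint interval of progress, and all numbers are hypothetical and do not come from the slides.

```python
def effective_training_time(total_expected_hours: float,
                            restart_overhead_hours: float,
                            lost_progress_hours: float) -> float:
    """Return 1 - (restart overhead + lost progress) / total expected time."""
    if total_expected_hours <= 0:
        raise ValueError("total_expected_hours must be positive")
    wasted = restart_overhead_hours + lost_progress_hours
    return 1.0 - wasted / total_expected_hours


def wasted_hours(interruptions: int,
                 restart_hours_each: float,
                 checkpoint_interval_hours: float) -> tuple[float, float]:
    """Estimate (restart overhead, lost progress) in hours.

    Assumption (not from the slides): an interruption lands, on average,
    halfway between two checkpoints, so half an interval is lost each time.
    """
    restart = interruptions * restart_hours_each
    lost = interruptions * checkpoint_interval_hours / 2.0
    return restart, lost


# Hypothetical job: 1000 h expected, 40 interruptions,
# 15 min to restart each time, checkpoints every hour.
restart, lost = wasted_hours(40, 0.25, 1.0)
ratio = effective_training_time(1000.0, restart, lost)
print(f"restart={restart:.1f} h, lost={lost:.1f} h, effective={ratio:.2%}")
```

Under these assumptions, shortening the checkpoint interval or the restart time raises the ratio directly, which is one way to read the slide's framing of resilience as a metrics-driven outcome.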
