
Building Resilient GPU Fabrics for Demanding AI_ML Workloads [LRN1415].pdf

Uploaded by: Fl****zo | ID: 971042 | 2025-11-08 | 50 pages | 3.36 MB

Building Resilient GPU Fabrics for Demanding AI/ML Workloads

Jag Brar, Kannan Raj, OCI Architect, Rishabh Vardhan Harikrishnan, Principal Engineer, OCI, VP & Distinguished Engineer
October 16, 2025

Safe harbor statement

The following is intended to outline our general product direction. It is intended for information purposes only, and may not be incorporated into any contract. It is not a commitment to deliver any material, code, or functionality, and should not be relied upon in making purchasing decisions. The development, release, timing, and pricing of any features or functionality described for Oracle's products may change and remains at the sole discretion of Oracle Corporation.

Copyright 2025, Oracle and/or its affiliates | Confidential: Internal/Restricted/Highly Restricted

Agenda

1. OCI AI Fabrics
2. AI Workloads are different
3. Tyranny of large numbers
4. Building and Validating Resilient Fabrics
5. Link Flaps and impact
6. Failure modes and mitigation
7. How can the industry help?
8. Look ahead

AI Infrastructure

Key Elements of AI Infrastructure

- GPUs
- Memory
- Power
- Cooling
- Storage
- Network

AI Workloads != Regular Workloads
AI Infrastructure != Regular Infrastructure

AI workloads are different

AI Training

- Network usage is clumpy
- Fewer network flows
- Bigger and faster flows
- Network link failures take 10s of seconds to recover
- Failures have significant impact on AI training
- AI workloads are inherently synchronized
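
The slide text above pairs two observations: AI training is inherently synchronized, and a failed link takes tens of seconds to recover. A little arithmetic shows why that combination becomes painful at scale. The sketch below is illustrative only. It assumes link flaps are independent and arrive at a constant rate (a Poisson model chosen for this example, not something stated in the presentation), and every number in it (link count, flap rate, recovery time, GPU count) is a hypothetical placeholder.

```python
import math


def prob_any_flap(num_links: int, flaps_per_link_per_year: float, window_hours: float) -> float:
    """Probability that at least one link flaps during the window.

    Assumes independent, Poisson-distributed flaps (an assumption of this
    sketch, not a claim from the slides).
    """
    hours_per_year = 365 * 24
    expected_flaps = num_links * flaps_per_link_per_year * window_hours / hours_per_year
    return 1.0 - math.exp(-expected_flaps)


def stalled_gpu_seconds(num_gpus: int, recovery_seconds: float) -> float:
    """GPU-seconds lost if one flap stalls every GPU in a synchronized job."""
    return num_gpus * recovery_seconds


# Hypothetical inputs, chosen only to show how the risk scales with fabric size.
p_small = prob_any_flap(num_links=100, flaps_per_link_per_year=1.0, window_hours=24)
p_large = prob_any_flap(num_links=100_000, flaps_per_link_per_year=1.0, window_hours=24)
lost = stalled_gpu_seconds(num_gpus=10_000, recovery_seconds=30.0)

print(f"P(at least one flap per day), 100 links:     {p_small:.3f}")
print(f"P(at least one flap per day), 100,000 links: {p_large:.3f}")
print(f"GPU-seconds stalled by one 30 s recovery:    {lost:.0f}")
```

With the same per-link flap rate, a small fabric sees a flap on only some days, while a large one sees at least one almost every day, and each flap stalls the whole synchronized job for the duration of the recovery.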

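Based on "Building Resilient GPU Fabrics for Demanding AI/ML Workloads", the key points of the document are summarized below:

1. **What makes AI/ML workloads distinctive**:
   - AI workloads differ from regular workloads: they depend heavily on the network, and network interruptions have a large impact.
   - They require high-bandwidth networking and synchronized operation.
2. **Network challenges**:
   - Network interruptions, such as link flaps, have a severe impact on AI training.
   - Link flaps have several causes, including dust, temperature changes, and component defects.
3. **Solutions**:
   - Greater robustness is needed, including error detection and correction at the physical and link layers.
   - Adaptive traffic load balancing is used to optimize network traffic.
4. **Technical requirements**:
   - Optical components that support higher speeds and are more reliable.
   - New management capabilities, such as proactive diagnostics.
5. **Conclusion**:
   - Availability management for AI infrastructure differs from that for traditional infrastructure.
   - Link flaps need to be remediated through debouncing and by making use of timeout windows.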
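
The last point of the summary, remediating link flaps through debouncing within a timeout window, can be sketched in a few lines. This is a minimal illustration of one possible reading, not the implementation described in the presentation: a link-down event is escalated only if the link stays down longer than a hold-down window, which would be chosen shorter than the training job's communication timeout so that brief flaps are absorbed. The class name `LinkFlapDebouncer`, the `hold_down_s` parameter, and the event methods are all hypothetical.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class LinkFlapDebouncer:
    """Treat a link as failed only if it stays down for at least hold_down_s."""

    hold_down_s: float                   # hypothetical hold-down window, in seconds
    down_since: Optional[float] = None   # timestamp at which the current outage began
    reported_down: bool = False          # whether the current outage was escalated

    def on_link_down(self, now: float) -> None:
        # Record only the first down event of an outage.
        if self.down_since is None:
            self.down_since = now

    def on_link_up(self) -> None:
        # The link recovered: clear the outage and any escalation.
        self.down_since = None
        self.reported_down = False

    def poll(self, now: float) -> bool:
        """Return True if the link should be treated as failed at time `now`."""
        if self.down_since is not None and now - self.down_since >= self.hold_down_s:
            self.reported_down = True
        return self.reported_down


# Usage with explicit timestamps (seconds), so the behavior is deterministic.
debouncer = LinkFlapDebouncer(hold_down_s=2.0)

debouncer.on_link_down(now=0.0)
short_flap_mid = debouncer.poll(now=0.5)     # still inside the hold-down window
debouncer.on_link_up()                       # link recovers after a brief flap
short_flap_after = debouncer.poll(now=3.0)   # flap was absorbed, never escalated

debouncer.on_link_down(now=10.0)
long_outage = debouncer.poll(now=12.5)       # down for 2.5 s, so escalated
```

Passing timestamps in explicitly, rather than reading a clock inside the class, keeps the sketch easy to test. A real system would also have to decide what happens to traffic during the hold-down window, which this example does not model.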