
Evolving FBOSS to Support Generative AI Network Workloads.pdf

Uploaded by: 明**** | ID: 1011419 | 2025-12-21 | 21 pages | 1.97 MB

Evolving FBOSS to Support Generative AI Network Workloads
Meta Platforms Inc., Broadcom Inc.
NETWORKING

Speakers
- Jasmeet Bagga, Software Engineer, Meta Platforms Inc.
- Shrikrishna Khare, Software Engineer, Meta Platforms Inc.
- Mehak Mahajan, Senior Director of Engineering, Broadcom Inc.

Outline
1 Disaggregated Scheduled Fabric
2 Key SDK Enhancements for Gen AI
3 Key FBOSS Enhancements for Gen AI
4 DSF Performance for Gen AI
5 Call to Action

Facebook Open Switching System: FBOSS
- Meta's software stack for managing network switches in Meta's data centers
- One of Meta's largest network services by deployment
- FBOSS uses SAI

Switch Abstraction Interface: SAI
- An OCP project
- Open source API to control switching elements
- Vendor independent

Challenges
- Elephant flows: few, extremely large, continuous flows
- Low entropy: less variation, more likely to cause hash collisions
- Oscillatory behavior during congestion

Solution: DSF (Disaggregated Scheduled Fabric)
- Near-optimal load balancing
- Smoother bandwidth delivery: credit allocation
- Flexibility/optionality for endpoints: fabric performs spray/reassembly

Network Traffic for AI Training
- Non-blocking network for 4K GPUs
- Credit-based congestion control
- Break packet into cells and spray
- Reassembly in hardware
- RDSW = Disaggregated Line Card
- FDSW = Disaggregated Fabric Card

Challenges
- Generative AI requires a much larger number of GPUs
- Jobs spanning multi-K GPUs
- Solution: interconnect 4K GPU clusters using routing and ECMP
- But intra-cluster is non-blocking, while inter-cluster sees elephant flows, low entropy, and an oversubscribed domain

DSF Dual Stage Topology
- Hierarchy of fabric devices, 18K GPU non-blocking cluster
- Defer ECMP decision as late as possible

Network Traffic for Generative AI: DSF Dual Stage Topology
- RDSW = Disaggregated Line Card
- FDSW = Disaggregated Fabric Card
- SDSW = 2nd stage Fabric Card
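The slides contrast per-flow hashing (ECMP), which suffers hash collisions when a few elephant flows carry little entropy, with a fabric that breaks packets into cells and sprays them. The following is a minimal Python sketch of that contrast, not the FBOSS or switch-ASIC implementation. Everything in it is an assumption made for illustration: CRC32 stands in for the ECMP hash, spraying is modeled as plain round-robin, and the 256-byte cell size, the flow names, and both function names are hypothetical.

```python
import zlib


def ecmp_link_loads(flows, num_links):
    """Per-flow hashing: every byte of a flow lands on the one link its ID hashes to."""
    loads = [0] * num_links
    for flow_id, size in flows.items():
        # CRC32 is only a stand-in for a real ECMP hash function.
        link = zlib.crc32(flow_id.encode()) % num_links
        loads[link] += size
    return loads


def spray_link_loads(flows, num_links, cell_size=256):
    """Cell spraying: cut each flow into cells and hand them to links round-robin."""
    loads = [0] * num_links
    next_link = 0
    for size in flows.values():
        offset = 0
        while offset < size:
            cell = min(cell_size, size - offset)  # last cell may be shorter
            loads[next_link] += cell
            next_link = (next_link + 1) % num_links
            offset += cell
    return loads


if __name__ == "__main__":
    # Four equal elephant flows over eight links (hypothetical traffic pattern).
    flows = {f"gpu{i}->gpu{i + 4}": 1_024_000 for i in range(4)}
    print("ECMP bytes per link: ", ecmp_link_loads(flows, 8))
    print("Spray bytes per link:", spray_link_loads(flows, 8))
```

With only four flows and eight links, per-flow hashing leaves at least four links idle however the hash falls, and any collision stacks two whole flows on one link. Spraying at cell granularity spreads the same bytes across all links, which is the intuition behind the "near-optimal load balancing" claim; the cost is that the receiving side must reassemble cells, which the slides say is done in hardware.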

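Based on the report, the main points are summarized as follows:
1. **FBOSS upgrade**: Meta Platforms Inc. and Broadcom Inc. are evolving FBOSS to support generative AI network workloads.
2. **DSF technology**: Disaggregated Scheduled Fabric (DSF) is used to address elephant flows, low entropy, and oscillatory behavior under congestion, giving near-optimal load balancing and smoother bandwidth delivery.
3. **Performance improvement**: The DSF dual-stage topology provides an 18K GPU non-blocking cluster, reduces latency by more than 40%, and has minimal impact on collective or job performance.
4. **Key enhancements**: These include user-configurable sFlow sampling, mirror-on-drop, enhanced debugging support, traffic prioritization, Hyperport, and control-plane enhancements.
5. **FBOSS enhancements**: Increased VOQ scale, system port allocation, and global load balancing, to support larger AI workloads.
6. **Performance testing**: DSF (by context, the dual-stage topology) is reported to outperform the widely deployed single-stage DSF while providing a larger non-blocking domain.
7. **Future work**: Continued development of traffic prioritization, Hyperport, control-plane enhancements, and related features.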
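The deck attributes DSF's smoother bandwidth delivery to credit allocation and mentions virtual output queues (VOQs). The sketch below illustrates the general idea of credit-based scheduling only: an egress port hands out credits to backlogged ingress VOQs and never grants more per round than it can drain. It is a toy model under stated assumptions, not the scheduler used by FBOSS or by any Broadcom device; the function name `grant_credits`, the round-robin policy, the one-cell-per-credit unit, and the device names are all hypothetical.

```python
from collections import deque


def grant_credits(voq_backlog, port_capacity_cells):
    """Run one scheduling round for a single egress port.

    voq_backlog maps an ingress device name to the number of cells it has
    queued for this port. Credits are granted one cell at a time, round-robin
    over backlogged VOQs, until the port's per-round capacity is used up or
    no backlog remains. Returns a dict of cells granted per ingress device.
    """
    grants = {src: 0 for src in voq_backlog}
    remaining = dict(voq_backlog)
    active = deque(src for src, cells in voq_backlog.items() if cells > 0)
    budget = port_capacity_cells

    while budget > 0 and active:
        src = active.popleft()
        grants[src] += 1
        remaining[src] -= 1
        budget -= 1
        if remaining[src] > 0:
            active.append(src)  # still backlogged: rejoin the rotation
    return grants


if __name__ == "__main__":
    backlog = {"rdsw1": 10, "rdsw2": 3, "rdsw3": 0}  # hypothetical ingress devices
    print(grant_credits(backlog, port_capacity_cells=8))
```

Because an ingress device sends only what it has been granted, the total arriving at the egress port in a round cannot exceed the port's capacity, so congestion is absorbed in the ingress queues instead of causing drops and oscillation inside the fabric.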