Analysis of Flow engineering and Load Balancing Options on AI Fabrics
Kamini Santhanagopalan, Product Management, Broadcom
Danny Hanson (CCIE#4482), Product Management, Supermicro

OCP SPECIAL FOCUS: ARTIFICIAL INTELLIGENCE (AI)

AI Cluster Network Attach

Focus for Today
Use Case: Backend GPU Fabrics for AI/ML workloads
Target Deployments: Backend Fabric for AI Ethernet fabrics

Key Benefits
Solution:
o AI training GPU clusters using TH4 or TH5 for scale-out fabrics
Features:
o RoCEv2
o ECMP Enhancements (various hashing mechanisms, CLI configurability)
o Dynamic Load Balancing (DLB)
o Cut-through switching
Simple and Cost effective
o Simple 2 stage fabrics
o Supports Any-rail architecture (Rail-only, Multi-rail, non-Rail)
o No proprietary technology, all Ethernet
o High radix Merchant silicon switches
o ODM/ODM hardware
o No vendor lock-in

Example Deployment: Building a 2048 GPU Cluster for AI Training
[Figure labels: Spine-1 ... Spine-32; Leaf-1, Leaf-8, Leaf-56, Leaf-64; GPU Server, Networking Switch, Software; Thor 2 400G, Tomahawk5, Atlas]

What Makes AI Networking Unique?
GPU to GPU Communication Drives High Bandwidth Utilization
o High bandwidth flows
o Fewer flows, but elephant flows
o RDMA dominant traffic
o Synchronized and bursty traffic
o Link saturation happens in microseconds
o Training jobs run for long periods of time
o Tail latency impacts JCT
[Figure labels: Compute, Synchronize, Communicate]

Network Flows: Topology Optimization
o Non-Rail Optimized (for cabling optimization)
o Rail Optimized (for traffic optimization)
o Rail Only (for switch optimization)

Typical 2 Tier and 3 Tier Fabrics
o Rack Optimized 2 Tier
o Rack Optimized 3 Tier

In Traditional Data Center Fabric
Diagram labels: Leaf-A1, Leaf-B1, Spine-S1, Spine-S2, Spine-S8, Leaf-A2, Leaf-B2; H1, H2, Hn, Hx (shown twice); A1, D2, D1, W1, A
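
The example deployment named earlier (a 2048 GPU cluster drawn with Leaf-1 to Leaf-64 and Spine-1 to Spine-32) can be sanity-checked with simple port arithmetic. The sketch below is not taken from the slides. It assumes one NIC port per GPU, one link between every leaf and every spine, and the same speed on every port; the class and field names are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TwoStageFabric:
    """Port budget for a 2-stage (leaf-spine) fabric.

    Assumptions (not stated in the slides): one NIC port per GPU,
    all ports run at the same speed, and every leaf connects to
    every spine with the same number of links.
    """

    gpus: int
    leaves: int
    spines: int
    links_per_leaf_spine_pair: int = 1  # assumed: a single uplink per pair

    @property
    def gpus_per_leaf(self) -> int:
        # GPU-facing (downlink) ports needed on each leaf.
        if self.gpus % self.leaves:
            raise ValueError("GPUs do not divide evenly across leaves")
        return self.gpus // self.leaves

    @property
    def uplinks_per_leaf(self) -> int:
        # Spine-facing ports needed on each leaf.
        return self.spines * self.links_per_leaf_spine_pair

    @property
    def leaf_ports(self) -> int:
        # Total ports consumed on each leaf switch.
        return self.gpus_per_leaf + self.uplinks_per_leaf

    @property
    def spine_ports(self) -> int:
        # Total ports consumed on each spine switch.
        return self.leaves * self.links_per_leaf_spine_pair

    @property
    def oversubscription(self) -> float:
        # Downlink-to-uplink ratio per leaf; 1.0 means non-blocking.
        return self.gpus_per_leaf / self.uplinks_per_leaf


# Counts read from the example deployment diagram.
fabric = TwoStageFabric(gpus=2048, leaves=64, spines=32)

print("GPUs per leaf:   ", fabric.gpus_per_leaf)
print("Uplinks per leaf:", fabric.uplinks_per_leaf)
print("Ports per leaf:  ", fabric.leaf_ports)
print("Ports per spine: ", fabric.spine_ports)
print("Oversubscription:", fabric.oversubscription)
```

Under these assumptions each leaf needs 32 GPU-facing ports and 32 uplinks, so both leaves and spines fit in 64-port switches and the leaf tier is not oversubscribed. Changing `links_per_leaf_spine_pair` or the spine count shows how the port budget and the oversubscription ratio move together.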