Building a Source-Routed AI Backend Network with SRv6 (12 pages)
SRv6 uSID for Hyperscalers: SDN use cases

Guohan Lu, Partner Software Engineering Manager, Microsoft
With contributions from Ahmed Abdelsalam (Cisco), Rita Hui (Microsoft), Changrong Wu (Microsoft)

Microsoft Global Cloud Network

SONiC Is Powering Cloud At Scale
In Microsoft:
- 90% of T0 switches running SONiC
- 80% of T1 switches running SONiC
- Started rolling out T2 with SONiC in 2024

Artificial Intelligence in the Cloud: Raising the Bar for Hyperscale Datacenter Networks
New traffic pattern:
- Small number of large flows
- Periodic bursts of data sent synchronously
Challenges:
- Traditional passive ECMP-based load-balancing mechanisms suffer from a low-entropy problem.
- Communication failures in LLM training are costly:
  - An epoch of training is blocked until the synchronized collective communications of the previous epoch finish.
  - GPU hours are an expensive resource because of tight supply.
  - If an ongoing training job crashes, all progress since the last checkpoint may be lost.
[Figure: at the backend of a data center, hosts (CPU, NIC, SSD) with four GPUs each, T0R switches, and leaf switches. AI workloads/applications: machine learning, deep learning, recommender systems, natural language processing, computer vision.]

SRv6 for Backend Network
- Provides fine-grained network control based on source routing
- Enables path enumeration for traffic management
- Integration with AI workload flow scheduling provides optimal network performance
- Allows the source to quickly reroute upon path failures or congestion
[Figure: Plane 1, Plane 2, Plane 3, Plane 4]

SRv6 for 2-layer topology using uSID
[Figure: switch labels 01T1, 02T1, 180T1, 01T0, 224T0; uSID labels 0xd001, 0xd002, 0xd0b3, 0xd100, 0xd1e0; nic uSID 00a0]
SRv6 packet from NIC, DstIPv6:
  fcbb:bbbb:d100:d001:d1e0:00a0:
  (32-bit block, 1st hop, 2nd hop, 3rd hop, 4th hop)
When congestion or failure is detected:
  fcbb:bbbb:d100:d002:d1e0:00a0:
  (32-bit block, 1st hop, 2nd hop, 3rd hop, 4th hop)
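
The uSID encoding on the 2-layer topology slide can be sketched in a few lines of Python. This is a minimal illustration, not the deck's implementation: the function names (usid_carrier, enumerate_paths, pick_path) are hypothetical, the uSID values are copied from the slide, and the bits after the 4th hop are assumed to be zero because the slide shows the address ending in a single colon. It packs the 32-bit block and four 16-bit uSIDs into one IPv6 destination address, enumerates one address per alternate 2nd hop, and selects a different one when a hop is marked as congested or failed.

```python
import ipaddress

BLOCK_BITS = 32   # uSID block length shown on the slide ("32-block")
USID_BITS = 16    # each hop on the slide occupies one 16-bit group


def usid_carrier(block: int, usids: list[int]) -> ipaddress.IPv6Address:
    """Pack a 32-bit block and a list of 16-bit uSIDs into one IPv6 address.

    Unused trailing bits are left as zero (an assumption of this sketch).
    """
    max_usids = (128 - BLOCK_BITS) // USID_BITS  # six uSIDs fit after the block
    if len(usids) > max_usids:
        raise ValueError(f"at most {max_usids} uSIDs fit in one address")
    shift = 128 - BLOCK_BITS
    value = block << shift
    for usid in usids:
        if not 0 <= usid <= 0xFFFF:
            raise ValueError(f"uSID out of 16-bit range: {usid:#x}")
        shift -= USID_BITS
        value |= usid << shift
    return ipaddress.IPv6Address(value)


# Values copied from the slide; the variable names are hypothetical.
BLOCK = 0xFCBBBBBB
HOP1, HOP3, HOP4_NIC = 0xD100, 0xD1E0, 0x00A0
ALTERNATE_HOP2 = [0xD001, 0xD002]  # the two 2nd-hop uSIDs shown on the slide


def enumerate_paths(hop2_usids: list[int]) -> dict[int, ipaddress.IPv6Address]:
    """Return one destination address per candidate 2nd hop."""
    return {
        hop2: usid_carrier(BLOCK, [HOP1, hop2, HOP3, HOP4_NIC])
        for hop2 in hop2_usids
    }


def pick_path(paths: dict[int, ipaddress.IPv6Address],
              unhealthy: set[int]) -> ipaddress.IPv6Address:
    """Pick the first path whose 2nd hop is not marked congested or failed."""
    for hop2, dst in paths.items():
        if hop2 not in unhealthy:
            return dst
    raise RuntimeError("no healthy path available")


paths = enumerate_paths(ALTERNATE_HOP2)
primary = pick_path(paths, unhealthy=set())
backup = pick_path(paths, unhealthy={0xD001})
print(primary.exploded)
print(backup.exploded)
```

Because the whole path lives in the destination address, rerouting in this sketch is only a matter of the sender writing a different 2nd-hop uSID, which matches the slide's two addresses that differ in a single group.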