SRv6 uSID for Hyperscalers
SDN use cases

Guohan Lu
Partner Software Engineering Manager, Microsoft

With contributions from
Ahmed Abdelsalam, Cisco
Rita Hui, Microsoft
Changrong Wu, Microsoft


Microsoft Global Cloud Network


SONiC Is Powering Cloud At Scale

In Microsoft:
- 90% of T0 switches running SONiC
- 80% of T1 switches running SONiC
- Started rolling out T2 with SONiC in 2024


Artificial Intelligence in the Cloud
Raising the Bar for Hyperscale Datacenter Networks

New Traffic Pattern:
- Small number of large flows
- Periodic bursts of data sent synchronously

Challenges:
- Traditional passive ECMP-based load balancing mechanisms suffer from a low-entropy problem.
- Failures of communications in LLM training are costly:
  - An epoch of training is blocked until the synchronized collective communications of the last epoch finish.
  - GPU hours are expensive resources because of tight supply.
  - If an ongoing training job crashes, all progress since the last checkpoint may be lost.

[Figure: hosts (Host CPU, NIC, SSD, four GPUs each) connected through T0R and Leaf switches. AI Workloads/Applications: Machine Learning, Deep Learning, Recommender System, Natural Language Processing, Computer Vision.]


SRv6 for Backend Network

At the backend of a Data Center:
- Provides fine-grained network control based on source routing
- Enables path enumeration for traffic management
- Integration with AI workloads flow scheduling provides optimal network performance
- Allows the source to quickly reroute upon path failures or congestion

[Figure: Plane 1, Plane 2, Plane 3, Plane 4]


SRv6 for 2-layer topology using uSID

[Figure: 2-layer topology of T0 and T1 switches and NICs. Labels: 01 T1, 02 T1, 01 T0, 180 T1, 224 T0. uSIDs: 0xd001, 0xd002, 0xd100, 0xd0b3, 0xd1e0. NIC: 00a0.]

Dst IPv6 of SRv6 packet from NIC:

  fcbb:bbbb:d100:d001:d1e0:00a0:

  32-block    1st hop   2nd hop   3rd hop   4th hop
  fcbb:bbbb   d100      d001      d1e0      00a0 (nic)

  fcbb:bbbb:d100:d002:d1e0:00a0:

  32-block    1st hop   2nd hop   3rd hop   4th hop
  fcbb:bbbb   d100      d002      d1e0      00a0 (nic)

When congestion or failure is detected
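
The destination address layout on this slide (a 32-bit block followed by one 16-bit uSID per hop) can be illustrated with a minimal Python sketch. The function names `build_usid_address` and `shift_and_forward` are hypothetical and not taken from the slides. The shift step assumes the usual uSID behavior, where each hop consumes its own uSID and shifts the remaining ones left by 16 bits; the slides themselves only show the address as sent by the NIC.

```python
import ipaddress

BLOCK_BITS = 32   # "32-block" on the slide (fcbb:bbbb)
USID_BITS = 16    # one uSID per hop (d100, d001, d1e0, 00a0)
REST_BITS = 128 - BLOCK_BITS


def build_usid_address(block: int, usids: list[int]) -> ipaddress.IPv6Address:
    """Pack a 32-bit block and a list of 16-bit uSIDs into one IPv6 address."""
    if len(usids) > REST_BITS // USID_BITS:
        raise ValueError("too many uSIDs for a single 128-bit address")
    value = block << REST_BITS
    for position, usid in enumerate(usids):
        if not 0 <= usid < (1 << USID_BITS):
            raise ValueError(f"uSID out of range: {usid:#x}")
        # The first uSID sits directly after the block, the next one 16 bits later.
        value |= usid << (REST_BITS - USID_BITS * (position + 1))
    return ipaddress.IPv6Address(value)


def shift_and_forward(addr: ipaddress.IPv6Address) -> ipaddress.IPv6Address:
    """Assumed per-hop behavior: drop the active uSID, shift the rest left, pad with zeros."""
    value = int(addr)
    block = (value >> REST_BITS) << REST_BITS
    rest = value & ((1 << REST_BITS) - 1)
    rest = (rest << USID_BITS) & ((1 << REST_BITS) - 1)
    return ipaddress.IPv6Address(block | rest)


BLOCK = 0xFCBBBBBB

# Path from the slide: d100 -> d001 -> d1e0 -> NIC 00a0.
primary = build_usid_address(BLOCK, [0xD100, 0xD001, 0xD1E0, 0x00A0])

# Alternate path from the slide: only the 2nd-hop uSID changes (d001 -> d002).
alternate = build_usid_address(BLOCK, [0xD100, 0xD002, 0xD1E0, 0x00A0])

if __name__ == "__main__":
    print("primary:  ", primary.exploded)
    print("alternate:", alternate.exploded)
    print("after 1st hop:", shift_and_forward(primary).exploded)
```

Because the whole path is encoded in the destination address, moving a flow to the alternate path only requires the sender to write a different 2nd-hop uSID; no per-flow state changes in the switches.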