Proactive Link Management in AI Networks - Lessons from Meta
Meta Platforms Inc
Bruno Novais, Production Engineer / Meta
Harshit Gulati (Presenter), Software Engineer / Meta
NETWORKING

Outline
  Call To Action
  Improved Link Management
  Traditional Link Management
  Motivation
  Context

The Scale of the Challenge
  5,000+ Optical Circuits: in a 4k GPU cluster at leaf-spine level
  10,000+ Optical Transceivers: required for connections
  100,000+ Total Optics: in large-scale clusters

Impact of Link Failures
  Retransmission Required: increases latency
  Performance Degradation: large impact with spraying of traffic
  Job Interruptions: workloads must restart from checkpoints
  Business Impact: costly downtime

Design Challenges
  Breakout Interfaces: Splitting high-speed ports increases failure points. A single 400G port becomes four 100G interfaces with more components.
  Fabric Interface Technologies: Complex technology increases the need for better signal integrity and monitoring.
  Managed Network Interfaces: Each additional interface requires monitoring and management. The operational blast radius increases during repairs.

Sources of Link Failures
  Optical Transceiver Issues: manufacturing defects or degradation
  Software Tuning: misconfigured parameters or driver tuning
  Fiber Contamination: dust or debris causing signal attenuation
  Physical Repairs: increased complexity during maintenance
  Firmware Bugs: undetected software issues

Traditional Approach to Link Management
  Provisioning: Inject traffic from the CPU to verify link stability and the absence of CRC errors.
  Live: React to link flaps or errors and drain the link.
  Repair: Repair the link in its current state.
  Detects link problems only after they have affected training jobs.
  Determining when to drain is a hard exercise.
  Marginal links. Example: flaps once a day
  Repeat Offenders. Ex