Enhancing Optical Link Reliability for AI Infrastructure
Susu He, Meta; Xun Jiao, Meta; Han Wang, Meta
OCP SPECIAL FOCUS: ARTIFICIAL INTELLIGENCE (AI)

Link reliability is one of the top four contributors to AI job interruptions. Optical link reliability is important for sustaining AI workload efficiency and optimizing TCO. A comprehensive framework is essential for enhancing end-to-end (E2E) optical link reliability.

Background

The legacy passive approach, which relies on RMA and triage after a link failure has occurred, will NOT:
o Maximize the work done by the cluster
o Maximize the Efficient Training Time (ETT)
o Minimize the cluster degradation
o Minimize the interruption

As infrastructure continues to scale, a link failure rate that is not reduced puts increasing pressure on serviceability, sparing strategy, and recovery efficiency.

Problem Statement

The main challenge is shifting from a passive to a proactive approach to meet the upcoming 10x infrastructure reliability and efficiency demands while optimizing TCO. Our methodology enables early failure detection on 100G/lane links by:
o Monitoring optics performance metrics to enable predictive, data-driven models that detect link degradation before failure;
o Applying ML models to forecast potential optical link failures.

This framework reduces link interruption rates, improving uptime, resilience, and performance in hyperscale AI infrastructure.

Existing challenges & methodology

Each link, identified by a unique ID, has a set of features corresponding to the proposed monitoring metrics. Each link is also labeled to indicate whether a link failure occurs within a defined future time window, which enables predictive analysis through machine learning.

Sample key metrics monitored as input features: Transmit