Alert Fatigue in Hyperscale Environments: A Metrics-Based Approach to Signal Tuning in Open Networks
Pooja Gupte
Networking

It's 2:14 am and you are on-call

The Problem: Alert Fatigue

Scale of Hyperscale Cloud
Internet, WAN, Region, DC Network
Retail, Enterprise, Media & Entertainment customers
Backbone Transport, Routing Domain, Resiliency & Redundancy
50 geographic regions worldwide
Millions of physical network devices

Why Hyperscale Makes It Worse
Failures:
Telemetry lag, correlation failure, configuration drift
Port failure, misconfigured VLAN, control plane crash
NIC failure, OS/driver crash, server-to-ToR link flap
Link congestion, spine switch outage, packet drops due to buffer exhaustion

When One Small Failure Becomes a Storm
The Noise

Introducing the Solution
Metrics like TTD, TTM and TTN help focus on what matters.

The Shift
TTD (Time to Detect): event start time to alert creation time.
TTM (Time to Mitigate): time it took to mitigate the issue and stop customer impact.
TTN (Time to Notify): time it took to notify the customer about the impact of this event on their workloads.

Metrics Driven Solution
Telemetry flow
ToR, Spine, Fabric switches
Port counters, error rates, link state
Statistical/ML driven
AI/Agentic workflows

Metrics Driven Solution: TTD

timestamp            dc     rack  device      layer   alert_name                    severity
2025-08-21 10:00:00  DC-01  R04   Server-213  Server  Packet retransmission errors  warning
2025-08-21 10:04:00  DC-01  R04   Server-219  Server  High latency observed         warning
2025-08-21 10:14:00  DC-01  R04   ToR-17      ToR     Link flap detected            critical
2025-08-21 10:15:00  DC-01  R04   ToR-17      ToR     Interface errors rising       major
2025-08-21 10:18:00  DC-01  R04   Spine-05    Spine   Northbound congestion         warning
2025-0
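
The three metrics defined on the slides (TTD, TTM, TTN) reduce to timestamp differences, so a minimal sketch may help make them concrete. The slides define TTD as event start time to alert creation time. They do not say which reference point TTM and TTN are measured from, so this sketch assumes event start for both. The `Incident` class, its field names, and every timestamp except the 10:00:00 alert from the table are hypothetical.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta

TS_FORMAT = "%Y-%m-%d %H:%M:%S"  # same format as the alert table's timestamp column


def parse_ts(text: str) -> datetime:
    """Parse a timestamp written like '2025-08-21 10:00:00'."""
    return datetime.strptime(text, TS_FORMAT)


@dataclass
class Incident:
    """Hypothetical record of one incident; not a schema from the talk."""
    event_start: datetime        # when the underlying failure began
    alert_created: datetime      # when the first alert was created
    mitigated: datetime          # when customer impact stopped
    customer_notified: datetime  # when the customer was told about the impact

    def ttd(self) -> timedelta:
        # Time to Detect: event start time to alert creation time (per the slides).
        return self.alert_created - self.event_start

    def ttm(self) -> timedelta:
        # Time to Mitigate. Assumption: measured from event start.
        return self.mitigated - self.event_start

    def ttn(self) -> timedelta:
        # Time to Notify. Assumption: measured from event start.
        return self.customer_notified - self.event_start


incident = Incident(
    event_start=parse_ts("2025-08-21 09:58:00"),        # hypothetical
    alert_created=parse_ts("2025-08-21 10:00:00"),      # first alert row in the table
    mitigated=parse_ts("2025-08-21 10:40:00"),          # hypothetical
    customer_notified=parse_ts("2025-08-21 10:20:00"),  # hypothetical
)

print("TTD:", incident.ttd())
print("TTM:", incident.ttm())
print("TTN:", incident.ttn())
```

Keeping the values as `timedelta` objects rather than raw minutes leaves the choice of unit to whatever reporting layer consumes them.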