Alert Fatigue in Hyperscale Environments: A Metrics-Based Approach to Signal Tuning in Open Networks
Pooja Gupte
Networking

It's 2:14 am and you are on call

The Problem: Alert Fatigue

Scale of Hyperscale Cloud
Internet, WAN, Region, Region, DC Network
Retail, Enterprise, Media & Entertainment customers
Backbone Transport, Routing Domain, Resiliency & Redundancy
50 geographic regions worldwide
Millions of physical network devices

Why Hyperscale Makes It Worse
Failures:
Telemetry lag, correlation failure, configuration drift
Port failure, misconfigured VLAN, control plane crash
NIC failure, OS/driver crash, server-to-ToR link flap
Link congestion, spine switch outage, packet drops due to buffer exhaustion

When One Small Failure Becomes a Storm
The Noise

Introducing the Solution
Metrics like TTD, TTM and TTN help focus on what matters

The Shift
TTD (Time to Detect): time from event start to alert creation
TTM (Time to Mitigate): time it took to mitigate the issue and stop customer impact
TTN (Time to Notify): time it took to notify the customer about the impact of this event on their workloads

Metrics Driven Solution
Telemetry flow
ToR, Spine, Fabric switches
Port counters, error rates, link state
Statistical/ML driven
AI/Agentic workflows

Metrics Driven Solution: TTD

timestamp            dc     rack  device      layer   alert_name                    severity
2025-08-21 10:00:00  DC-01  R04   Server-213  Server  Packet retransmission errors  warning
2025-08-21 10:04:00  DC-01  R04   Server-219  Server  High latency observed         warning
2025-08-21 10:14:00  DC-01  R04   ToR-17      ToR     Link flap detected            critical
2025-08-21 10:15:00  DC-01  R04   ToR-17      ToR     Interface errors rising       major
2025-08-21 10:18:00  DC-01  R04   Spine-05    Spine   Northbound congestion         warning
2025-0