Mitigating Silent Data Corruption: Industry-Academia Collaboration & Progress
Emel Goksu, Meta
ARTIFICIAL INTELLIGENCE (AI)

SDC represents a distinctive and challenging class of errors: difficult to detect, model, and mitigate.
Resolution can be extremely challenging, often requiring months of debugging.
Impact can be significant at scale, becoming increasingly relevant as data centers and clusters expand to handle growing AI workloads.

Not rare anymore, no single root cause:
Meta: "... We observe that CPU SDCs are orders of magnitude higher than soft-error based FIT simulations."
Google: "... we observe on the order of a few mercurial cores per several thousand machines."

Process marginalities
Design errors
Degradation & aging
Test coverage & test escapes
Soft error rate (SER) or single event upset (SEU)
Silent Data Corruption (SDC)

Drive solutions and best practices that prevent and detect SDCs.
Create awareness about SDC challenges across the computing community.
Partner & engage with the academic community to actively address growing SDC challenges.

OCP Server Component Resilience Working Group:
Tejasvi Chakravarthy, Harish Dixit, Emel Goksu, Rob Chappell, Nishant George, Thiago Macieira, Sankar Gurumurthy, Vilas Sridharan, Lisa Minwell, Amber Huffman, Bharath Parthasarathy

Specification 1.0: released in 2024
Test Input & Output
Part History
Metrics
Test Framework & Flow
https://www.opencompute.org/documents/external-ver-1-0-open-compute-specification-server-component-resilience-sdc-workstream-docx-pdf

What is next?
AI Developer Handbook

AI: Defining the core challenge
How can the AI community ensure that subtle hardware errors manifesting as SDC do not undermine the integrity, accuracy, and trustworthiness of models deployed at scale, given the unique workload characteris