Building Resilient GPU Fabrics for Demanding AI/ML Workloads

Jag Brar
Kannan Raj
OCI Architect
Rishabh Vardhan Harikrishnan
Principal Engineer
OCI VP & Distinguished Engineer
October 16, 2025

Copyright 2025, Oracle and/or its affiliates | Confidential: Internal/Restricted/Highly Restricted

Safe harbor statement

The following is intended to outline our general product direction. It is intended for information purposes only, and may not be incorporated into any contract. It is not a commitment to deliver any material, code, or functionality, and should not be relied upon in making purchasing decisions. The development, release, timing, and pricing of any features or functionality described for Oracle's products may change and remains at the sole discretion of Oracle Corporation.

Agenda

1. OCI AI Fabrics
2. AI Workloads are different
3. Tyranny of large numbers
4. Building and Validating Resilient Fabrics
5. Link Flaps and impact
6. Failure modes and mitigation
7. How can the industry help?
8. Look ahead

AI Infrastructure

Key Elements of AI Infrastructure

GPUs
Memory
Power
Cooling
Storage
Network

AI Workloads != Regular Workloads
AI Infrastructure != Regular Infrastructure

AI workloads are different

Network usage is clumpy
Fewer network flows
Bigger and faster flows
Network link failures take 10s of seconds to recover
Failures have significant impact on AI training
AI workloads are inherently synchronized

AI Training

Copyright 2025,Oracle and/or its affiliates|Confidential