Extending Generative AI to Mainstream Hyperscale Cloud
NVIDIA Spectrum-X Network Platform Architecture
Barak Gafni, Network Architect, NVIDIA
Gilad Shainer, SVP Networking, NVIDIA

[Figure: artificial intelligence (AI) data center with North-South and East-West traffic]
East-West for Distributed and Disaggregated Processing
North-South for User-to-Cloud Communications

The Network Defines the Data Center

Control/User Access Network (North-South), Traditional Ethernet:
Loosely Coupled Applications
TCP (Low Bandwidth Flows and Utilization)
High Jitter Tolerance
Heterogeneous Traffic
Average Multi-Pathing

AI Compute Fabric (East-West), Spectrum-X:
Distributed Tightly-Coupled Processing
RoCE (High Bandwidth Flows and Utilization)
Low Jitter Tolerance (Long Tail Kills Performance)
Bursty Network Capacity
Predictable Performance

What Does a GPT Workload Look Like?
[Figure: LLM NCCL AllReduce on Traditional Ethernet]

The Network Defines the Data Center
NCCL (NVIDIA Collective Communications Library) is the SDK library for AI communications. It connects the GPUs and the network for AI network operations.
[Figure: worst, average, and best job placement of jobs J1 and J2 across switches]

AI-optimized networking for every data center
RoCE Adaptive Routing (local/remote information, at packet granularity)
Congestion Control (telemetry probes)
Noise Isolation (multiple jobs or a single large-scale job)
High-frequency telemetry (1000x)

Spectrum-X Brings High-Performance AI to Ethernet
BlueField-3 SuperNIC
16 Arm 64-Bit Cores
16 Core/256 Threads Datapath Accelerator
ConnectX NIC, DDR memory interface, PCIe switch
Spectrum-X800
BlueField-3 SuperNIC
Spectrum-X80