Scaling OCP NIC to 1.6T and beyond for AI
Damien Chong, Hardware Tech Lead, Meta
Hemal Shah, Distinguished Engineer and Architect, Broadcom
SERVER: AI HW SW CO-DESIGN / NIC / HPC

Preview
This presentation discusses:
- NIC in AI Backend Network
- NIC 1.6T+ Characteristics
- Path & challenges to OCP NIC 1.6T and beyond

AI Infrastructure Network Connectivity
[Figure: Scale-Out Network, Scale-Up Network, and Internal Connectivity linking CPU, XPU, NIC, and NVMe SSD]
- High-Bandwidth: 800G and above
- Large scale: 100K-1M XPUs
- Messaging Semantics
- Ultra Low Latency
- Supports Peer-2-Peer Data Transfer
- XPU-XPU Connectivity
- Memory Semantics

NICs in AI systems
[Figure: CPUs and GPUs, each attached to NICs]
CPU Front-End Network
- Send & Receive Data
- Pre-process Data
- Schedule Jobs
GPU Scale-Out Network
- Parallel GPU-GPU Compute beyond single rack
- Model Training
* The GPU Scale-Up Network within the rack is typically not served by a NIC and is not part of the discussion today.

Zoom into Front-End NIC for AI systems
[Figure: CPUs attached to NICs]
CPU Front-End Network
- Send & Receive Data
- Pre-process Data
- Schedule Jobs
Medium traffic intensity is satisfied by a 400G/800G NIC, which is well supported by OCP NIC SFF/TSFF.
Next-gen: expand to 1.6T and/or Liquid Cooling.

Zoom into Back-End NIC for AI systems
[Figure: GPUs attached to NICs]
GPU Scale-Out Network
- AI racks are becoming increasingly dense because the scale-up in-rack network provides much higher bandwidth than the scale-out rack-to-rack network
- Dense AI racks also squeeze the area available for the Scale-out network solution
- Desire & demand for high GPU-to-GPU interconnectivity drive high Scale-out bandwidth
* The GPU Scale-Up Network within the rack is typically not served by a NIC and is not part of the discussion today.
Bandwidth per Area efficiency is Important
PCIe Gen 6 and above host interface (x16 or x32 or x48 or x64 lanes co