Intel Gaudi 3 AI Accelerator: Architected for Gen AI Training and Inference
Roman Kaplan, Ph.D.
Principal AI Performance Architect, Intel Corporation
August 2024


Gaudi Product Generations

Product Parameter                   Gaudi        Gaudi 2       Gaudi 3
Process node                        16 nm        7 nm          5 nm
TDP (OAM)                           400 W        600 W         900 W (air) / 1200 W (liquid)
Peak Compute (BF16)                 60 TFLOPs    432 TFLOPs    1835 TFLOPs
HBM Capacity                        32 GB        96 GB         128 GB
Peak HBM BW                         900 GB/s     2.46 TB/s     3.67 TB/s
Peak PCIe BW (bi-directional)       64 GB/s      64 GB/s       128 GB/s
Embedded NIC BW (bi-directional)    2 Tb/s       4.8 Tb/s      9.6 Tb/s

OAM: Open Compute Project (OCP) Accelerator Module


Intel Gaudi 3 AI Accelerator OAM

- 2 compute dies connected over an interposer bridge
- 8 HBM2e stacks
- Up to 900 W with air cooling
- Up to 1200 W with liquid cooling
- PCIe Gen5 x16
- 24x 200GbE RoCE via 48 112G PAM4 SerDes


Intel Gaudi 3 Spec and Block Diagram

Feature/Product                        Intel Gaudi 3 Accelerator
BF16 Matrix TFLOPs                     1835
FP8 Matrix TFLOPs                      1835
BF16 Vector TFLOPs                     28.7
MME Units                              8
TPC Units                              64
HBM Capacity                           128 GB
HBM Bandwidth                          3.67 TB/s
On-die SRAM Capacity                   96 MB
On-die SRAM Bandwidth (L2 Cache)       12.8 TB/s
Networking                             1200 GB/s bidirectional
Host Interface                         PCIe Gen5 x16
Host Interface Peak BW                 128 GB/s bidirectional
Media Engine                           Rotator + 14 Decoders (HEVC, H.264, JPEG, VP9)

[Figure: Intel Gaudi 3 AI Accelerator Block Diagram]


Intel Gaudi 3 AI Accelerator Block Diagram

- Uniform memory mapping of HBM by MMU
- Compute is clustered: 4 Deep Learning Cores (DCORE)
- Each DCORE: 2x MME, 16x TPC, 24 MB cache
- L2 and L3 data caches:
    - L2: Allocated only in DCORE cache
    - L3: Uniformly distributed across all DCORE caches
- Media accelerators: Decoder and Rotator
- NW sub-system containing: 24 RDMA NIC 200GbE ports (details in a separate slide)
- Control has a separate block and NOC fabric


Architecture in Depth

Executes all matrix multipl