SambaNova SN40L RDU: Breaking the Barrier of Trillion+ Parameter Scale Gen AI Computing
Raghu Prabhakar
Architect, SambaNova Systems
HotChips 2024
Copyright 2024 SambaNova Systems Inc.

SN40L: SambaNova's New Language-Optimized RDU
"Cerulean" Architecture-based Reconfigurable Dataflow Unit
- 5nm TSMC
- 102B Transistors
- 1,040 RDU Cores
- 638 TFLOPS (bf16)
- 3-tier Dataflow Memory
- 520 MB On-Chip Memory
- 64 GB High Bandwidth Memory
- 1.5 TB High Capacity Memory
[Figure: Cerulean SN40L RDU. Generative AI Training and Inference]

SN40L: SambaNova's New Language-Optimized RDU
3-tier Memory System with SRAM, HBM, and DDR
- On-Chip SRAM: 8 GB, PBs per sec
- RDU High Bandwidth Memory: 1 TB
- RDU High Capacity DDR Memory: 24 TB
- Bandwidths labeled in the diagram: 1600 GB/s, 25.6 TB/s
- High throughput inference with caching
- Low Latency Model Switching (e.g., 0.01 s for llama3.1 8B)
- Dataflow enabled by large On-Chip Memory

SN40L Chip: Tile Architecture
1040 PCUs and PMUs
- PCU: Compute unit
- PMU: Memory unit
- S: Mesh switches
- AGCU: Portal to off-chip memory and IO

SN40L PCU
- Configurable as a systolic array or a SIMD vector unit with M lanes
- BF16, FP32, INT32, and INT8 compute data types, configurable storage data types
- Arithmetic, Logical, and Bitwise operations
- A cross-lane reduction tree (blue) to reduce along the vectorized dimension
- Tail stage provides transcendental functions, casting, and stochastic rounding capabilities

SN40L PMU
- Programmer managed scratchpad memory supports concurrent reads and writes
- Fragmentable, address-generation pipeline that can produce 4 addresses per cycle
- Data alignment crossbars enable high throughput tensor transformations such as transpose, dilation, downcast
- Address predication support enables composing multiple PMUs to store a larg