OCP Global Summit
October 18, 2023 | San Jose, CA

Architectural Challenges and Innovation for Compute Infrastructure Co-Design
Peipei Zhou, Assistant Professor, University of Pittsburgh

Generative AI Models: ChatGPT

Generative AI Models: Stable Diffusion, Dall-E

Transformer Models

Profiling a Transformer-based model, DeiT-T, on Nvidia GPU T4 (TSMC 12 nm)
Low Tensor Core utilization for INT8 MM kernels.
TensorRT adopts an implicit quantization policy, which leads to BMM being computed in FP32 when it could have been computed in INT8.
The quantization/dequantization between FP32 and INT8 consumes non-negligible GPU cycles.
The data layout change also consumes non-negligible GPU cycles.
The nonlinear kernels, e.g., Softmax, GELU, LayerNorm, take significant GPU cycles.
[Figure: Kernel Breakdown]

FPGA vs. GPU? GPU + FPGA?

Versal ACAP Architecture
[Figure: block diagram. Labels: DDR4-DIMM; AIE Array; IO; AIE VLIW Processor; 32 KB Mem; 25.6 GB/s; 1.2 TB/s; Programmable Logic (BRAM, URAM, CLB, DSP); NoC; Processor System (ARM)]

Heterogeneous Accelerator Architecture
Fine-Grained Pipeline
INT
Non-linear Functions (Softmax, GELU)

Reduces Latency by 10x over Nvidia GPU T4
[Figure: bar chart comparing GPU TensorRT and ACAP CHARM (ours). Models shown: DeiT-256, LV-ViT-T, DeiT-T, DeiT-160. Speedup labels shown: 5.7x, 10.3x, 7.3x, 8.9x]

From Heterogeneous Models to Heterogeneous System
Computation-Communication Aware
Scale-Out?

H2H: heterogeneous model to heterogeneous system mapping with computation and communication awareness, DAC 2022
Lower Latency, Lower Energy
https:/

Models to Heterogeneous Chiplet Systems with Heterogeneous Components
Computation & Communication Aware
Hierarchical Scheduling & Mapping
Latency vs Throughput
Chiplet?

Sustainability?
Source of CO2e from Meta Datacenters
Repackaging Chiplets

NSF CCF#2324
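
The profiling slide notes that quantization/dequantization between FP32 and INT8 costs GPU cycles, and that a BMM may fall back to FP32 even though it could run in INT8. The following is a minimal sketch of that difference, not TensorRT's or CHARM's implementation. It assumes symmetric per-tensor scaling, and the function names (quantize, dequantize, int8_dot, fp32_fallback_dot) are hypothetical and chosen only for illustration.

```python
def quantize(values, scale):
    """Map FP32 values to INT8 codes using a symmetric per-tensor scale (assumed scheme)."""
    return [max(-128, min(127, round(v / scale))) for v in values]


def dequantize(codes, scale):
    """Map INT8 codes back to FP32 values."""
    return [c * scale for c in codes]


def int8_dot(a_codes, b_codes, a_scale, b_scale):
    """Stay in integers: accumulate integer products, then rescale once at the end."""
    acc = sum(x * y for x, y in zip(a_codes, b_codes))
    return acc * a_scale * b_scale


def fp32_fallback_dot(a_codes, b_codes, a_scale, b_scale):
    """Fallback path: dequantize every operand first, then multiply in floating point."""
    a = dequantize(a_codes, a_scale)
    b = dequantize(b_codes, b_scale)
    return sum(x * y for x, y in zip(a, b))


if __name__ == "__main__":
    scale = 1.0 / 64  # hypothetical scale, chosen so the example values are exactly representable
    a = quantize([0.5, -1.0, 0.25], scale)
    b = quantize([1.0, 0.5, -0.5], scale)
    print(int8_dot(a, b, scale, scale))
    print(fp32_fallback_dot(a, b, scale, scale))
```

Both paths compute the same dot product here. The fallback path does one dequantization per operand element before the multiply, while the integer path rescales only once per output, which is the kind of extra conversion work the profiling slide points at.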