© 2025, Amazon Web Services, Inc. or its affiliates. All rights reserved.

AIM419
Build your own high-performance kernel on AWS Trainium with NKI

John Gray (He/Him), APL Solutions Architect, Annapurna Labs, AWS
Maen Suleiman, Head of Neuron Product Management, Annapurna Labs, AWS

Agenda
Act I: Why do we need Kernels?
Act II: How do we do kernels on Trainium?
Act III: Kernel Walkthrough on Hardware
Next Steps

Act I: Why do we need Kernels?

GenAI is EVERYWHERE
Chatbots, Agents, Code Assistants, Image Generation.
Performance is critical for the customer experience.

Typical AI Accelerator
Diagram labels: Host CPU, Host Memory, PCIe Bus, AI Accelerator, Accelerator Core, Tensor Engine, Vector Engine, Scalar Engine, On-Chip SRAM, High Bandwidth Memory (HBM)

Memory Hierarchy
SRAM: Size: MBs; Bandwidth: 10TB/s
Accelerator HBM: Size: 10s GB; Bandwidth: TB/s
Host Memory: Size: 10s GBs to TBs; Bandwidth: 0.5GB/s

EXECUTION ON AI ACCELERATOR
Simple Linear Layer with ReLU

    import torch

    A = torch.tensor([[1., 2.], [3., -1.]])
    B = torch.tensor([[2., 1.], [0., 3.]])
    bias = torch.tensor([1., -2.])

    # Matrix multiplication
    result = torch.matmul(A, B)    # [[2., 7.], [6., 0.]]

    # Addition (bias is broadcast across rows)
    result = result + bias         # [[3., 5.], [7., -2.]]

    # ReLU activation
    result = torch.relu(result)    # [[3., 5.], [7., 0.]]

Linear Layer with ReLU: Sequence of Operators
Hos