Accelerating AI Models with Andes Matrix Multiplication (AMM) and the RISC-V Vector (RVV) Extension: From CNN to LLM.pdf

Uploaded by: c** | ID: 955322 | 2025-10-27 | 15 pages | 956.84 KB

Pei-Hsiang Hung, Chung-Hua Yen, I-Wei Wu. Andes Technology

Accelerating AI Models with Andes Matrix Multiplication (AMM) from CNN to LLM

Taking Mainstream. Subject to change without notice. Copyright 2025 Andes Technology.

Outline
- Introduction to Andes Matrix Multiplication (AMM)
- Illustrations of AMM Scalability
- AMM Code Generation in IREE for LLM deployment
  - Choosing Optimal Tiling Size
  - Generating VLEN-Agnostic code
  - Handling LLM Prefill and Decode Stages
- Performance Estimates
- Conclusion

AMM Introduction
The Andes Matrix Multiplication (AMM) is being designed to optimize tiled matrix multiplication.
M×N tile = AMM(M×K tile, K×N tile)
Key features:
- The tiles are stored in the RVV vector registers.
- 2D load and store instructions facilitate data movement between memory and vector registers.
- Scalability across RVV VLEN, LMUL and SEW.

Understanding Tiling Size M, N and K (fully tiled cases):
1. M is always 2.
2. N equals VLEN/64.
3. K is determined by LMUL and SEW.
[Figure: matrix tiles with dimensions labeled M, N and K]

VLEN-Scalable Design
VLEN (Vector Length) depends on the specific VPU implementation.
Conditions for the illustration below: F32*F32-F32, LMUL 1

VLEN    M    N
128     2    2
256     2    4
512     2    8
1024    2   16

K, by SEW (rows) and LMUL (columns):
SEW        LMUL 1   LMUL 2   LMUL 4   LMUL 8
I8-I32        8       16       32       64
F16-F32       4        8       16       32
F32-F32       2        4        8       16

LMUL-Scalable Design
LMUL (Vector Length Multiplier)
AMM supports the integer LMUL values. Fractional LMULs are not supported; boundary control is used instead.
Conditions for the illustration (the same tiling-size table as above): F32*F32-F32, VLEN 128

SEW-Scalable Design
SEW (

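Key points of the document, summarized from its content:
1. **Andes Matrix Multiplication (AMM)**: AMM is designed to optimize tiled matrix multiplication. Tiles are stored in RVV vector registers, and 2D load and store instructions speed up data movement between memory and registers.
2. **Scalability**: The AMM design scales across VLEN, LMUL and SEW to fit different VPU implementations.
3. **Tiling size**: The document discusses how to choose the optimal tiling size (M, N, K) to maximize performance, and how to generate code that works across different VLENs.
4. **IREE framework**: AMM code is generated with the IREE framework, including handling of the LLM prefill and decode stages.
5. **Performance**: AMM achieves up to a 6.4x speedup on GEMM operations, particularly for CNN and LLM models.
6. **Future optimization**: Further optimization of non-linear operations is planned to improve AMM's whole-model speedup.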
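The tiling rules stated in the slides (M is always 2, N equals VLEN/64, K is determined by LMUL and SEW) can be sketched in a few lines of Python. This is only an illustrative software model, not Andes code: the function names `amm_tile_dims` and `tiled_matmul` are hypothetical, and the formula K = 64 × LMUL / SEW is an assumption fitted to the K values in the slide's table (8, 4 and 2 at LMUL 1 for 8-, 16- and 32-bit inputs), not a documented definition. The second function shows the generic tile-by-tile accumulation that an M×N = AMM(M×K, K×N) step corresponds to, clipping partial tiles at the matrix edges.

```python
def amm_tile_dims(vlen: int, lmul: int, input_sew: int) -> tuple[int, int, int]:
    """Return (M, N, K) for a fully tiled case.

    M and N follow the rules stated in the slides. The K formula is an
    ASSUMPTION inferred from the slide's table, not an official definition.
    input_sew is the bit width of the input elements (8, 16 or 32).
    """
    if vlen <= 0 or vlen % 64 != 0:
        raise ValueError("this sketch expects VLEN to be a positive multiple of 64")
    if lmul not in (1, 2, 4, 8):
        raise ValueError("only integer LMUL values are supported")
    if input_sew not in (8, 16, 32):
        raise ValueError("this sketch covers 8-, 16- and 32-bit inputs only")
    m = 2                          # "M is always 2"
    n = vlen // 64                 # "N equals VLEN/64"
    k = (64 * lmul) // input_sew   # assumed: fits every K entry in the table
    return m, n, k


def tiled_matmul(a, b, m_t, n_t, k_t):
    """Plain-Python tiled GEMM: C += A_tile @ B_tile, one tile at a time.

    A generic reference loop nest, not a model of the AMM instructions.
    """
    rows, inner, cols = len(a), len(b), len(b[0])
    c = [[0] * cols for _ in range(rows)]
    for i0 in range(0, rows, m_t):
        for j0 in range(0, cols, n_t):
            for k0 in range(0, inner, k_t):
                # One tile update; min() clips partial tiles at the edges.
                for i in range(i0, min(i0 + m_t, rows)):
                    for j in range(j0, min(j0 + n_t, cols)):
                        for k in range(k0, min(k0 + k_t, inner)):
                            c[i][j] += a[i][k] * b[k][j]
    return c


if __name__ == "__main__":
    m_t, n_t, k_t = amm_tile_dims(vlen=128, lmul=1, input_sew=32)
    a = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
    b = [[1, 0], [0, 1], [1, 1]]
    print((m_t, n_t, k_t), tiled_matmul(a, b, m_t, n_t, k_t))
```

Under these assumptions, K is independent of VLEN, so a wider vector unit only widens the tile along N, while a larger LMUL only deepens it along K.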