
Accelerating GenAI Workloads by Enabling RISC-V Microkernel Support in IREE.pdf

Uploaded by: c** | ID: 955327 | 2025-10-27 | 17 pages | 1.73 MB

Accelerating GenAI Workloads by Enabling RISC-V Microkernel Support in IREE
Adeel Ahmad, Ahmad Tameem, Nouman Amir, Bilal Zafar, Saad Bin Nasir
10xEngineers

Outline
- Generative AI workloads
- IREE compilation with custom microkernels (ukernels)
- Custom RISC-V matrix multiplication ukernels: implementation
- Kernel- and model-level results
- Summary

Generative AI Workloads
- Conversational LLMs
- Generative AI workloads are dominated by transformer-based auto-regressive large language models (LLMs)
- Text/image/code generation, chatbots, content writing, video generation and other common use cases heavily employ LLMs
- Matrix-matrix and matrix-vector multiplications dominate these workloads
Source: ChatGPT

IREE Compilation with Custom Kernels
- Open-source, MLIR-based compiler and runtime with direct code generation
- Host/device programming model with multiple target architectures through a hardware abstraction layer (HAL)
- Stack is mostly architecture agnostic, a step towards heterogeneous compilation
- Host does scheduling; VM bytecode for runtime portability
- Device-side codegen; upstream IREE has RVV codegen through LLVM

Microkernels
- Intended to prevent the dichotomy between compiler and kernels
- Perform arithmetic but no memory allocation
- Standalone development and unit testing in C leads to quicker development

Matrix Multiplication ukernel (mmt4d) Compilation in IREE
- For x86_64 and ARM64 architectures, IREE leverages the linalg dialect's mmt4d op for matrix multiplication
- The mmt4d op is meticulously optimized to exploit hardware-specific vector instructions and cache hierarchies

Figure: ukernel compilation flow (only relevant parts of the MLIR and pass pipeline are shown)
- Passes: MaterializeHostEncodingPass, CPULowerToUKernelsPass, LowerUKernelOpsToCallsPass, ConvertToLLVMPass
- Rewrites: matmul -> pack + mmt4d + unpack; mmt4d -> iree_uk_mmt4d ukernel call
- Inputs: matmul.mlir; precompiled ukernel bitcode (ukernel_bitcode_*.bc); Static l
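
To make the matmul -> pack + mmt4d + unpack rewrite concrete, here is a minimal pure-Python sketch of the reference semantics. It is an illustration only, not IREE's implementation: the function names, the tile sizes (`m0`, `n0`, `k0`), and the use of Python floats in place of real element types are all assumptions made for the example. The tiled layouts follow the mmt4d convention of a `[M1][K1][M0][K0]` left-hand side, a transposed `[N1][K1][N0][K0]` right-hand side, and a `[M1][N1][M0][N0]` accumulator.

```python
def pack_lhs(a, m0, k0):
    """Tile an M x K matrix into [M1][K1][M0][K0]; dims must divide evenly."""
    m, k = len(a), len(a[0])
    assert m % m0 == 0 and k % k0 == 0
    return [[[[a[i1 * m0 + i0][j1 * k0 + j0] for j0 in range(k0)]
              for i0 in range(m0)]
             for j1 in range(k // k0)]
            for i1 in range(m // m0)]


def pack_rhs_transposed(b, n0, k0):
    """Tile a K x N matrix into [N1][K1][N0][K0], i.e. tiles of B transposed."""
    k, n = len(b), len(b[0])
    assert k % k0 == 0 and n % n0 == 0
    return [[[[b[k1 * k0 + kk][n1 * n0 + nn] for kk in range(k0)]
              for nn in range(n0)]
             for k1 in range(k // k0)]
            for n1 in range(n // n0)]


def mmt4d(lhs, rhs):
    """acc[m1][n1][m0][n0] += lhs[m1][k1][m0][k0] * rhs[n1][k1][n0][k0]."""
    M1, K1, M0, K0 = len(lhs), len(lhs[0]), len(lhs[0][0]), len(lhs[0][0][0])
    N1, N0 = len(rhs), len(rhs[0][0])
    acc = [[[[0.0] * N0 for _ in range(M0)] for _ in range(N1)]
           for _ in range(M1)]
    for m1 in range(M1):
        for n1 in range(N1):
            for k1 in range(K1):
                # This innermost tile product is the part a ukernel would
                # implement with hardware vector instructions.
                for m0 in range(M0):
                    for n0 in range(N0):
                        for k0 in range(K0):
                            acc[m1][n1][m0][n0] += (
                                lhs[m1][k1][m0][k0] * rhs[n1][k1][n0][k0])
    return acc


def unpack(acc):
    """Flatten a [M1][N1][M0][N0] accumulator back to an M x N matrix."""
    M1, N1, M0, N0 = len(acc), len(acc[0]), len(acc[0][0]), len(acc[0][0][0])
    out = [[0.0] * (N1 * N0) for _ in range(M1 * M0)]
    for m1 in range(M1):
        for n1 in range(N1):
            for m0 in range(M0):
                for n0 in range(N0):
                    out[m1 * M0 + m0][n1 * N0 + n0] = acc[m1][n1][m0][n0]
    return out


def matmul_via_mmt4d(a, b, m0=2, n0=2, k0=2):
    """matmul expressed as pack + mmt4d + unpack (illustrative tile sizes)."""
    return unpack(mmt4d(pack_lhs(a, m0, k0), pack_rhs_transposed(b, n0, k0)))


def matmul_ref(a, b):
    """Plain triple-loop matmul used only to check the tiled version."""
    return [[sum(a[i][p] * b[p][j] for p in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]
```

Calling `matmul_via_mmt4d(a, b)` on matrices whose dimensions are multiples of the tile sizes should agree with `matmul_ref(a, b)`. A real pipeline would also pad shapes that do not divide evenly, which this sketch omits.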

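Based on the report, the main points are summarized as follows:
- **Generative AI workloads**: Transformer-based large language models (LLMs) dominate generative AI workloads such as text, image, and code generation and chatbots.
- **IREE compilation with custom microkernels**: IREE is an MLIR-based compiler and runtime that supports multiple target architectures and stays largely architecture agnostic through a hardware abstraction layer. Microkernels perform the arithmetic and avoid a dichotomy between compiler and kernels.
- **RISC-V matrix multiplication microkernels**: RISC-V ukernels were implemented for F16 x F16 -> F32, improving matrix multiplication performance.
- **Performance gains**: In the LLM prefill and decode phases, the custom matrix multiplication microkernels deliver roughly 2x and 50x single-threaded runtime speedups, respectively.
- **Benchmarks**: Benchmarks on the Milk-V Jupiter board show that the pack operation accounts for more than 60% of compute time in the prefill phase, and that compile-time const-eval optimization can eliminate this cost.
- **Summary**: Introducing microkernels significantly improves the performance of generative AI workloads on RISC-V. Future work aims to encourage more open-source contributions and collaboration to optimize RISC-V-based ML kernels.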
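
The point about removing pack cost at compile time can be illustrated with a small analogy. The sketch below assumes a hypothetical `PrepackedLinear` class that packs a constant weight matrix once at construction, standing in for what const-eval does ahead of time, so that only the activations are packed on each call. The class, its `weight_packs` counter, and the tile sizes are invented for illustration and are not IREE APIs.

```python
def pack_tiles(mat, r0, c0):
    """Tile a row-major matrix into [R1][C1][R0][C0]; dims must divide evenly."""
    rows, cols = len(mat), len(mat[0])
    assert rows % r0 == 0 and cols % c0 == 0
    return [[[[mat[r1 * r0 + r][c1 * c0 + c] for c in range(c0)]
              for r in range(r0)]
             for c1 in range(cols // c0)]
            for r1 in range(rows // r0)]


class PrepackedLinear:
    """Hypothetical layer: the constant weight is packed once, up front."""

    def __init__(self, weight, m0=2, n0=2, k0=2):
        self.m0, self.n0, self.k0 = m0, n0, k0
        self.n = len(weight[0])
        # Transpose K x N -> N x K, then tile into [N1][K1][N0][K0].
        weight_t = [list(col) for col in zip(*weight)]
        self.rhs = pack_tiles(weight_t, n0, k0)
        self.weight_packs = 1  # the weight is never packed again

    def __call__(self, x):
        # Only the activations are packed at run time: [M1][K1][M0][K0].
        lhs = pack_tiles(x, self.m0, self.k0)
        out = [[0.0] * self.n for _ in range(len(x))]
        for m1, lhs_row in enumerate(lhs):
            for n1, rhs_row in enumerate(self.rhs):
                for lhs_tile, rhs_tile in zip(lhs_row, rhs_row):  # over K1
                    for i, lhs_vec in enumerate(lhs_tile):
                        for j, rhs_vec in enumerate(rhs_tile):
                            out[m1 * self.m0 + i][n1 * self.n0 + j] += sum(
                                a * b for a, b in zip(lhs_vec, rhs_vec))
        return out
```

Constructing `layer = PrepackedLinear(w)` once and calling `layer(x)` repeatedly reuses the same packed weights, which mirrors why hoisting the pack of constant operands out of the runtime path removes that cost from each inference step.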