LLM Inference Performance Projection
Mohan J Kumar, Perpetual Intelligence
Edmund Song, Intel

Agenda
- AI Market Trends
- AI Refresher: Training vs. Inference; Types of Parallelism
- MESA LLM Inference Performance Projection: Overview and Examples; Unique Attributes
- Summary and Call to Action

AI Trends
- Generative AI revenue expected to reach $1T by 2032 [1]
- Generative AI will be 12% of technology spend by 2032 [1]
- Global AI market expected to reach $1.7T by 2030 [2]
- Global AI inference market expected to grow at an 18.4% CAGR to $133B by 2034 [3]
- Inference expected to account for 90% of AI computing by 2030
Sources: [1] Bloomberg, [2] Grand View Research, [3] Market.us

Training vs. Inference
- Training: a forward pass over a large dataset produces a prediction ("Car?"), the error is measured, and a backward pass adjusts the model's parameters to minimize that error.
- Inference: given new input data, a forward pass alone outputs a prediction ("Car").

Types of Parallelism

Data Parallelism
- Multiple full copies of the model on different GPUs or AI clusters (GPU0-GPU3 in the illustration)
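The training-versus-inference distinction above (forward pass, error, backward pass, parameter update vs. forward pass only) can be sketched with a hypothetical one-parameter model; all names and values here are illustrative, not from the talk:

```python
# Illustrative one-parameter model: predict y = w * x.
def forward(w, x):
    return w * x                      # forward pass: produce a prediction

def train_step(w, x, y_true, lr=0.1):
    y_pred = forward(w, x)            # forward pass
    error = y_pred - y_true           # measure the error
    grad = error * x                  # backward pass: gradient of error^2 / 2 w.r.t. w
    return w - lr * grad              # adjust the parameter to reduce the error

# Training: repeat forward + backward over the dataset (true relation is y = 2x).
w = 0.0
for x, y in [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0)] * 20:
    w = train_step(w, x, y)

# Inference: forward pass only, on new input.
prediction = forward(w, 4.0)          # w has converged near 2.0, so prediction is ~8.0
```

Real LLM training replaces the scalar multiply with billions of parameters and the squared error with a token-prediction loss, but the forward/backward split is the same.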
- Increases overall throughput by processing multiple requests in parallel

Sequence Parallelism
- A long sequence is split across multiple GPUs or AI clusters (illustration: the input sequence split into Seq1-Seq4 across GPU0-GPU3)
- More commonly used in inference
- Allows handling long sequences without running out of memory
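The two schemes above can be sketched with plain Python lists standing in for devices; the function names and the round-robin assignment are illustrative, not a real framework API:

```python
# Data parallelism (sketch): each "device" holds a full model copy
# and serves its own share of the incoming requests.
def data_parallel_split(requests, n_devices):
    return [requests[d::n_devices] for d in range(n_devices)]  # round-robin assignment

# Sequence parallelism (sketch): one long token sequence is split into
# contiguous chunks so no single device must hold the whole thing.
def sequence_parallel_split(tokens, n_devices):
    chunk = -(-len(tokens) // n_devices)  # ceiling division
    return [tokens[d * chunk:(d + 1) * chunk] for d in range(n_devices)]

shards = data_parallel_split(["req0", "req1", "req2", "req3", "req4"], n_devices=4)
# GPU0 serves ["req0", "req4"]; GPU1-GPU3 serve one request each.

chunks = sequence_parallel_split(list(range(1000)), n_devices=4)
# Each of the 4 devices holds a 250-token chunk of the 1000-token sequence.
```

Note the different payoff: data parallelism raises throughput (more requests at once), while sequence parallelism raises the sequence length a fixed memory budget can handle.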
Types of Parallelism

Tensor Parallelism
- The model is split across multiple GPUs or AI clusters (e.g., split across 4 GPUs in the illustration)
- Supports large models that do not fit within the memory constraints of a single unit (GPU or AI cluster)

Pipeline Parallelism
- Model layers are split across multiple GPUs or AI clusters (e.g., 4 layers split across GPU0-GPU3 in the illustration)
- Better utilization of AI hardware

Types of Parallelism

Expert Parallelism
- Allows spreading experts across multiple GPUs, AI accelerators, or AI clusters
- Activates only a subset of experts per input, avoiding redun
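A toy sketch of the routing mechanism behind expert parallelism: a gate scores the experts and only the top-k run for a given input, so the experts can be spread across devices with most of them idle per token. The expert functions, scores, and top-2 choice here are all illustrative assumptions:

```python
# Four toy "experts"; in a real MoE model each would be a feed-forward block,
# potentially living on a different GPU (expert parallelism).
experts = [lambda x, m=m: x * m for m in (1, 10, 100, 1000)]

def route(scores, k=2):
    # Gating: pick the indices of the k highest-scoring experts.
    return sorted(range(len(scores)), key=lambda e: scores[e], reverse=True)[:k]

def moe_forward(x, scores, k=2):
    active = route(scores, k)                        # only a subset of experts is activated
    return sum(experts[e](x) for e in active) / k    # combine the active experts' outputs

out = moe_forward(1.0, scores=[0.1, 0.9, 0.3, 0.5])  # experts 1 and 3 fire: (10 + 1000) / 2
```

Because only k of the n experts execute per input, compute per token stays roughly constant even as total parameter count grows with n.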