2024 SNIA. All Rights Reserved.

AI: Pushing Infra boundaries
Memory is a key factor
Presented by Manoj Wadekar, Meta
AI Systems Technologist

Meta Community Statistics
- Approximately 3.98B people use at least one of Meta services monthly
- Family Daily active users: 3.19B
Ref: Meta 4Q23 Results

AI Use Cases at Meta
- Ranking and Recommendations
  - Personalized Recommendations
  - Deep Learning Recommendation Models (DLRM)
  - Training and Inference
- Generative AI: Large Language Models and more
  - Llama2: open access to LLMs for research and commercial use
  - Training and Inference (Prefill and Decode)

AI Challenging DC Infra

AI needs for DC Infra
- CPU-centric Scale-out applications
  - Millions of small stateless applications
  - Failure handling through redundancy
  - Scale performance through large number of nodes
- Accelerator-centric AI Apps
  - AI job spread across 1000s of GPUs
  - Failure penalty of large job restart
  - Performance scaling depends on all the components in the cluster (GPU/Accel, memory, network)

AI Jobs: Scaling the performance
[Figure: a grid of GPUs (labeled GPU0, GPU4, GPU8, GPU12, GPU16, GPU20, GPU24, GPU28) partitioned along three dimensions: Pipeline Parallel, Tensor/Context Parallel, and Data Parallel]
- Scale-Out Network (High Bandwidth)
- Scale-Up Network (Highest Bandwidth, lowest latency)

Diversity of AI system requirements
- Difficult to serve all classes of models with a single system design point
- AI use cases are pushing all the design points through software/hardware co-design
- Need for innovation in all the design points: compute, network, memory, packaging, connectivity, coolin