The Case for Computational Offload to CXL Memory Devices for AI Workloads
Jon Hermes, Staff Software Engineer, Arm Ltd.
ARTIFICIAL INTELLIGENCE (AI)

Vision for CXL and Disaggregated Compute
We imagine a future in which the use of dedicated far memory pools is common.
  o How do we utilize these pools efficiently?
  o How do we overcome the latency and bandwidth limitations of CXL-attached memory?
The rise of single-socket performance relative to interconnect performance (socket-to-socket, socket-to-device) is exacerbating the harm to memory-sensitive workloads.
We must support strategies that perform well when there is no option to run in local memory without the use of memory pools.
It should be possible both to mitigate performance losses and to make use of CXL memory pools by dispatching targeted compute tasks to the pool.

Modeling CXL without CXL Hardware
Two primary test systems:
  o Platform TX (256 GB memory, dual-socket 56-core TX2 99xx)
  o Platform N1 (512 GB memory, dual-socket 120-core Arm N1)
Using the same strategy as Azure's Pond paper:
  o Force cross-socket NUMA access to impose latency and limit bandwidth.
  o The far socket's clock rate is slowed to better emulate a real device.
Real CXL hardware should be no better in terms of latency or bandwidth than this emulation.
[Figure: ideal system setup (CPU-DRAM plus a CXL bridge to a CXL device) vs. emulation setup (Socket 0 as the local node, Socket 1 as the remote node, joined by the socket interconnect)]

Applications have various levels of sensitivity to memory latency and bandwidth limitations and are not harmed uniformly. As a quick test, we broke out just one function (a sparse matrix multiply) from one of the models we wanted to test and saw these results:

We first looked at offloading individual memory-sensitive functions within various AI workloads, then looked toward generalization
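The Pond-style emulation described above can be driven with standard Linux NUMA tooling. A minimal sketch of building such an invocation (the workload command is a hypothetical placeholder; it assumes a dual-socket Linux machine with `numactl` installed):

```python
# Sketch: run a workload on the near socket's cores while forcing every
# allocation onto the far socket's memory, emulating a CXL-attached pool.
# The workload command below is a hypothetical placeholder.
import shlex

def far_memory_cmd(workload: str, cpu_node: int = 0, mem_node: int = 1) -> list:
    """Build a numactl command line: execute on `cpu_node`, allocate on `mem_node`."""
    return [
        "numactl",
        f"--cpunodebind={cpu_node}",  # pin threads to the near socket's cores
        f"--membind={mem_node}",      # force allocations to the far socket
        *shlex.split(workload),
    ]

if __name__ == "__main__":
    print(" ".join(far_memory_cmd("./run_model --batch 8")))
```

Slowing the far socket's clock rate, the second half of the emulation, would be done separately (e.g. via the cpufreq interface for the remote socket's cores); this sketch covers only the NUMA-binding half.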
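The quick-test kernel, a sparse matrix multiply, can be isolated as a stand-alone routine. A minimal CSR-format sketch in pure NumPy (the CSR layout and matrix shapes here are illustrative, not the actual operator broken out of the model):

```python
# Sketch: a CSR sparse-times-dense multiply, the kind of memory-bound
# kernel that is a candidate for offload to the memory pool.
# The example matrices are illustrative, not taken from the tested model.
import numpy as np

def csr_spmm(indptr, indices, data, dense):
    """Compute A @ dense, where A is given in CSR form (indptr, indices, data)."""
    rows = len(indptr) - 1
    out = np.zeros((rows, dense.shape[1]), dtype=dense.dtype)
    for i in range(rows):
        lo, hi = indptr[i], indptr[i + 1]
        if lo < hi:
            # Gather only the rows of `dense` touched by row i's nonzeros;
            # this irregular access pattern is what makes the kernel
            # sensitive to memory latency and bandwidth.
            out[i] = data[lo:hi] @ dense[indices[lo:hi]]
    return out

if __name__ == "__main__":
    # A = [[1, 0], [0, 2]] in CSR form, multiplied by an all-ones matrix
    indptr, indices = [0, 1, 2], [0, 1]
    data = np.array([1.0, 2.0])
    print(csr_spmm(indptr, indices, data, np.ones((2, 2))))
```

Timing such a kernel with its operands placed in local versus far memory gives a per-function sensitivity measurement, which is the kind of result the quick test produced.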