Dhabaleswar K. (DK) Panda
The Ohio State University
Accelerating HPC and AI Applications by Offloading Computation and Communication to SmartNICs

Trends in Modern HPC Clusters
- Accelerators (such as GPUs)
  - High compute power
  - High peak memory bandwidth (H100: 900 GB/s NVLINK)
- High Performance Interconnects
  - InfiniBand (DPUs), Omni-Path, EFA

[...]

Concept of Non-blocking Collectives
- Better Performance
- Catch: Who will progress communication? (Can we dedicate this task to DPU cores?)
[Figure: four application processes, each paired with a communication support entity; labels: Computation, Communication, Schedule Operation, Check if Complete]

Network Based Computing Laboratory, SIAM-PP (Mar 24)

Major Opportunity and Benefits
- Overlap of Computation with Communication
- Reducing the overall application execution time

Major Challenges
- Suitable Host-DPU-Network communication mechanisms
- Efficient Non-blocking Collective Algorithm offload
- Load balancing across ARM cores to take care of the offloading tasks
- Re-designing applications/middleware using the offloaded strategies to extract higher performance benefits

MVAPICH2-DPU Library Release
- Supports all features available with the MVAPICH2 release (http://mvapich.cse.ohio-state.edu)
- Novel framework to offload non-blocking collectives to DPU
- Offloads non-blocking Alltoall (MPI_Ialltoall) to DPU
- Offloads non-blocking Broadcast (MPI_Ibcast) to DPU
- Available from X-ScaleSolutions as a commercial product, please contact contactusx-.

Total Execution Time with osu_Ialltoall (32 nodes), BF-2
Benefits in Tota