#page#
CONTENT
- What's CUDA Graph
- How to Use CUDA Graph
- Launch Overhead in TensorFlow
- Integrate CUDA Graph into TensorFlow
- Performance
#page#
WHAT'S CUDA GRAPH
What Problem CUDA Graph Solves
Reduce Launch Overheads
[Figure: graph launch diagram; node labels B, D]
#page#
WHAT'S CUDA GRAPH
Stream Launch vs Graph Launch
Stream Launch
[Figure: stream launch pipeline. Stages: Stream Queues, Grid Management, Execution. Block A0 on SM0, Block A1 on SM1; Grid Completion]
https:/dem/gtc/2020/video/s21760
#page#
WHAT'S CUDA GRAPH
Stream Launch vs Graph Launch
Graph Launch (Pre-Ampere)
[Figure: CUDA Graph launch pipeline. Stages: Stream Queues, Grid Management, Execution. Block A0 on SM0, Block A1 on SM1; cudaGraphLaunch(g1, s1); Other Dependencies; Grid Completion]
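The two launch paths above can be sketched side by side. This is a minimal illustration, not code from the talk: the kernel `bump`, the helpers `run_chain_with_stream` and `run_chain_with_graph`, and the `CHECK` macro are hypothetical names, and the kernel increments a counter (rather than being truly empty) only so the result can be verified. It assumes CUDA 11.4 or newer for `cudaGraphInstantiateWithFlags`.

```cuda
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// Abort with a message if a CUDA runtime call fails.
#define CHECK(call)                                               \
  do {                                                            \
    cudaError_t err_ = (call);                                    \
    if (err_ != cudaSuccess) {                                    \
      std::fprintf(stderr, "%s:%d: %s\n", __FILE__, __LINE__,     \
                   cudaGetErrorString(err_));                     \
      std::exit(1);                                               \
    }                                                             \
  } while (0)

// Near-empty kernel: a single thread bumps a counter.
__global__ void bump(int* counter) { *counter += 1; }

// Stream launch: the host issues one launch call per kernel.
int run_chain_with_stream(int nodes, int iters) {
  int* d_counter = nullptr;
  int h_counter = 0;
  cudaStream_t s1;
  CHECK(cudaStreamCreate(&s1));
  CHECK(cudaMalloc(&d_counter, sizeof(int)));
  CHECK(cudaMemsetAsync(d_counter, 0, sizeof(int), s1));
  for (int it = 0; it < iters; ++it)
    for (int n = 0; n < nodes; ++n) bump<<<1, 1, 0, s1>>>(d_counter);
  CHECK(cudaGetLastError());
  CHECK(cudaStreamSynchronize(s1));
  CHECK(cudaMemcpy(&h_counter, d_counter, sizeof(int), cudaMemcpyDeviceToHost));
  CHECK(cudaFree(d_counter));
  CHECK(cudaStreamDestroy(s1));
  return h_counter;
}

// Graph launch: capture the chain once, then issue one launch call per iteration.
int run_chain_with_graph(int nodes, int iters) {
  int* d_counter = nullptr;
  int h_counter = 0;
  cudaStream_t s1;
  cudaGraph_t graph;
  cudaGraphExec_t g1;
  CHECK(cudaStreamCreate(&s1));
  CHECK(cudaMalloc(&d_counter, sizeof(int)));
  CHECK(cudaMemsetAsync(d_counter, 0, sizeof(int), s1));

  // Kernels launched during capture are recorded as graph nodes, not executed.
  CHECK(cudaStreamBeginCapture(s1, cudaStreamCaptureModeGlobal));
  for (int n = 0; n < nodes; ++n) bump<<<1, 1, 0, s1>>>(d_counter);
  CHECK(cudaStreamEndCapture(s1, &graph));

  CHECK(cudaGraphInstantiateWithFlags(&g1, graph, 0));
  for (int it = 0; it < iters; ++it) CHECK(cudaGraphLaunch(g1, s1));
  CHECK(cudaStreamSynchronize(s1));

  CHECK(cudaMemcpy(&h_counter, d_counter, sizeof(int), cudaMemcpyDeviceToHost));
  CHECK(cudaGraphExecDestroy(g1));
  CHECK(cudaGraphDestroy(graph));
  CHECK(cudaFree(d_counter));
  CHECK(cudaStreamDestroy(s1));
  return h_counter;
}

int main() {
  const int nodes = 32, iters = 10;  // 32 nodes, matching the graph size used in the talk
  std::printf("stream launch: %d kernel runs\n", run_chain_with_stream(nodes, iters));
  std::printf("graph launch:  %d kernel runs\n", run_chain_with_graph(nodes, iters));
  return 0;
}
```

Both helpers should report `nodes * iters` kernel runs. The difference lies in how many launch calls the host makes: `nodes * iters` in the stream version versus `iters` in the graph version. Wrapping the two launch loops in host timers is one way to observe the gap.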
#page#
WHAT'S CUDA GRAPH
Stream Launch vs Graph Launch
Launch overhead comparison (test using empty kernel)
A100 GPU
* Graph with 32 nodes
Columns: Pattern | graph host (ms) | graph device (ms) | stream host (ms) | stream device (ms) | host speedup | device speedup
Rows: straight line, two branches, fork and join
Values (cell boundaries lost in extraction, digits kept verbatim): 4.4314.72.21 | 28.1265.2560.672 | 21.85.43.1715.4769.2583.4693.7521.97.63 | 4.2821.32161.79
#page#
HOW TO USE CUDA GRAPH
- Define a CUDA Graph
  - Stream Capture
  - CUDA Graph API
- Instantiate a CUDA Graph
  - Call cudaGraphInstantiate()
- Launch the CUDA Graph executable instance
  - Call cudaGraphLaunch(...)
https:/docCUDARTGRAPH.html
#page#
HOW TO USE CUDA GRAPH
Define a CUDA Graph

Stream Capture
    cudaStreamBeginCapture(stream1);
    ...
    cudaEventRecord(event1, stream1);
    cudaStreamWaitEvent(stream2, event1);
    ...
    cudaEventRecord(event2, stream2);
    ...
    // End capture in the origin stream
    ...

Graph APIs
    // Create the graph - it starts out empty
    cudaGraphCreate(&graph, 0);
    ...
    cudaGraphAddKernelNode(&a, graph, NULL, 0, &nodeParams);
    ...
    // Now set up dependencies on each node
    cudaGraphAddDependencies(graph, &a, &b, 1);  // A->B
    cudaGraphAddDependencies(graph, &a, &c, 1);  // A->C
    cudaGraphAddDependencies(graph, &b, &d, 1);  // B->D
    cudaGraphAddDependencies(graph, &c, &d, 1);  // C->D
#page#
HOW TO USE CUDA GRAPH
CUDA Graph Node Types
- Kernel
- CPU function call
- Memory copy
- Memset
- Empty node
- Child graph
#page#
LAUNCH OVERHEAD IN TENSORFLOW
TF op Scheduling
Node A is ready to run
ready_nodes = A,Call