NVIDIA GPU ACCELERATION IN PYTHON
Dominic Wang | Solution Architect
GTC CHINA

AGENDA
Getting Started: background and testing setup
Numba Code: step through Numba modifications
CuPy Code: step through CuPy modifications

AVERAGE USERS
[Chart: number of active developers by language, comparing Python and C/C++; source URL truncated in the original.]
WHY ARE WE HERE?
"I am a Python developer but really need the performance of CUDA C++."
"I have custom arithmetic, e.g. in SciPy, that doesn't exist in a GPU-accelerated package such as CuPy."
"I have custom Numba kernels and I'm nervous about porting code to CuPy's RawKernel."
"Are there any improvements that can be made to my current Numba/CuPy code?"
GETTING STARTED
Three ways onto the GPU:

Drop-in GPU Library Replacements
NumPy -> CuPy
Pandas -> cuDF
Scikit-Learn -> cuML
NetworkX -> cuGraph
Pros: trivial code change; "free" performance (a swap is sketched below)
Cons: limited control

Custom Numba CUDA Kernels
Leverage JIT compilation and Numba's CUDA support to quickly build and test custom CUDA kernels with a Pythonic API.
Pros: quickly build custom features; support multiple dtypes
Cons: JIT compilation overhead; potentially sub-optimal; excess register pressure

Custom Raw CUDA Kernels
To match native CUDA speeds, wrap raw CUDA kernels in CuPy; precompile and cache the kernel to avoid JIT overhead.
Pros: matches CUDA C++ speed; no excess SW layer
Cons: boilerplate code; limited debugging tools

See: "GPU Accelerating SciPy Signal with Numba and CuPy", SciPy 2020.
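A minimal sketch of the first approach, the drop-in swap. This example is illustrative only; the operation and array size are assumptions, not from the talk:

import numpy as np
import cupy as cp

a = np.random.rand(1 << 20).astype(np.float32)

# CPU: NumPy
cpu_sorted = np.sort(a)

# GPU: CuPy mirrors the NumPy API; only the array module changes
d_a = cp.asarray(a)                  # host -> device copy
gpu_sorted = cp.sort(d_a)

np.testing.assert_array_equal(cpu_sorted, cp.asnumpy(gpu_sorted))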
TESTING
Find and run the code (repository URL truncated in the original):

conda env create -f gtc_fall.yml
bash test_script.sh

Input size = 2^10; output size = 2^20.
Performed on a DGX-1: Tesla V100-SXM2-32GB, Intel Xeon CPU E5-2698 v4 @ 2.20GHz.

Setting the GPU:

sudo nvidia-smi -ac 877,1530 -i 0   # Set clocks
sudo nvidia-smi -pl 250 -i 0        # Set power levels

PYTHON CODE
SciPy (Lomb-Scargle)

for i in range(freqs.shape[0]):
    xc = 0.
    xs = 0.
    cc = 0.
    ss = 0.
    cs = 0.
    for j in range(x.shape[0]):
        c = cos(freqs[i] * x[j])
        s = sin(freqs[i] * x[j])
        xc += y[j] * c
        xs += y[j] * s
        cc += c * c
        ss += s * s
        cs += c * s
    tau = atan2(2 * cs, cc - ss) / (2 * freqs[i])
    c_tau = cos(freqs[i] * tau)
    s_tau = sin(freqs[i] * tau)
    c_tau2 = c_tau * c_tau
    s_tau2 = s_tau * s_tau
    cs_tau = 2 * c_tau * s_tau

    pgram[i] = 0.5 * (((c_tau * xc + s_tau * xs)**2 /
                       (c_tau2 * cc + cs_tau * cs + s_tau2 * ss)) +
                      ((c_tau * xs - s_tau * xc)**2 /
                       (c_tau2 * ss - cs_tau * cs + s_tau2 * cc)))

return pgram
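The test harness itself is not shown on the slides. A plausible minimal setup matching the stated sizes; the waveform, seed, and frequency range are assumptions:

import numpy as np
from numba import cuda
from scipy import signal

rng = np.random.default_rng(42)
x = np.sort(rng.random(2**10) * 10.0)      # 2**10 sample times
y = np.cos(2 * np.pi * 0.25 * x)           # observed signal
f = np.linspace(0.01, 10.0, 2**20)         # 2**20 trial angular frequencies

cpu_pgram = signal.lombscargle(x, y, f)    # CPU baseline

# Device copies used by the GPU versions that follow
d_x = cuda.to_device(x)
d_y = cuda.to_device(y)
d_f = cuda.to_device(f)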
PROFILING
NVTX and Nsight Systems

from cupy import prof

# Run baseline with scipy.signal.lombscargle
with prof.time_range("scipy_lombscargle", 0):
    cpu_lombscargle = signal.lombscargle(x, y, f)

with prof.time_range("numba_lombscargle", 1):
    gpu_lombscargle = lombscargle(d_x, d_y, d_f)

# Copy result to host
gpu_lombscargle = gpu_lombscargle.copy_to_host()

# Compare results
np.testing.assert_allclose(cpu_lombscargle, gpu_lombscargle, 1e-3)

# Run multiple passes to get average
for _ in range(loops):
    with prof.time_range("numba_lombscargle_loop", 2):
        gpu_lombscargle = lombscargle(d_x, d_y, d_f)
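When a full Nsight Systems trace is not needed, CUDA events give quick wall-clock numbers from Python. This helper is not from the talk; a minimal sketch using CuPy's event API:

import cupy as cp

def gpu_time_ms(fn, *args, loops=100):
    """Average GPU time of fn(*args) in ms, measured with CUDA events."""
    start, end = cp.cuda.Event(), cp.cuda.Event()
    fn(*args)                      # warm up / trigger JIT compilation
    start.record()
    for _ in range(loops):
        fn(*args)
    end.record()
    end.synchronize()
    return cp.cuda.get_elapsed_time(start, end) / loops

# e.g. avg_ms = gpu_time_ms(lombscargle, d_x, d_y, d_f)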
NUMBA CUSTOM KERNEL

NUMBA CODE
Baseline

def _numba_lombscargle(x, y, freqs, pgram, y_dot):
    # New code: grid-stride setup
    F = cuda.grid(1)
    strideF = cuda.gridsize(1)

    # New code: normalization factor
    if not y_dot[0]:
        yD = 1.0
    else:
        yD = 2.0 / y_dot[0]

    for i in range(F, freqs.shape[0], strideF):
        # Copy data to registers
        freq = freqs[i]

        xc = 0.0
        xs = 0.0
        cc = 0.0
        ss = 0.0
        cs = 0.0

        for j in range(x.shape[0]):
            c = cos(freq * x[j])
            s = sin(freq * x[j])
            xc += y[j] * c
            xs += y[j] * s
            cc += c * c
            ss += s * s
            cs += c * s

        tau = atan2(2.0 * cs, cc - ss) / (2.0 * freq)
        c_tau = cos(freq * tau)
        s_tau = sin(freq * tau)
        c_tau2 = c_tau * c_tau
        s_tau2 = s_tau * s_tau
        cs_tau = 2.0 * c_tau * s_tau
        # ... (periodogram write continues as in the SciPy version,
        #      scaled by yD; truncated on the slide)

NUMBA CODE
Version 1

def _lombscargle(x, y, freqs, pgram, y_dot):
    # Allow for multiple kernels based on data type
    if pgram.dtype == 'float32':
        numba_type = float32
    elif pgram.dtype == 'float64':
        numba_type = float64

    # Determine number of blocks for grid-stride looping
    device_id = cp.cuda.Device()
    numSM = device_id.attributes["MultiProcessorCount"]
    threadsperblock = (128,)
    blockspergrid = (numSM * 20,)

    # Compile Numba kernel
    sig = _numba_lombscargle_signature(numba_type)
    kernel = cuda.jit(sig)(_numba_lombscargle)

    # Launch Numba kernel
    kernel[blockspergrid, threadsperblock](x, y, freqs, pgram, y_dot)

    # Block host until finished
    cuda.synchronize()
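The helper _numba_lombscargle_signature is called above but never shown on the slides. A plausible definition, offered as an assumption rather than the talk's actual code:

from numba import void

def _numba_lombscargle_signature(ty):
    # void(x, y, freqs, pgram, y_dot): five 1-D C-contiguous arrays of ty
    return void(ty[::1], ty[::1], ty[::1], ty[::1], ty[::1])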
NUMBA COMPARISON

[Table: First Pass (ms), Average (ms), Registers, and Speed Up, for double and single precision; rows: SciPy, Numba (Baseline), Numba (UserCache), Numba (DataType), Numba (FastMath), Numba (MaxRegisters). The SciPy baseline is roughly 3,691 ms first pass / 3,680 ms average in double precision and 3,804 ms / 3,810 ms in single precision; the remaining cell values were scrambled in extraction.]

Baseline Numba kernel; implicit casting on single precision.

NUMBA CODE
Cached Kernel

def _lombscargle(x, y, freqs, pgram, y_dot):
    if pgram.dtype == 'float32':
        numba_type = float32
    elif pgram.dtype == 'float64':
        numba_type = float64

    # Check if a compiled kernel already exists
    if str(numba_type) in _kernel_cache:
        kernel = _kernel_cache[str(numba_type)]
    else:
        sig = _numba_lombscargle_signature(numba_type)
        kernel = _kernel_cache[str(numba_type)] = cuda.jit(sig)(_numba_lombscargle)

    device_id = cp.cuda.Device()
    numSM = device_id.attributes["MultiProcessorCount"]
    threadsperblock = (128,)
    blockspergrid = (numSM * 20,)

    kernel[blockspergrid, threadsperblock](x, y, freqs, pgram, y_dot)

NUMBA COMPARISON

[Table: same layout, UserCache row added; cell values scrambled in extraction.]

Cached the compiled kernel in a user-defined dictionary, skipping Numba's compilation logic on subsequent calls.

NUMBA COMPARISON

[Table: same layout, DataType row added; cell values scrambled in extraction.]

Added data-type casting to the kernel to minimize register usage.
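The per-dtype kernel change itself did not survive extraction. A self-contained miniature of the idea (not the talk's kernel): in the float32 variant, every literal is wrapped in float32(...) so arithmetic stays in single precision; a bare 2.0 would promote intermediates to float64 and cost extra registers.

import numpy as np
from numba import cuda, float32

def _scale_32(x, out):
    start = cuda.grid(1)
    stride = cuda.gridsize(1)
    for i in range(start, x.shape[0], stride):
        # Typed literals keep the whole expression in float32
        out[i] = float32(2.0) * x[i] + float32(0.5)

kernel_32 = cuda.jit("void(float32[::1], float32[::1])")(_scale_32)

x = cuda.to_device(np.arange(1024, dtype=np.float32))
out = cuda.device_array_like(x)
kernel_32[80, 128](x, out)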
NUMBA CODE
Using -use_fast_math

def _lombscargle(x, y, freqs, pgram, y_dot):
    if pgram.dtype == 'float32':
        numba_type = float32
    elif pgram.dtype == 'float64':
        numba_type = float64

    if str(numba_type) in _kernel_cache:
        kernel = _kernel_cache[str(numba_type)]
    else:
        sig = _numba_lombscargle_signature(numba_type)
        # Add fast math flag
        if pgram.dtype == 'float32':
            kernel = _kernel_cache[str(numba_type)] = cuda.jit(
                sig, fastmath=True)(_numba_lombscargle_32)
        elif pgram.dtype == 'float64':
            kernel = _kernel_cache[str(numba_type)] = cuda.jit(
                sig, fastmath=True)(_numba_lombscargle_64)

    kernel[blockspergrid, threadsperblock](x, y, freqs, pgram, y_dot)

NUMBA COMPARISON

[Table: same layout, FastMath row added; cell values scrambled in extraction.]

Passed the -use_fast_math flag; only effective on single precision.
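A self-contained sketch (not the talk's code) of why fast math only pays off in single precision: it swaps the single-precision math functions for approximate hardware intrinsics, so results drift slightly and should be checked with a loosened tolerance.

from math import sin
import numpy as np
from numba import cuda

def _sin_kernel(x, out):
    start = cuda.grid(1)
    stride = cuda.gridsize(1)
    for i in range(start, x.shape[0], stride):
        out[i] = sin(x[i])

sig = "void(float32[::1], float32[::1])"
exact = cuda.jit(sig)(_sin_kernel)
fast = cuda.jit(sig, fastmath=True)(_sin_kernel)

x = np.linspace(0, 2 * np.pi, 1 << 20, dtype=np.float32)
d_x = cuda.to_device(x)
d_e = cuda.device_array_like(d_x)
d_f = cuda.device_array_like(d_x)
exact[160, 128](d_x, d_e)
fast[160, 128](d_x, d_f)

# Fast-math results should agree to within a loose absolute tolerance
np.testing.assert_allclose(d_e.copy_to_host(), d_f.copy_to_host(), atol=1e-5)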
NUMBA CODE
Using -max_registers

def _lombscargle(x, y, freqs, pgram, y_dot):
    if pgram.dtype == 'float32':
        numba_type = float32
    elif pgram.dtype == 'float64':
        numba_type = float64

    if str(numba_type) in _kernel_cache:
        kernel = _kernel_cache[str(numba_type)]
    else:
        sig = _numba_lombscargle_signature(numba_type)
        # Cap registers per thread; different limits per precision
        if pgram.dtype == 'float32':
            kernel = _kernel_cache[str(numba_type)] = cuda.jit(
                sig, fastmath=True, max_registers=32)(_numba_lombscargle_32)
        elif pgram.dtype == 'float64':
            kernel = _kernel_cache[str(numba_type)] = cuda.jit(
                sig, fastmath=True, max_registers=64)(_numba_lombscargle_64)

    kernel[blockspergrid, threadsperblock](x, y, freqs, pgram, y_dot)
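Why cap registers? A back-of-the-envelope occupancy check, using V100 limits and the register counts that appear in the comparison tables; the helper and exact arithmetic are illustrative assumptions (register allocation granularity is ignored):

# V100: 65,536 32-bit registers per SM, max 2,048 resident threads per SM
regs_per_sm = 65536
max_threads_per_sm = 2048

def resident_threads(regs_per_thread, threads_per_block=128):
    blocks = regs_per_sm // (regs_per_thread * threads_per_block)
    return min(blocks * threads_per_block, max_threads_per_sm)

# At 58 registers/thread only ~1,024 threads fit per SM;
# capping at 32 registers allows the full 2,048.
print(resident_threads(58))  # -> 1024
print(resident_threads(32))  # -> 2048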
CUPY RAW KERNEL

CUPY CODE
Baseline

Stored as a string, templated on the data type:

_cupy_lombscargle_src = Template("""
extern "C" __global__ void _cupy_lombscargle(
        const int x_shape,
        const int freqs_shape,
        const ${datatype} * __restrict__ x,
        const ${datatype} * __restrict__ y,
        const ${datatype} * __restrict__ freqs,
        ${datatype} * __restrict__ pgram,
        const ${datatype} * __restrict__ y_dot) {

    const int tx =
        static_cast<int>(blockIdx.x * blockDim.x + threadIdx.x);
    const int stride = static_cast<int>(blockDim.x * gridDim.x);

    // Explicitly specify data types
    ${datatype} yD;
    if (y_dot[0] == 0) {
        yD = 1.0;
    } else {
        yD = 2.0 / y_dot[0];
    }

    for (int tid = tx; tid < freqs_shape; tid += stride) {
        ${datatype} freq = freqs[tid];
        ${datatype} xc, xs, cc, ss, cs;
        ${datatype} c, s;
        // ... (body continues as in the Numba version; truncated on the slide)
""")

CUPY COMPARISON

[Table: First Pass (ms), Average (ms), Registers, and Speed Up, for double and single precision; rows: SciPy, CuPy, CuPy (UserCache), CuPy (DataType), CuPy (Fast Math), CuPy (Fatbin), CuPy (LaunchBounds); cell values scrambled in extraction.]

Baseline CuPy raw kernel; note the absence of type promotion.
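A self-contained miniature of the same pattern (a hypothetical saxpy, not the talk's kernel): render the C type into the source with string.Template, compile with RawModule, and launch with an explicit (grid, block, args) triple.

import cupy as cp
from string import Template

_saxpy_src = Template("""
extern "C" __global__ void saxpy(const int n,
                                 const ${datatype} a,
                                 const ${datatype} * __restrict__ x,
                                 ${datatype} * __restrict__ y) {
    const int tx = static_cast<int>(blockIdx.x * blockDim.x + threadIdx.x);
    const int stride = static_cast<int>(blockDim.x * gridDim.x);
    for (int tid = tx; tid < n; tid += stride) {
        y[tid] += a * x[tid];
    }
}
""")

module = cp.RawModule(code=_saxpy_src.substitute(datatype="float"),
                      options=("-std=c++11",))
saxpy = module.get_function("saxpy")

n = 1 << 20
x = cp.arange(n, dtype=cp.float32)
y = cp.zeros(n, dtype=cp.float32)
# Scalar arguments are passed as typed NumPy scalars
saxpy((80,), (128,), (cp.int32(n), cp.float32(2.0), x, y))
assert cp.allclose(y, 2 * x)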
CUPY CODE
Cached Kernel

def _lombscargle(x, y, freqs, pgram, y_dot):
    if pgram.dtype == 'float32':
        c_type = "float"
    elif pgram.dtype == 'float64':
        c_type = "double"

    # Check if a compiled kernel already exists
    if str(c_type) in _kernel_cache:
        kernel = _kernel_cache[str(c_type)]
    else:
        src = _cupy_lombscargle_src.substitute(datatype=c_type)
        module = cp.RawModule(code=src, options=("-std=c++11",))
        kernel = _kernel_cache[str(c_type)] = module.get_function("_cupy_lombscargle")

    kernel_args = (x.shape[0], freqs.shape[0], x, y, freqs, pgram, y_dot)
    kernel(blockspergrid, threadsperblock, kernel_args)

CUPY COMPARISON

[Table: same layout, UserCache row added; cell values scrambled in extraction.]

Cached the compiled kernel in a user-defined dictionary, skipping source rendering and RawModule creation on subsequent calls.
CUPY CODE
Explicit kernel per type

extern "C" __global__ void _cupy_lombscargle_float32(
        const int x_shape,
        const int freqs_shape,
        const float * __restrict__ x,
        const float * __restrict__ y,
        const float * __restrict__ freqs,
        float * __restrict__ pgram,
        const float * __restrict__ y_dot) {

    const int tx =
        static_cast<int>(blockIdx.x * blockDim.x + threadIdx.x);
    const int stride = static_cast<int>(blockDim.x * gridDim.x);

    // Explicitly specify data types: single-precision literals
    float yD;
    if (y_dot[0] == 0) {
        yD = 1.0f;
    } else {
        yD = 2.0f / y_dot[0];
    }

    for (int tid = tx; tid < freqs_shape; tid += stride) {
        float freq = freqs[tid];
        float xc, xs, cc, ss, cs;
        float c, s;
        // ... (body continues; truncated on the slide)

CUPY COMPARISON

[Table: same layout, DataType row added; cell values scrambled in extraction.]

Added data-type casting to the kernel to minimize register usage.
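One hypothetical middle ground between the templated and hand-written versions, not shown in the talk: render and compile both specializations up front, assuming the kernel name in the template is also parameterized (e.g. _cupy_lombscargle_${name}).

import cupy as cp

_kernel_cache = {}
for name, c_type in (("float32", "float"), ("float64", "double")):
    # Assumes _cupy_lombscargle_src takes both $datatype and $name
    src = _cupy_lombscargle_src.substitute(datatype=c_type, name=name)
    module = cp.RawModule(code=src, options=("-std=c++11",))
    _kernel_cache[name] = module.get_function("_cupy_lombscargle_" + name)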
33、7.69(DataType)CuPy(Fast Math)CuPy(FatbirCuPyAdd data type casting to kernel to minimize register usage#page#CUPY CODEUsing -use_fast mathdef _lombscargle(x,y,freqs,pgram,y_dot)if(str(pgram.dtype)in _kernel_cache:kernel = _kernel_cache(str(pgramdtype)else:Addmodule = cp.RawModule(code=_cupy_lombscarg
34、le_src, options=(-std=c+11”,“-use_fast_math”)fastkernel_cache(str(pgram.dtype)= moduleget_function(_cupy_lombscargle_” + str(pgram.dtype)mathkernel = _kernel_cache(str(pgram.dtype)flagdevice_id=cp.cuda.Device()numSM = device_id.attributestMultiProcessorCountthreadsperblock=(128,)blockspergrid =(numS
35、M *20,)kernel_args=(x.shapee,freqs.shapeex,y,freqs,pgram,y_dot,)kernel(blockspergrid,threadsperblock,kernel_args)Source:https:/gitUDy_V4.D#page#CUPY COMPARISONDouble PrecisionSingle PrecisionFirst PassFirst PassAverageAverageRegistersSpeed UpRegistersSpeed Up(ms)(ms)(ms)(ms)1.00SciPy3,803.63,809.63,
36、691.13,679.51.00CuPy582.0361.7113.4108.72123.031820.48CuPy581.636117.90.9109.32456.034171.66(UserCace)CuPy5899.51.632109.20.92408.454077.69(DataType)CuPy3258107.2107.60.21.62415.2116026.22(Fast Math)CuPy(FatbirCuPyPass-use_fast_math flag,only effectiveon single-precision.#page#CUPY CODELoading from
CUPY CODE
Loading from fatbin

def _lombscargle(x, y, freqs, pgram, y_dot):
    if str(pgram.dtype) in _kernel_cache:
        kernel = _kernel_cache[str(pgram.dtype)]
    else:
        # Load precompiled kernels from fatbin
        module = cp.RawModule(path="./_lombscargle.fatbin")
        kernel = _kernel_cache[str(pgram.dtype)] = module.get_function(
            "_cupy_lombscargle_" + str(pgram.dtype))

    device_id = cp.cuda.Device()
    numSM = device_id.attributes["MultiProcessorCount"]
    threadsperblock = (128,)
    blockspergrid = (numSM * 20,)

    kernel_args = (x.shape[0], freqs.shape[0], x, y, freqs, pgram, y_dot)
    kernel(blockspergrid, threadsperblock, kernel_args)
CUPY CODE
Loading from fatbin (continued)

nvcc -fatbin -std=c++11 -use_fast_math \
    -generate-code arch=compute_35,code=sm_35 \
    -generate-code arch=compute_35,code=sm_37 \
    -generate-code arch=compute_50,code=sm_50 \
    -generate-code arch=compute_50,code=sm_52 \
    -generate-code arch=compute_53,code=sm_53 \
    -generate-code arch=compute_60,code=sm_60 \
    -generate-code arch=compute_62,code=sm_62 \
    -generate-code arch=compute_70,code=sm_70 \
    -generate-code arch=compute_72,code=sm_72 \
    -generate-code arch=compute_75,code=[sm_75,compute_75] \
    _lombscargle.cu -odir .

Compile SASS for all target architectures; compile PTX only for 7.5 (the last entry), so newer GPUs can still JIT from it.
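A small sanity check before loading the fatbin, offered as a hypothetical addition: confirm the running GPU is covered by the architectures baked in above.

import cupy as cp

cc = int(cp.cuda.Device().compute_capability)  # e.g. 70 on a V100
fatbin_sass = {35, 37, 50, 52, 53, 60, 62, 70, 72, 75}
if cc not in fatbin_sass and cc < 75:
    raise RuntimeError(
        f"no SASS for sm_{cc}, and PTX is only embedded for compute_75")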
CUPY COMPARISON

[Table: same layout, Fatbin row added; cell values scrambled in extraction.]

Loading the precompiled fatbin removes the runtime compilation cost from the first pass.

CUPY CODE
Using __launch_bounds__

// Template wrapper
template<typename T>
__device__ void _cupy_lombscargle(
        const int x_shape,
        const int freqs_shape,
        const T * __restrict__ x,
        const T * __restrict__ y,
        const T * __restrict__ freqs,
        T * __restrict__ pgram,
        const T * __restrict__ y_dot) {
    // ... kernel body as before ...
}

// __launch_bounds__ matches threads per block (128)
extern "C" __global__ void __launch_bounds__(128) _cupy_lombscargle_float64(
        const int x_shape,
        const int freqs_shape,
        const double * __restrict__ x,
        const double * __restrict__ y,
        const double * __restrict__ freqs,
        double * __restrict__ pgram,
        const double * __restrict__ y_dot) {
    _cupy_lombscargle<double>(x_shape, freqs_shape, x, y, freqs, pgram, y_dot);
}
CUPY COMPARISON

[Table: same layout, LaunchBounds row added; cell values scrambled in extraction.]

Added __launch_bounds__ to the kernels, allowing further compiler optimizations.