Keynote Agenda
Time / Topic / Speakers
13:30-13:40  Opening Remarks - Xipeng Li
13:40-14:20  GPU Programming and Optimization: Best Practices - Bing Liu / Peng Zheng
14:20-14:50  Full Life-cycle LLM Development with NVIDIA NeMo, Using LLaMa2 as an Example - Xin Yao / Zijie Yan
14:50-15:20  TensorRT Hackathon 2023 Recap: In-depth Case Studies of AIGC and LLM Inference - Guang Ji / Yu Chen
15:20-15:40  Acceleration Strategies and Practice for Vector Databases - Yong Wang / Jingrong Zhang
15:40-16:00  Latest Optimization Strategies and Practice for Recommender Systems, Using HPS as an Example - Yingcan Wei / Zehuan Wang

Breakout Discussions and Q&A
Group / Topic / Location / Experts
1  GPU experts: Tensor Core programming Q&A, Nsight example walkthrough - Yuefu Hall 10 - Bing Liu / Peng Zheng / Fan Yu / Meng Wang
2  LLM training: LLM resource analysis, NVIDIA NeMo code walkthrough - Yuefu Hall 11 - Zijie Yan / Li Tao / Xin Yao
3  TRT-LLM and diffusion models: TRT-LLM code walkthrough, demoDiffusion code walkthrough - Yuefu Hall 12 - Guang Ji / Boyang Xue / Yu Chen / Jie Fang
4  Vector databases: Top-k Q&A, RAFT deep dive - open area, left side (opposite Yuefu Hall 12) - Yong Wang / Jingrong Zhang / Jianbing Dong
5  Recommender system training and inference - open area, right side (opposite Yuefu Hall 10) - Yingcan Wei / Zehuan Wang / Yaobin Zhang / Kai Sun

Welcome Remarks
Xipeng Li, General Manager, Developer and Technology, APAC, NVIDIA

GPU Programming and Optimization: Best Practices
Petrick (Bing) Liu, Devtech | Perkz (Peng) Zheng, Devtech

Agenda
CUDA Optimization Fundamentals: understand what global memory coalesced access is; understand what a shared memory bank conflict is; what ILP and TLP are.
Case Study: why fuse the MHA; FMHA as an example.

GPU Architecture
The GPU is a massive-throughput machine: keep the throughput at its maximum.
Full GH100 has 144 SMs (GH100 streaming multiprocessor).
H100 SXM5: DRAM 3352 GB/s; FP32 non-Tensor 66.9 TFLOPS; FP16 dense Tensor 984.9 TFLOPS; FP8 dense Tensor 1978.9 TFLOPS.

Understand What Global Memory Coalesced Access Is: Typical Example
Global memory loads and stores by the threads of a warp are coalesced by the device into as few transactions as possible. The access unit along the DRAM-L2-L1 path is 32 bytes, also called a sector.
Example: the threads in a warp access adjacent float values, so 32 threads access 32 x 4 B = 128 B = 4 x 32 B = 4 sectors:
    float val = ((float*)src)[threadIdx.x];
    T0-T7 = Sector 0, T8-T15 = Sector 1, T16-T23 = Sector 2, T24-T31 = Sector 3
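As a concrete illustration of the access pattern above, here is a minimal CUDA sketch (not from the slides; the kernel name and the offset parameter are illustrative). With offset = 0 every warp issues exactly 4 x 32 B sectors; with a misaligned offset such as 5 the same warp spills into a fifth sector:

    #include <cuda_runtime.h>

    // Each thread copies one float. offset == 0 gives fully coalesced access;
    // a non-multiple-of-8 offset makes each warp touch one extra sector.
    __global__ void copy_with_offset(const float* __restrict__ src,
                                     float* __restrict__ dst,
                                     int offset, int n)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x + offset;
        if (i < n) {
            dst[i] = src[i];   // same pattern as: float val = ((float*)src)[threadIdx.x + offset];
        }
    }

    int main()
    {
        const int n = 1 << 24;
        float *src, *dst;
        cudaMalloc(&src, n * sizeof(float));
        cudaMalloc(&dst, n * sizeof(float));

        dim3 block(256), grid((n + 255) / 256);
        copy_with_offset<<<grid, block>>>(src, dst, 0, n);  // coalesced: 4 sectors per warp
        copy_with_offset<<<grid, block>>>(src, dst, 5, n);  // misaligned: 5 sectors per warp
        cudaDeviceSynchronize();

        cudaFree(src);
        cudaFree(dst);
        return 0;
    }

Profiling both launches with Nsight Compute shows the extra sectors per request in the misaligned case.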
Misaligned Example
The same warp accesses adjacent float values, but with an offset such as 5. The 32 threads still only need 128 B = 4 x 32 B = 4 sectors of data (ideal), but they actually touch 5 sectors:
    float val = ((float*)src)[threadIdx.x + offset];
    T0-T2 = Sector 0, T3-T10 = Sector 1, T11-T18 = Sector 2, T19-T26 = Sector 3, T27-T31 = Sector 4

Stride Access Example
With a stride of 2 floats, one warp still only needs 128 B = 4 sectors of useful data (ideal), but it actually touches 8 sectors:
    T0-T3 = Sector 0, T4-T7 = Sector 1, T8-T11 = Sector 2, T12-T15 = Sector 3, T16-T19 = Sector 4, T20-T23 = Sector 5, T24-T27 = Sector 6, T28-T31 = Sector 7
With a stride of 32 B, every thread lands in its own sector: T0 = Sector 0, T1 = Sector 1, ..., T31 = Sector 31. One warp still only needs 128 B = 4 sectors (ideal) but actually touches 32 sectors.

Stride Access vs Coalesced Access Example
Assume 1024 threads per block and each block copies 4096 elements. Test with the L1 cache enabled and disabled via -Xptxas -dlcm=ca or -Xptxas -dlcm=cg (ca is "cache all", including L1; cg is "cache global", excluding L1).
On an A100-40GB, a total of 400 x 4096 floats was copied with the L1 cache enabled and with it disabled (timings are in the slide's table).
Conclusion: try your best to coalesce every global memory access.

Coalesced Access with Vector Types
CUDA provides built-in vector data types such as float4, float2, int4 and int2, which can be used when the alignment requirements are met. On an A100-40GB, a total of 400 x 8192 floats was copied with the L1 cache enabled (timings in the slide's table).
Conclusion: use vector types to access memory whenever the alignment requirements are met.
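A minimal sketch of that vectorized access (illustrative, not the exact benchmark kernel from the slides): each thread loads one float4, so a warp still maps to 4 sectors per instruction but needs 4x fewer load instructions, which helps saturate bandwidth. The pointer must be 16-byte aligned and the element count a multiple of 4:

    __global__ void copy_float4(const float4* __restrict__ src,
                                float4* __restrict__ dst, int n4)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n4) {
            dst[i] = src[i];   // one 16 B load + one 16 B store per thread
        }
    }

    // Launch over n/4 float4 elements; cudaMalloc returns 256 B-aligned pointers,
    // so reinterpreting the float* buffers as float4* is safe here:
    // copy_float4<<<(n/4 + 255)/256, 256>>>((const float4*)src, (float4*)dst, n/4);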
Understand What a Shared Memory Bank Conflict Is: Official Shared Memory Access Examples
Shared memory has 32 banks, organized so that successive 32-bit words map to successive banks. Each bank has a bandwidth of 32 bits per clock cycle, so shared memory can provide 32 x 4 B = 128 B per cycle.
If the 32 threads of a warp hit 32 different banks, or read the same word (broadcast), the access is conflict-free. If two threads hit different words in the same bank, that is a bank conflict and the access needs twice the time to return the data. In the official programming-guide figures, the linear and odd-stride patterns and the broadcast patterns are conflict-free, while the stride-2 pattern conflicts.

Matrix Transposition as Example
Naive implementation: reading rows and writing columns leaves one of the global memory accesses uncoalesced.
Staging a tile through shared memory makes both global accesses coalesced, but the column-wise shared memory access causes bank conflicts (different colors in the figure mean different warps).
Padding the shared memory tile by one element per row (black cells in the figure) shifts each column into a different bank and makes the access conflict-free.
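A minimal sketch of the padded-shared-memory transpose described above (tile size and names are illustrative). The +1 padding column shifts each row by one bank, so both the row-wise write and the column-wise read of the tile are conflict-free, while both global accesses stay coalesced:

    #define TILE 32

    // in is height x width, out is width x height.
    // Launch: dim3 grid((width+TILE-1)/TILE, (height+TILE-1)/TILE), block(TILE, TILE).
    __global__ void transpose_padded(const float* __restrict__ in,
                                     float* __restrict__ out,
                                     int width, int height)
    {
        __shared__ float tile[TILE][TILE + 1];   // +1 column avoids bank conflicts

        int x = blockIdx.x * TILE + threadIdx.x;
        int y = blockIdx.y * TILE + threadIdx.y;
        if (x < width && y < height)
            tile[threadIdx.y][threadIdx.x] = in[y * width + x];     // coalesced read

        __syncthreads();

        // swap block indices for the transposed coordinates
        x = blockIdx.y * TILE + threadIdx.x;
        y = blockIdx.x * TILE + threadIdx.y;
        if (x < height && y < width)
            out[y * height + x] = tile[threadIdx.x][threadIdx.y];   // coalesced write
    }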
What Are ILP and TLP
TLP: thread-level parallelism. ILP: instruction-level parallelism.
TLP example: Warp0 runs x=x+a, x=x+b, x=x+c; Warp1 runs y=y+a, y=y+b, y=y+c; Warp2 runs z=z+a, z=z+b, z=z+c; Warp3 runs w=w+a, w=w+b, w=w+c. The scheduler has 4 independent groups of threads to execute: we prefer enough threads to do the work.
ILP example: Warp0 alone runs x=x+a, y=y+a, z=z+a, w=w+a, then x=x+b, y=y+b, z=z+b, w=w+b. The scheduler has 4 independent instructions to execute: we prefer one thread to do more work.

Why We Need ILP or TLP
The GPU is a throughput machine, and our goal is to maximize the execution bandwidth. What we need is enough requests in flight to hide the latency, no matter whether they come from ILP or TLP.

To achieve max DRAM bandwidth on an A100-80GB at 1410 MHz: the global memory latency is about 500 cycles and the global memory bandwidth is 1555 GB/s, so we need about 500 cycles x (1555 GB/s / 1.41 GHz) of requests in flight, roughly 536 KB. At 4 B per thread that is 137,216 threads, i.e. about 20 warps per CTA if we launch 216 CTAs; if we only have 216 CTAs x 4 warps per CTA, each thread must request about 20 B, i.e. 5 floats. Test with a 16384 x 800 float vector, run 100 times to get the average executing time, launching 216 blocks of WarpNum x 32 threads.
We can also use both: on the A100-80GB (1555 GB/s), testing with 16384 x 800 float values and 216 blocks in total, TLP_4 means 4 warps per block and TLP_32 means 32 warps per block, while ILP_1 means each thread copies 1 float/float4 and ILP_32 means each thread copies 32 floats/float4s.

To achieve max math throughput on the A100-80GB at 1410 MHz (108 SMs, 19.5 TFLOPS of FP32 on the CUDA cores): for one warp, issuing an FP32 FFMA takes 2 cycles (32 lanes over 16 FP32 units per scheduler) and a dependent FFMA has to wait 4 cycles. Question: how many independent FFMAs do we need? Two, and they can come from ILP or from TLP. One SM can provide at most about 180.5 GFLOPS. In the micro-benchmark, N_ITERATIONS should be big enough that the FFMA instructions take most of the time; run 100 times to get the average executing time, launching 1 block of WarpNum x 32 threads.
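A minimal sketch of the kind of FFMA micro-benchmark these numbers come from (names and the ILP factor are illustrative assumptions, not the exact benchmark code). Each thread keeps ILP independent accumulators so the scheduler always has independent FFMAs to issue while earlier ones are still in flight:

    template <int ILP, int N_ITERATIONS>
    __global__ void ffma_peak(float* out, float a, float b)
    {
        float acc[ILP];
        #pragma unroll
        for (int i = 0; i < ILP; ++i)
            acc[i] = threadIdx.x + i;          // independent starting values

        for (int it = 0; it < N_ITERATIONS; ++it) {
            #pragma unroll
            for (int i = 0; i < ILP; ++i)
                acc[i] = acc[i] * a + b;       // ILP independent FFMAs per iteration
        }

        float r = 0.f;
        #pragma unroll
        for (int i = 0; i < ILP; ++i) r += acc[i];
        out[blockIdx.x * blockDim.x + threadIdx.x] = r;   // keep the work observable
    }

    // e.g. ffma_peak<2, 100000><<<1, WarpNum * 32>>>(d_out, 1.0001f, 0.0001f);
    // With ILP >= 2 per warp (or enough warps) the FFMA pipeline stays full.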
ILP or TLP: Which One Should Be Preferred in the Application?
It depends on the compute density of the problem.
For a memory-bound problem both are fine: there is no data reuse, so just try to maximize the memory bandwidth.
For a compute-bound problem ILP is highly recommended, because the registers provide the highest read/write bandwidth. For an FFMA R1, R2, R3, R4 the register read bandwidth is 3 x 4 B x 128 threads / 2 cycles = 768 B/SM/clock, while the shared memory read bandwidth is 128 B/SM/clock. We should try to keep data in registers to use that bandwidth; otherwise we get stuck on shared memory, L2 or DRAM bandwidth.

Case Study: GEMM Performance Analysis
CUTLASS efficient GEMM implementation: one CTA's execution time is the K-iteration MMA main loop plus the epilogue.
CUTLASS main-loop pipeline:
Prologue: prepare the matA and matB slices (LDG-STS-BAR-LDS instructions); it must wait out the global memory latency.
Main loop: prefetch the next matA and matB slices from global memory, fetch fragments A and B from shared memory, do the FMA or MMA instructions, and repeat until K is completed. Mostly, the global and shared memory latencies are well hidden here, and the main-loop time is proportional to the size of K.
Epilogue: apply the activation and add the bias, store the accumulators to shared memory and reorganize the layout to maximize global bandwidth, then store the matC tile to global memory. Most of the time is spent on the global write-back, and this time is proportional to the size of the matC tile.
Keys to Achieving Peak TFLOPS
In other words, we should keep all the SM units busy during the whole kernel time. The prologue and epilogue keep us from reaching peak performance even though they are essential, and their relative cost depends on whether K is small or large (small-k vs large-k timeline in the slide).
For FMHA, GEMM1 has shape M, N, K = (seq, seq, headsz) and GEMM2 has shape M, N, K = (seq, headsz, seq). headsz is usually small, so K is small and the epilogue time of GEMM1 is large enough to matter. (Unfused timeline figure labels: launch, load activation, load weights, main loop, store activation, main loop, store activation, softmax, load weights.)

Agenda (recap): Case Study - why fuse the MHA; FMHA as an example.

Fused Multi-Head Attention: Reducing the O(N^2) Memory Footprint
O = Softmax(Q K^T) V. A block handles the O matrix one tile at a time, and the CTA loops over q_loop to finish the
41、 full O matrix Pseudo-code 1:Load K,V from GMEM to SMEM 2:Load K,V from SMEM to RF 3:For i in q_loop:4:Load TileQ from GMEM to SMEM 5:Load TileQ from SMEM to RF 6:Compute TileS=TileQ*KT 7:Compute TileP=Softmax(TileS)8:Compute TileO=TileP*V 9:Sum TileO across warps(local split-k reduction)10:Store Ti
leO to GMEM
11: Move the Q and O pointers
Drawbacks: not scalable to larger N, because the O(N) register/shared-memory footprint normally puts the upper bound of the sequence length at about 512. To scale to a longer sequence length, multiple blocks would need inter-block communication; this is still better than unfused kernels.
(Figure: O = Softmax(Q K^T) V, with q_loop stepping over Q tiles; N is the sequence length and D the head size.)

Flash Attention: Scalable Fused Multi-Head Attention
FlashAttention is an IO-aware exact attention algorithm that uses tiling to reduce the number of memory reads/writes between GPU high-bandwidth memory (HBM) and GPU on-chip SRAM.
Target: avoid O(N^2) global memory accesses for any sequence length.
Why: the softmax statistics (the row maximum and the row sum of exponentials) need the full row of the S matrix, while registers and shared memory are not enough to hold it.
How: process S in tiles and keep running statistics. When a later tile raises the running maximum, the accumulated sum and the partial output are rescaled by exp(m_old - m_new); these per-tile factors can be taken as constant normalization scales applied in one pass (element-wise operations), split-k style, like what we do in a normal GEMM. Note: the row max is subtracted before exponentiating, exp(x_i - max), to avoid overflow/underflow.
Potential issue: precision issues may appear if the weights were not trained with Flash Attention.
(Figure: O = Softmax(Q K^T) V with an outer loop over q_loop and an inner loop over kv_loop.)
References:
1. Rabe, Markus N., and Charles Staats. Self-Attention Does Not Need O(N^2) Memory.
2. Dao, Tri, et al. FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness.
3. Dao, Tri. FlashAttention-2: Faster Attention with Better Parallelism and Work Partitioning.
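To make the rescaling trick above concrete, here is a small host-side C++ sketch of the online (one-pass) softmax statistics that the tiled algorithm applies tile by tile; the tile size and names are illustrative, and this is scalar reference code, not the fused kernel:

    #include <algorithm>
    #include <cmath>
    #include <vector>

    // One attention row processed in tiles: keep a running max m and running sum l,
    // and rescale the partial output whenever a later tile raises the max.
    void online_softmax_row(const std::vector<float>& scores,   // one row of S = Q K^T
                            const std::vector<float>& v,        // V values (head_size = 1 for brevity)
                            int tile, float& out)
    {
        float m = -INFINITY;   // running row max
        float l = 0.f;         // running sum of exp(s - m)
        float acc = 0.f;       // running (unnormalized) output

        for (size_t t0 = 0; t0 < scores.size(); t0 += tile) {
            size_t t1 = std::min(t0 + (size_t)tile, scores.size());

            float m_tile = -INFINITY;
            for (size_t i = t0; i < t1; ++i) m_tile = std::max(m_tile, scores[i]);

            float m_new = std::max(m, m_tile);
            float scale = std::exp(m - m_new);   // rescale the old statistics and output
            l *= scale;
            acc *= scale;

            for (size_t i = t0; i < t1; ++i) {
                float p = std::exp(scores[i] - m_new);   // subtract max: no overflow
                l += p;
                acc += p * v[i];
            }
            m = m_new;
        }
        out = acc / l;   // final normalization once, at the end
    }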
Best CUDA Programming Practice Based on FA: Grid-Level Optimization
How do we parallelize tiles into multiple independent blocks?
Sequence Q: yes. Q tiles are totally independent, so we can distribute them to different blocks, which is beneficial for higher GPU occupancy.
Sequence KV: not needed. It would add inter-block communication, and the Q tiles have already occupied the GPU. (Flash decoding, for inference, does split the KV sequence into multiple blocks for higher occupancy, because the q sequence length is 1.)
Head size: not needed; its size is small.
Grid dimension: dim3(Q_tiles, B, H) or dim3(B, H, Q_tiles)? Choose the one that reuses the KV working set in the L2 cache.
Block-Level Optimization
How many warps in one block (TLP)? Ideally we want as many warps as possible so we can handle a large tile size and reuse buffers in shared memory. However, we need to consider that this is limited by register and shared memory resources, and that we want to overlap memory access with compute.
Assume we use 4 warps: how should we map them to Q tiles and KV tiles? Q_tile warps x KV_tile warps = 1 x 4, 4 x 1 or 2 x 2? Choose the mapping that avoids inter-warp communication (softmax statistics reduction, split-k reduction).

Q/KV Tile Size (ILP)
Assume we have 4 x 1 warps; the tile unit size would then be 64 x 16 (MMA.16816).
A larger Q tile size reuses the KV tile in SMEM and makes sure we can fully utilize the SMEM bandwidth.
A larger KV tile size also helps utilize the SMEM bandwidth, but is eventually limited by registers.
Both help overlap memory access and compute.
Thread/Instruction-Level Optimization
Reduce the number of instructions.
Fast math instructions: compute exp(x) as 2^(x * log2(e)), which is one FMUL plus one MUFU.EX2; otherwise we would need roughly 10 instructions just for the exponential.
Don't re-normalize by the latest normalization factor (1/sum) inside the loop; move it to the epilogue of BMM2, which removes many FMULs.
An exp2-based optimization removes many more FMULs by fusing them with FADDs into FFMAs:
Base: multiply by bmm1_scale (FMUL), subtract the row max (FADD), multiply by log2(e) (FMUL), then MUFU.EX2.
New: scaled_max = max * bmm1_scale (one FMUL, reused across the row); then bmm1_output * bmm1_scale - scaled_max is a single FFMA, followed by MUFU.EX2.
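A small CUDA sketch of that instruction-count argument (constants and names are illustrative; the real kernel works on register fragments): fold the BMM1 scale and the max subtraction into one FFMA and evaluate the exponential as a power of two so it maps to a single MUFU.EX2:

    // exp(x) == exp2(x * log2(e)); exp2f maps to one MUFU.EX2 instruction.
    __device__ __forceinline__ float softmax_exp(float bmm1_out,
                                                 float scale_log2e,   // bmm1_scale * log2(e)
                                                 float scaled_max)    // row_max * bmm1_scale * log2(e)
    {
        // one FFMA (scale + max subtraction fused) + one MUFU.EX2
        return exp2f(fmaf(bmm1_out, scale_log2e, -scaled_max));
    }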
Causal Mask
(Figure: Q tiles vs KV tiles under the causal mask.) Skipping the fully masked KV tiles gives a 1.7x-1.8x speedup; the causal mask (the out-of-bounds check) only has to be applied in the last KV tile of each Q tile.
Other tricks: reuse values such as the scale and bias, and make the computation more GPU friendly, e.g. turn the QK scaling into multiplications by a precomputed 1.0f/sqrt(head_size), i.e. plain FMULs.
Thank You!

Full Life-cycle LLM Development with NVIDIA NeMo, Using LLaMa2 as an Example
Xin Yao & Zijie Yan
LLM Whole Life-cycle Development with NVIDIA NeMo Framework. Zijie Yan, Xin Yao | October 22, 2023
Agenda: Introduction of NeMo Framework; Pretraining; Finetuning: SFT and PEFT; RLHF; TRT-LLM.

Introduction of NeMo Framework
NeMo Framework: an end-to-end, cloud-native framework to build, custom
56、ize and deploy generative AI modelsData Curation ScaleExtract,deduplicate,filter info from large unstructured data scaleOptimized TrainingAccelerate training and throughput by parallelizing the model and the training data across 1,000s of nodes.Model CustomizationEasily customize with P-tuning,SFT,A
57、dapters,RLHF,AliBiGuardrailsKeep applications aligned with safety and security requirements using NeMo GuardrailsDeploy at-scale AnywhereRun optimized inference at-scale anywhereMulti-modality supportBuild language,image,generative AI modelsSupportNVIDIA AI Enterprise and experts by your side to kee
58、p projects on trackAIDEVELOPERSENTERPRISE APPLICATIONSQueriesModel DevelopmentIn-domain queriesIn-domain,secure,cited responsesNeMo FrameworkData CurationDistributed TrainingModel CustomizationAccelerated InferenceGuardrailsGenerally availability with NVIDIA AI Enterprise Multi-modal available via e
59、arly access nowStructure of NeMo framework Most of the models are defined in Megatron-Core Low-level layers come from TE DistOpt and DP are implemented in ApexFeatures of NeMo Framework Parallelism strategies Tensor Parallelism Pipeline Parallelism Sequence Parallelism Selective Activation Recompute
60、 Distributed optimizer(ZeRO-1)Context Parallelism MoE Expert Parallelism Expert Tensor Parallelism Functionalities GQA/MQA Distributed checkpointing Offline/Online Export to TRT-LLM Hopper FP8 mixed-precision training via TENeMo LauncherIntroductionWhat problem it solves?Complexity of end-to-end dee
61、p learning workflowDifficulty in setting up optimized model configurationsWhat is the value added?It provides a user-friendly interface for launching end-to-end NeMo Framework workflows across multiple environments.OSS on GitHub:https:/ and customers can build workflows in their preferred environmen
62、tFully customizable end-to-end workflows:Cluster setup and configurationData downloading,curating,and processingModel parallel configurationModel trainingModel fine-tuning(SFT and PEFT)Model evaluationModel export and deploymentOptimized and Fully-tested model training recipesCustomizedConfiguration
63、sLauncherEntry PointLaunching Scripts GeneratorMulti-Platform LaunchersE2E workflowSupported Llama SeriesPretraining3D Parallelism Techniques To Build Foundation ModelNeMo framework offers efficient algorithms to train large-scale models Requires extensive experimentation to configure hyperparameter
64、s Needs state-of-the-art algorithms to process internet-scale data across an entire datacenter Llama Support in NeMo Group Query Attention(GQA)NeMo LauncherFor LLaMa Scripts and code to provide end-to-end data preparation and training for NeMo Framework https:/ Set cluster information Slurm or K8S A
65、ccount Pretraining/SFT/PEFT Stage Data path Number of nodes Automatically generate configs and submit jobsNeMo Launcher Auto Configurator Hyper parameter search for better throughput TP/PP/MBS act_ckpt_layers num_micro_batches_partial_act_ckpt act_ckpt_layers_per_pipeline Step 1:From a set of traini
66、ng and inference constraints,the tool finds the best model size for your needs.Step 2:Given a model size,the tool provides a good configuration:Learning rate,weight initialization,optimizer,weight decay,dropout,data type(fp16,bf16),global batch size Step 3:Given the config from step 2,the tool launc
67、hes a grid search to find the optimal way to train your model,considering your training and inference hardware constraints.How to launch auto-configurator for llama models?In auto_configurator/conf/config.yaml,set the search_config to llama/*b.yaml launcher configurations are stored in auto_configur
68、ator/conf/search_config/llama/base configurations are specified in launcher configurations as template,which are store in auto_configurator/base_configs/Perf NumbersA100 BF16/H100 BF16/H100 FP8 Compared to A100 BF16:GPT 5B:H100 BF16:2.3 2.4X H100 FP8:3.1 3.3X GPT 175B:H100 BF16:2.5X H100 FP8:3.3XFin
69、etuningSupervised FinetuningParameter Efficient FinetuningLLaMa SFT Guide Python commands(following this guide)Download and process the dolly dataset following the.Run the SFT command by appropriately setting the values for the parameters such as the number of steps,model checkpoint path,batch sizes
70、 etc.For a full reference of parameter settings refer to the config file Write scripts and start SFT.Run evaluation.You could also launch SFT job with NeMo Launcher In conf/config.yaml,set the fine-tuning configuration to llama/squad,which is stored in conf/fine-tuning/llama/squad.yamlNeMo Framework
Supervised Fine-Tuning (SFT) with Llama2
Step 1: Convert the instruction dataset(s) (https:/). A sample record looks roughly like: input: "What is HiFi? High fidelity (often shortened to Hi-Fi or HiFi) is the high-quality reproduction of ...", output: "Bell Laboratories began experimenting with a range of recording techniques in the early 1930s. ...", category: "information_extraction".
Step 2: SFT training (https:/).

Finetuning Guide: PEFT (Parameter-Efficient Finetuning)

RLHF
Reinforcement Learning from Human Feedback pipeline:
1. Supervised fine-tuning of the LLM: 10k-100k prompt-responses as input; fine-tune the LLM using the prompts and responses.
2. Train a reward model with human feedback: 100k-1M respons
73、es ranked and ratedreward model trained to mimic human feedback of model generated responses to promptsBuild pipeline with RLHF to continuously improve model over time4 models:PPO Policy Network,PPO Value Network,Reward Model,Initial PolicyReinforcement Learning from Human Feedback(RLHF)Open-source,
74、scalable and distributed library to fine-tune LLMs of any size using RLHFHumans Rating ResponsesReward(Preference)ModelPromptRewardReward ModelPPOResponse123NeMo-RLHF Stage 3(PPO)Architecture 4 Models interact with each other in separate nodes:RM doing inference.Initial Policy doing inference.PPO Va
75、lue/Actor network doing training&inference.PPO Policy network is running as a standard NeMo job and sends data to the servers as needed over HTTP.ActorRunning PPO Jobs on Selene:Heterogeneous JobsOptimizing the Performance of PPO RL requires switching between generation and training in a loop.Genera
76、tion is slow on LLMs,up to 90%of our E2E time Leverage TRT-LLM for its impressive generationperformance.BottleneckOptimizing the Performance of PPONeMo TRT-LLM Integration(Coming soon)Features:Efficient weight update.OOTB TRT-LLM inference optimizations.Memory optimizations.Support Data Parallel,Ten
77、sor Parallel,Pipeline Parallel and 3DParallel.Allows changing pipeline size in inference to maximize performance.Ease of use:Can be invoked with just 3 lines of code.Automatically reuses the parallelism settings used during training.Automatically builds and updates the TensorRT engine.Key Takeaways:
Up to 8x E2E speedup: a substantial performance improvement compared to torch-based generation.
Scalability: scaling RLHF to large clusters and massive models.

Optimizing the Performance of PPO: TRT-LLM as the Generation Backend (Coming Soon)
RLHF E2E acceleration test (preliminary performance projection, subject to change):

  Model      | Config          | E2E Time (s) | Gen. Time (s) | Gen./E2E
  LLaMa-7B   | baseline        | 0.368        | 0.347         | 94.3%
             | w/ TensorRT-LLM | 0.093        | 0.072         | 77.4%
             | Speedup         | 3.96x        | 4.82x         |
  LLaMa-13B  | baseline        | 0.314        | 0.282         | 89.8%
             | w/ TensorRT-LLM | 0.095        | 0.063         | 66.3%
             | Speedup         | 3.31x        | 4.48x         |
  LLaMa-65B  | baseline        | 0.803        | 0.742         | 92.4%
             | w/ TensorRT-LLM | 0.192        | 0.131         | 68.2%
             | Speedup         | 4.18x        | 5.66x         |
  GPT-8B     | baseline        | 0.474        | 0.455         | 96.0%
             | w/ TensorRT-LLM | 0.054        | 0.035         | 64.8%
             | Speedup         | 8.78x        | 13.00x        |
  GPT-22B    | baseline        | 0.515        | 0.449         | 87.2%
             | w/ TensorRT-LLM | 0.105        | 0.039         | 37.1%
             | Speedup         | 4.90x        | 11.51x        |

Optimizing the performance of PPO: RL requires switching between generation and training in a loop, and generation is slow on LLMs, up to 90% of the E2E time. Optimizations in N
81、eMo-RLHF To speedup generation Adam state offloading Distributed log prob gradient flow TRT-LLM as the generation backend(Coming Soon)To reduce idling Combining reward model and critic server with CPU on/offload Combining initial policy and actor with CPU on/offloadInference w/TRT-LLMExport to TRT-L
LM
Export and deploy in a few lines of code:
    trt_llm_exporter = TensorRTLLM(model_dir=model_info["trt_llm_model_dir"])
    trt_llm_exporter.export(nemo_checkpoint_path=model_info["checkpoint"], n_gpus=1)
    output = trt_llm_exporter.forward(["test1", "how about test 2"])

Coming Soon
FSDP; scaling the sequence length to 128k;
4D parallelism (TP+PP+CP+DP); Mixture of Experts (MoE); new alignment methods: rejection sampling.
Thank You!

TensorRT Hackathon 2023 Recap: In-depth Case Studies of AIGC and LLM Inference
Guang Ji & Yu Chen

TensorRT Hackathon 2023 Recap
Guang Ji, Manager, GPU Accelerated Computing Team | 2023/10/22

About the TensorRT Hackathon
The TensorRT Hackathon is NVIDIA's annual programming contest in China.
Motivation: serve the contestants and teach hands-on, sharing best practices for TensorRT programming; accumulate results and grow the impact, building the TensorRT developer ecosystem.
Non-motivation: recruiting volunteers to build a feature for NVIDIA.

Preliminary Round (7.11-8.14)
The preliminary round had a fixed topic: optimize the full Stable Diffusion + ControlNet pipeline.
It was run in cooperation with Alibaba Tianchi, using the platform's recruiting and publishing facilities. The evaluation program scored both image quality and latency, with results published on a live leaderboard.
To help contestants learn TensorRT programming, NVIDIA updated its online TensorRT video tutorials: five episodes, nearly 4 hours in total (search for "TensorRT 8.6 tutorial"). After the preliminary round there was a walkthrough of the reference implementation (search for "TensorRT 2023 preliminary recap").
A contestant discussion group was set up for exchanges with NVIDIA engineers. NVIDIA watched the leaderboard and kept releasing hints to push scores higher.

Final Round (8.17-9.21)
The final round had an open topic: pick TensorRT or TensorRT-LLM and develop on cloud machines provided by the organizer. Teams could optimize a model, or add features to one (e.g. weight-only quantization, INT8 KV cache, SmoothQuant). The code had to run and ship with its own performance and accuracy evaluation, and a technical report was required.
Scoring: the main score came from the topic, the code and the technical report; extra points for submitting bugs and for using specific programming and optimization techniques.
Each team had a small WeChat group with a dedicated mentor. Mentor duties: track progress (topic discussion, mid-term review and final review at fixed checkpoints), answer technical questions, evaluate the results, confirm and file bugs, and score. Eight mentors formed the scoring committee; each judge scored independently, the scores were averaged and then combined with the weighted preliminary score for the final ranking.

Results
729 teams registered; 85 teams submitted a valid preliminary entry; 1750 rounds of code submissions, each scored automatically; 40 teams entered the final round. The qualifying score was 5462, far above the originally expected 4000 (the simplest implementation scores about 2700).
Final-round outcomes: about 20 teams completed model development and optimization; the best entries were of very high quality and, with the contestants' consent, can be integrated into TensorRT-LLM; a large amount of code samples, technical documents and experience write-ups were produced and published; several valid bugs were filed.

Summary
What characterized the event: a focus on technical exchange; a long schedule of more than two months; a large commitment from contestants (code plus report) and mentors (Q&A, confirming results, filing bugs); contestants were encouraged to share and discuss with each other and to interact with NVIDIA engineers.
NVIDIA's investment aimed at helping contestants improve: the main cost was a large group of engineers acting as part-time mentors; secondary costs were the cloud machines in the second half of the event and the cooperation with Tianchi; the rest was a small cash prize. Costs were saved by having no offline defense session and no award ceremony.
A few regrets: no A100s could be secured for the final round, and the preliminary discussion group was disbanded by mistake.
Special thanks to Alibaba Tianchi for the support! Welcome to TensorRT Hackathon 2024 next year!

Accelerating LLaMA Inference with Quantization in TensorRT-LLM
Yu Chen
Agenda: how to build & run LLaMA in TensorRT-LLM; how to quantize LLaMA in TensorRT-LLM; accuracy and performance.

HOW TO BUILD & RUN LLAMA IN TRT-LLM
The LLaMA example lives in TensorRT-LLM/examples/llama: build.py builds the TRT engines of LLaMA; run.py r
89、un the TRT engines with pythonThe summarization task to validate the accuracy of modelsweight.pyLoad weight&quantize weightLLaMA example100 HOW TO BUILD&RUN LLAMA IN TRT-LLMTensorRT-LLM/tensorrt_llm/models/llama/model.pyDefine the model with TRT-LLM APITensorRT-LLM/tensorrt_llm/models/quantized/quan
90、t.pyReplace fp16 ops with quantized opsWeight Only Quantization:Per-channel INT8/INT4、AWQ-INT4、GPTJ-INT4INT8 SmoothQuantFP8Load Weight&Quantize WeightBuild TRT Engine101 HOW TO BUILD&RUN LLAMA IN TRT-LLMTRT-LLM APIs based on TRT python API TRT customized PluginModel Definition102 HOW TO BUILD&RUN LL
91、AMA IN TRT-LLMTensorRT-LLM/tensorrt_llm/functional.pyTRT-LLM APIs based on TRT python APIsTRT customized pluginTRT-LLM APIs103 How to build&run LLaMA in TensorRT-LLMHow to quantize LLaMA in TensorRT-LLMAccuracy and performanceAgenda104 HOW TO QUANTIZE LLAMA IN TRT-LLMReplace FP16 gemms with weight o
nly GEMMs recursively, without affecting other ops (Weight-Only Quantization).

INT8 KV Cache
Enable the INT8 KV cache with the flag use_int8_kv_cache.

INT8 SmoothQuant
Use specific ops to smooth the activations and handle the different quantization strategies.
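As a rough illustration of what "one scale per channel" means for the weight-only path (a simplified host-side sketch under that assumption; the actual TensorRT-LLM preprocessing also interleaves and permutes the weights for its CUTLASS kernels):

    #include <algorithm>
    #include <cmath>
    #include <cstdint>
    #include <vector>

    // W is [n, k] row-major, one output channel per row.
    // Each channel gets scale = max(|w|) / 127; the GEMM kernel later computes
    // fp16_activation * (int8_weight * scale) on the fly.
    void quantize_per_channel_int8(const std::vector<float>& W, int n, int k,
                                   std::vector<int8_t>& Wq, std::vector<float>& scales)
    {
        Wq.resize((size_t)n * k);
        scales.resize(n);
        for (int row = 0; row < n; ++row) {
            float amax = 0.f;
            for (int col = 0; col < k; ++col)
                amax = std::max(amax, std::fabs(W[(size_t)row * k + col]));
            float scale = amax > 0.f ? amax / 127.f : 1.f;
            scales[row] = scale;
            for (int col = 0; col < k; ++col)
                Wq[(size_t)row * k + col] =
                    (int8_t)std::lrintf(W[(size_t)row * k + col] / scale);
        }
    }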
93、 HOW TO QUANTIZE LLAMA IN TRT-LLMFP8 QuantizationUse TRT Q/DQ workflowCast inside FP8Linear/FP8RowLinear108 How to build&run LLaMA in TensorRT-LLMHow to quantize LLaMA in TensorRT-LLMAccuracy and performanceAgenda109 Quantization Granularity of INT8 SmoothQuant&FP8ACCURACY AND PERFORMANCEQuantizatio
n granularity:
INT8 SmoothQuant: per-tensor, or per-token + per-channel.
FP8: per-tensor.

  Method                                             | rouge1 | rouge2 | rougeL | rougeLsum
  Huggingface FP16                                   | 22.98  | 7.80   | 16.35  | 20.10
  TRT-LLM INT8 SmoothQuant, per-tensor               | 6.99   | 1.02   | 6.32   | 6.59
  TRT-LLM INT8 SmoothQuant, per-token + per-channel  | 23.04  | 7.87   | 16.83  | 20.38
  TRT-LLM FP8, per-tensor                            | 22.92  | 7.71   | 16.73  | 20.04

95、Accuracy test of LLaMA 7B on 1xH100 with summarize.py(max_ite=200)110 Weight Only Quantization ImplementationACCURACY AND PERFORMANCEWeight Only Quantization Algorithms:Per-channel INT8 weight only:one scale for one channelPer-channel INT4 weight only:one scale for one channelGPTJ INT4 weight only:g
roupwise; supports zero point and bias; one scale per group; group size 64 or 128.
AWQ INT4 weight-only: groupwise; supports zero point and bias; one scale per group; group size 64 or 128.
FP16 + INT* kernel types: the CUTLASS kernel (FP16 Tensor Core) is enabled when m >= SMALL_M_FAST_PATH, and the CUDA-core kernel when m < SMALL_M_FAST_PATH, with SMALL_M_FAST_PATH = 5.

Weight-Only Quantization Accuracy
Accuracy test of LLaMA 7B on 1x H100 with summarize.py (max_ite=200):

  Method                                                      | rouge1 | rouge2 | rougeL | rougeLsum
  Huggingface FP16                                            | 22.98  | 7.80   | 16.35  | 20.10
  TRT-LLM INT8 weight-only, per-channel                       | 23.12  | 7.96   | 16.51  | 20.31
  TRT-LLM INT4 weight-only, GPTJ                              | 23.76  | 8.26   | 17.06  | 20.73
  TRT-LLM INT4 weight-only, AWQ with quantized lm_head        | 23.24  | 8.05   | 17.23  | 20.21
  TRT-LLM INT4 weight-only, per-channel                       | 22.22  | 7.09   | 16.00  | 19.43
  TRT-LLM INT4 weight-only, per-channel, mlp.fc kept as FP16  | 22.97  | 7.62   | 16.56  | 19.84

Performance
Speedup vs TRT-LLM FP16 for LLaMA 7B on 1x A100-80GB with (input seqlen, output seqlen) = (256, 256); the higher the better (chart in the slide).
THANK YOU!

Acceleration Strategies and Practice for Vector Databases
Yong Wang & Jingrong Zhang
Approximate Nearest Neighbor (ANN) Search on GPU: Accelerating the Engine of the Vector Database
Yong Wang, Compute DevTech | Oct 22, 2023
Vector Database / Approximate Nearest Neighbor (ANN) Search / RAPIDS RAFT (https:/)
Brute-force: GEMM + sorting (#include <cublas_v2.h>, #include <cub/cub.cuh>), or GEMM + Top-K.
Parallel Top-K: (figure) from the input values and their indices, the k best results are selected in parallel; with on-the-fly processing, the partial results from time step i are merged with the new candidates at time step i+1 instead of re-sorting everything.
(Chart: Top-K vs GEMM time for 1M to 1B vectors; dim 100, dtype float.)
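A rough single-query sketch of that brute-force path, assuming inner-product similarity (function names and memory management are illustrative, not RAFT's API; a production version would batch queries and use a fused top-k instead of a full sort):

    #include <cuda_runtime.h>
    #include <cublas_v2.h>
    #include <cub/cub.cuh>

    __global__ void iota(int* idx, int n)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) idx[i] = i;
    }

    // queries: [q x d] row-major, dataset: [n x d] row-major, both on the GPU.
    // Step 1: scores = queries * dataset^T with one SGEMM.
    // Step 2: top-k for query 0 via a descending key-value radix sort.
    void brute_force_topk(cublasHandle_t handle,
                          const float* d_queries, const float* d_dataset,
                          float* d_scores,            // [q x n] row-major
                          int q, int n, int d, int k,
                          float* d_topk_scores, int* d_topk_ids)
    {
        const float alpha = 1.f, beta = 0.f;
        // Row-major C = A * B^T is column-major C^T = B * A^T.
        cublasSgemm(handle, CUBLAS_OP_T, CUBLAS_OP_N,
                    n, q, d, &alpha,
                    d_dataset, d,
                    d_queries, d,
                    &beta, d_scores, n);

        int *d_idx_in, *d_idx_out;
        float* d_keys_out;
        cudaMalloc(&d_idx_in, n * sizeof(int));
        cudaMalloc(&d_idx_out, n * sizeof(int));
        cudaMalloc(&d_keys_out, n * sizeof(float));
        iota<<<(n + 255) / 256, 256>>>(d_idx_in, n);

        void* d_temp = nullptr;
        size_t temp_bytes = 0;
        cub::DeviceRadixSort::SortPairsDescending(d_temp, temp_bytes,
                                                  d_scores, d_keys_out,
                                                  d_idx_in, d_idx_out, n);
        cudaMalloc(&d_temp, temp_bytes);
        cub::DeviceRadixSort::SortPairsDescending(d_temp, temp_bytes,
                                                  d_scores, d_keys_out,
                                                  d_idx_in, d_idx_out, n);

        // The first k sorted entries are the top-k neighbours of query 0.
        cudaMemcpy(d_topk_scores, d_keys_out, k * sizeof(float), cudaMemcpyDeviceToDevice);
        cudaMemcpy(d_topk_ids, d_idx_out, k * sizeof(int), cudaMemcpyDeviceToDevice);

        cudaFree(d_temp); cudaFree(d_keys_out);
        cudaFree(d_idx_in); cudaFree(d_idx_out);
    }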
IVF-Flat
Inverted File (IVF) index building: (figure) the dataset is partitioned into clusters around centroids C1, C2, C3, and each vector is stored in the inverted list of its nearest centroid.
Search: (figure) for a query q, only the clusters whose centroids are closest to q are probed, and the query is compared against the vectors in those lists.

Balanced Hierarchical k-Means
Balanced cluster sizes; e.g. 100M vectors, 10% used for training, 100,000 clusters, about 2 minutes.
1. Train meso-clusters on the training set (e.g. a 10% subset): meso-cluster 1, 2, 3.
2. Train clusters inside each meso-cluster: clusters 1-9.
3. Refine the clusters without the meso-cluster level.
4. Fit the whole dataset to the clusters.
(Scale guide: brute-force handles roughly 1M vectors and IVF-Flat roughly 10M-100M; IVF-PQ targets the 1B scale; dim 100, dtype float.)

IVF-PQ: Product Quantization
Product Quantization (PQ) splits each vector into M sub-vectors:
  x = [x_1, ..., x_{D/M} | x_{D/M+1}, ..., x_{2D/M} | ... | x_{D-D/M+1}, ..., x_D] = [x^(1), x^(2), ..., x^(M)]
Each sub-space m has its own codebook C^(m) with centroids c^(m)_1, ..., c^(m)_k. A vector is encoded by the ids of its nearest centroid in every sub-space, e.g. the PQ-code (i, j, ..., h); only this code is stored.
Lookup table and distance computation: for a query y, precompute d(y^(m), c^(m)_t) for every sub-space m and centroid t. The distance to a database vector is then approximated by summing M table entries; e.g. if the PQ-code of x is (2, ..., k), then d(y, x) is approximately d(y^(1), c^(1)_2) + ... + d(y^(M), c^(M)_k).

IVF-PQ in RAFT: arbitrary number of PQ dimensions; 4 or 8 bits per dimension; lookup table optimizations: shared memory, custom FP8.
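A small host-side C++ sketch of that lookup-table distance computation (names are illustrative; RAFT's implementation keeps the table in shared memory and uses a custom FP8 format, and the table would normally be built once per query and reused for all codes):

    #include <cstdint>
    #include <vector>

    // M sub-spaces, k centroids per sub-space, sub_dim = D / M.
    // codebook[m][c][j] holds component j of centroid c in sub-space m.
    float pq_distance(const std::vector<float>& query,                 // length D
                      const std::vector<std::vector<std::vector<float>>>& codebook,
                      const std::vector<uint8_t>& pq_code,              // length M
                      int M, int k, int sub_dim)
    {
        // Lookup table: lut[m][c] = squared distance between the query's m-th
        // sub-vector and centroid c of sub-space m.
        std::vector<std::vector<float>> lut(M, std::vector<float>(k, 0.f));
        for (int m = 0; m < M; ++m)
            for (int c = 0; c < k; ++c)
                for (int j = 0; j < sub_dim; ++j) {
                    float diff = query[m * sub_dim + j] - codebook[m][c][j];
                    lut[m][c] += diff * diff;
                }

        // Asymmetric distance: one table lookup and add per sub-space.
        float dist = 0.f;
        for (int m = 0; m < M; ++m)
            dist += lut[m][pq_code[m]];
        return dist;
    }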
CAGRA (Cuda Anns GRAph-based), arXiv:2308.15136
Search in CAGRA: (figure) a greedy graph traversal toward the query q. Starting from an entry node, the candidate list and the visited set grow step by step (candidates 1; then 2,3; then 5,4,3; then 6,4,3; then 8,7,4; then 9,7,4), and the best visited node is returned as the result.
A graph-based method on the GPU is good for low latency. Large batch: a forgettable hash table that can fit in shared memory. Small batch: multi-CTA execution to increase parallelism.
Proximity graph in CAGRA: a k-nearest-neighbor graph, with graph optimization that is rank-based instead of distance-based.
NN-Descent (Nearest Neighbor Descent): avoids the all-pairs distance computation; 100M vectors in about 10 min
utes.
(Scale guide figure: brute-force ~1M, IVF-Flat ~10M-100M, IVF-PQ ~1B, with CAGRA covering the range; dim 100, dtype float.)

Benchmark for GPU Implementations (https:/)
Key takeaways: push the performance limit of a vector database on a single node, single GPU; demystify ANN on GPU vs. CPU.
RAFT ANN algorithms: brute-force, IVF-Flat, IVF-PQ, CAGRA. Utilities: Top-K, balanced hierarchical k-means, NN-Descent for the kNN graph, the GPU benchmark.
Acknowledgements: Akira Naruse, Artem Chirkin, Christina Zhang, Mahesh Doijade, Hiroyuki Ootomo, Ray Wang, Tamas Feher.

Parallel Top-K Algorithms on GPU: A Comprehensive Study and New Methods
Christina Zhang, Devtech | Devtech AI Openday

Top-K Problem: Definition
The top-k problem: find the k smallest (or largest) elements in a list. The following features should be supported: batching; returning the index together with the value; on-the-fly processing (optional). (Figure: find the top-2 smallest elements out of 6 elements.)
Applications: Approximate Nearest Neighbor (ANN) search, network sparsification, drug discovery, databases.

Contributions
Welcome to our talk for further details. Contributions: AIR Top-K; GridSelect (a shared queue on top of warp queues, with a merging network when the shared queue is full); a comprehensive benchmark.

Known GPU Parallel Algorithms
Top-K algorithms fall into sorting, partial sorting methods, partition-based methods and hybrid methods. http
113、utionsWelcome to our talk for further details ContributionsShared QueueWarp Queue01 31Merging NetworkWhen the shared queue is full01 31 AIR Top-K GridSelectComprehensive BenchmarkKnown GPU Parallel AlgorithmsSortingTop-K AlgorithmsSortingPartial SortingMethodsPartition-basedMethodsHybrid Methodshttp
114、s:/nvlabs.github.io/cub/structcub_1_1_device_radix_sort.html The most straightforward one is sorting Parallelly sorts all elements and then extracts the first items.cub:DeviceRadixSort provides device-wide,parallel operations for computing a radix sort across a sequence of data items residing within
115、 device-accessible memory.Known GPU Parallel AlgorithmsPartial Sorting Methods:WarpSelect WarpSelect maintains a thread queue for each thread and there are 32 thread queues in total for a warp.When any thread queue is full,a bitonic sorting is carried out to sort all thread queues and then a bitonic
116、 merging is used to merge thread queues into the results.To simplify the introduction,we use 4 threads to process one case.Known GPU Parallel AlgorithmsPartial Sorting Methods:WarpSelect Then warpSelect will perform sorting on the current warp queue and the thread After sorting,the warp queue will o
117、nly keep the k least elements in the warp queue,i.e.10,11,12,13.Known GPU Parallel AlgorithmsPartial Sorting Methods:WarpSelect Now we can update the current Kth smallest element based on the result of the warp queue.In this figure,its value is 13.Known GPU Parallel AlgorithmsPartial Sorting Methods
118、:WarpSelect So,we will perform one sorting so that we can update the current K smallest elements in the warp queue.And then we can update the Kth smallest element.Here we use an example to show how the parallel radix top-K works.Find the top =4 elements from a list of =9 elementsKnown GPU Parallel A
119、lgorithmsPartition-based Methods:Radix SelectCompute histogramCompute inclusive prefix sum of the histogramFind target digitFilteringCandidates for next pass Here we use an example to show how the parallel radix top-K works.Find the top =4 elements from a list of =9 elementsCompute histogramCompute
120、inclusive prefix sum of the histogramFind target digitFilteringCandidates for next passKnown GPU Parallel AlgorithmsPartition-based Methods:Radix Select Find the top =4 elements from a list of =9 elementsFirst PassSecond PassCompute histogramCompute inclusive prefix sum of the histogramFind target d
121、igitFilteringKnown GPU Parallel AlgorithmsPartition-based Methods:Radix SelectAIR Top-K:Adaptive and Iteration-fused Radix Top-KContributions Iteration-fused Design Adaptive Strategy Based on Data Distribution Early StoppingGridSelect:Optimize WarpSelect with Shared Queue and Multiple thread blocksC
122、ontributions Shared QueueShared QueueWarp Queue01 31Merging NetworkWhen the shared queue is full01 31Parallel two-step insertionMultiple BlocksBlock-0 Block-1 Block-N The running time with respect to varying from 23 to 220and a constant GridSelect are more likely to be faster than AIR Top-when 256 A
123、IR Top-is the fastest one for other casesThe performance variation with increased and a constant (32,256,32768)If you are interested to our AIR Top-K and GridSelect,please join us in the next discussion session.More detailed experiments data and how to use it will be introduced.Thank You!推荐系统的最新优化策略
124、和实践 以HPS为例 魏 英 灿&王 泽 寰 推荐系统的最新优化策略和实践-以HPS为例魏英灿 王泽寰 10/22/2023BasicsHierarchical Parameter Server(HPS)GPU Embedding Inference Cache Evaluation and PerformanceAgendaBasics Most embedding recall datasets exhibit some sort of power law distribution.Frequent embeddings:Small set;accessed in almost every
125、,or every other input batch.Stochastic embeddings:Reoccur somewhat regularly.Rare embeddings:Large set;appear occasional.Number of rare embedding recalls per input batch is low.Usually,not a fixed distribution!RecSys query distributionAccess pattern during inferenceBasicsA GPU-specialized Inference
126、Parameter Server for Large-Scale Deep Recommendation ModelsHierarchical Parameter Server(HPS)Hierarchical Embedding StorageHigh Performance CachingConcurrent Inference ExecutionExtensible BackendsEasy DeploymentHierarchical Parameter Server Modular hierarchical storage layer architecture to extend G
127、PU memoryHierarchical Parameter ServerGPU 4Lookup SessionEmbedding CacheGPU 3Lookup SessionEmbedding CacheGPU 2Lookup SessionEmbedding CacheCPU Memory(e.g.,Redis,HashMap,)GPU 1Lookup SessionSSD(e.g.,RocksDB,)Level 2Embedding CacheLevel 3Level 1Level 1:Persistent Database(PDB)Use hard-disks or SSD st
128、orage volumes.Full copy of entire embedding table or cache for remote PS replicationLevel 2:Volatile Database(VDB)System memory can be extended at lower costs than GPU memory.Act as an async cache to PDBSupport distributed deploymentLevel 3:Embedding Inference CacheLight-weight sync mechanism allow
129、simultaneous query+insert/update at any time.Optimized difference implementation for customized use case(Static,Dynamic,UVM).Can be shared by multiple inference instances.Decoupled from VDB and PDBNote:in EC static embedding mode(G+H ComputeX Benchmark),Level 1 will not be used,andEmbedding Cache/GP
130、U can access CPU Memory in Level 2 directly as option.A GPU-specialized Inference Parameter Server for Large-Scale Deep Recommendation ModelsGPU Embedding Inference CacheEmbedding Cache Implementation CategoriesDevice-lock ImplementationHost-lock ImplementationLock-free ImplementationGPU Embedding I
nference Cache (EIC)
A Basic Cache Model: the Set-Associative Embedding Cache
(Figure 1: set-associative embedding cache structure. A key is hashed to a set index, e.g. set_idx = Hash(key) = 2; the tags of that set are compared against the new key, e.g. key 42, and on a hit the corresponding embedding slot is returned.)

Device-Lock Implementation
(Figure 2: GPU embedding cache data model with a device lock.)
num_sets is the number of tag sets and is a user-set value. The term "tag" here is not strictly consistent with the tag in a CPU cache.
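A minimal CUDA sketch of the set-associative lookup described above (the structure layout, hash, and way count are illustrative assumptions, and locking and counters are omitted; the real cache assigns a group of threads per key and copies the embedding vector cooperatively):

    constexpr int WAYS = 8;

    struct TagSet {
        unsigned long long tags[WAYS];   // 64-bit tag per way
        int                slots[WAYS];  // index into the embedding store
    };

    __device__ __forceinline__ unsigned long long hash_key(unsigned long long k)
    {
        return k * 0x9E3779B97F4A7C15ull;   // cheap mix, illustrative
    }

    __global__ void ec_query(const unsigned long long* keys, int num_keys,
                             const TagSet* sets, int num_sets,
                             const float* store, int emb_dim,
                             float* out, int* missing)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i >= num_keys) return;

        unsigned long long key = keys[i];
        const TagSet& set = sets[hash_key(key) % num_sets];

        int slot = -1;
        for (int w = 0; w < WAYS; ++w)          // compare against all ways
            if (set.tags[w] == key) { slot = set.slots[w]; break; }

        if (slot >= 0) {
            for (int d = 0; d < emb_dim; ++d)   // hit: copy the cached embedding
                out[(size_t)i * emb_dim + d] = store[(size_t)slot * emb_dim + d];
        } else {
            missing[i] = 1;                     // miss: fetch from the lower level later
        }
    }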
Lookup flow with the device lock: set_idx = key % num_sets (or Hash(key) % num_sets); lock the tag set, compare the key against the tags, read the embedding and update the counters, unlock the tag set, and output the embedding on a hit (https:/).
Pros: locks only block queries that land in the same tag set, so most queries are not impacted, and it is easy to implement.
Cons: locking and unlocking add time (several times slower for the key-to-indices step); queries into the same tag set contend for locks; all keys have to be de-duplicated, which introduces additional time consumption.

Host-Lock Implementation
Cons: the modification operation must be performed on the CPU, and additional synchronization is requir
135、ed to ensure the correctness of the looks-up.Suitable for the case with infrequent modifications.Set_idx=key%num_setskeytag setembeddinghitoutputset_idx=key%num_setskeytag setembeddingsAcquire the Host RW lockunlock the RW lockhitoutput embeddinghttps:/ Modification does not block the query kernel.O
136、ffload the counter storage to the Host The query kernels can run at full speed.GPU Embedding Inference Cache(EIC)Lock-free ImplementationGPU Embedding Inference Cache(EIC)Number of ways and size of tag for Lock-free Embedding CacheWays in tag set is number of tags in a tag setA key will be mapped to
a tag set and then compared with all the tags in it. (Figure: an 8-way tag set, ways 0-7.) Assuming 64 bits per tag, each tag set is then 64 bytes.
Trade-off on the size of a tag set: a larger tag set (more than 8 ways, more than 64 bytes) gives a higher hit rate with fewer hash collisions, but it increases the tag-read latency and uses more memory.
Latency for around 8M key accesses per batch:

  Access granularity (bytes) | Latency on H100 (ms) | Latency on A100 (ms) | Latency on A10 (ms)
  32                         | 0.266                | 0.479                | 1.42
  64                         | 0.266                | 0.485                | 1.46
  128                        | 0.358                | 0.631                | 2.31

Access granularity below 64 bytes does not reduce latency, so 64 bytes per tag set should be a better choice.
Cache (EIC): Lock-Free Implementation for the Modify/Inference Workflow
(Figure: inference kernels 0-8 keep running while a modification is applied. Modify() calculates the replacement candidates, waits on a signal object for the in-flight inference kernels, invalidates the affected entries, calls SetReplaceData(), and finally commits. https:/)

Evaluation and Performance
Embedding cache performance with different NVIDIA GPUs.
Figure 3: speedup of the device-lock vs the lock-free embedding cache for the Criteo dataset. Dataset: Criteo; PHM: parallel hashmap; CPU: AMD EPYC 7742, 16 cores, 32 threads, SMT.
HPS E2E performance with different NVIDIA GPUs.
Figure 4: speedup of HPS E2E vs the CPU workflow for the TenRec dataset. Dataset: TenRec; CPU workflow: PHM embedding lookup + embedding vector H2D copy; CPU: AMD EPYC 7742, 16 cores, 32 threads, SMT.
Thank You!

Lucky Draw
Technical discussion and Q&A with NVIDIA experts: Yuefu Hall 10, Yuefu Hall 11, Yuefu Hall 12, the open area on the left (opposite Yuefu Hall 12), and the open area on the right (opposite Yuefu Hall 10).