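DataFunSummit #2024

EasyRec: Training and Inference Optimization for Recommendation Algorithms
Cheng Mengli (程孟力), Alibaba Cloud, Senior Algorithm Expert

Contents
- EasyRec training and inference architecture
- EasyRec inference optimization
- EasyRec training optimization
- Online Learning

01 EasyRec Training and Inference Architecture

Trends and challenges for recommendation models

Trends:
- More and more features: from 200 to 2000, most of them cross features
- Embeddings keep getting larger:

- bfloat16 saves 50% memory
- bfloat16 has essentially no effect on AUC
- TensorFlow's native bfloat16-to-float conversion is too slow

metric               fp32     bfp16    bfp16_embed  int8_embed
is_valid_play_auc    0.9008   0.9008   0.9008       0.8964
is_like_auc          0.9563   0.9563   0.9563       0.9463
is_comment_auc       0.9283   0.9283   0.9283       0.9154
ln_play_time_mse     0.8872   0.887    0.8872       0.9007
ln_play_time_mae     0.5841   0.5842   0.5841       0.5939

    /* Widen 8 bfloat16 values to 8 float32 lanes: zero-extend each value
       to 32 bits, then shift left by 16. The result is written to vx256,
       which the caller must declare. */
    #define BF16_TO_FLT(ptr)                                   \
        __m128i fp16_i = _mm_loadu_epi16((void const*)ptr);    \
        __m256i fp32_i = _mm256_cvtepu16_epi32(fp16_i);        \
        fp32_i = _mm256_slli_epi32(fp32_i, 0x10);              \
        vx256 = (__m256)fp32_i;

EasyRec Inference Optimization: FeatureGenerator

- AVX
  - StringSplit optimization
  - HashMap optimization: MurmurHash, CrcHash (avx), XorHash (avx)
  - RT (tp99): -5%
- Fg as an operator (tensorflow op)
  - Parallel execution
  - Reuses the TensorFlow thread pool
  - Overlap Execution
  - Saves data serialization overhead
  - RT (tp99): -20%, QPS: +20%
- SequenceFeature optimization
  - Item feature cache
  - Packed storage: -80% memory

EasyRec Inference Optimization: FeatureTile

[Flowchart. Node labels: Find candidates, Select, Concat, Sort by Depth, IsTiled (Y/N), Tile]

QPS: +(30%-50%)

EasyRec Inference Optimization: Placement Optimization

- Embedding (CPU): many ops, each with little computation; kernel launch overhead vs. op execution time: (1-10 µs) vs. (5-10 µs)
- Dense (GPU): MatMul is compute-heavy; op execution time (100-2000 µs) far exceeds the kernel launch overhead
- Kernel Launch:
- H2D Memcpy:
- Find Min-Cut

[Graph diagram. Node labels: SplitV, Input, Embedding Lookup, Linear (x4), Concat, MLP, CTR, Min-Cut]

EasyRec Inference Optimization: XLA (dense layer optimization)

[Pipeline diagram. Labels: MarkForCompilationPass, EncapsulateSubgraphsPass, BuildXlaPass, XlaCompiler, AutoCluster, TF2Xla, NVPTXCompiler, Xla2Cuda]

Problems:
- Dynamic shape: RT spikes, compile cache overflow
- Long startup time for service pods

Solutions:
- Warmup + AsyncCompile
- Bucketize + Padding
- Persistent Cache

Fuse elementwise operations: relu, batch_norm, sigmoid,

[Diagram labels: XlaRun, XlaAlign, XlaSlice]

EasyRec Inference Optimization: TRT (dense layer optimization)

GPU: op fusion to reduce kernel launch

[Graph diagram. Node labels: MatMul, BatchNorm, Add, Reshape, Input Tensor, Cast, Batch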
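
The BF16_TO_FLT macro on the embedding slide relies on the fact that a bfloat16 value is the upper 16 bits of an IEEE-754 float32, so widening it is a zero-extend plus a 16-bit left shift. A portable scalar sketch of the same bit manipulation, useful as a reference when checking a vectorized version, might look like the following. The function names `bf16_to_float` and `bf16_to_float_array` are hypothetical and do not come from EasyRec.

```cpp
#include <cstddef>
#include <cstdint>
#include <cstring>

// Hypothetical helper (not an EasyRec API): widen one bfloat16 value,
// given as its raw 16 bits, to float32.
inline float bf16_to_float(std::uint16_t bits) {
    // Zero-extend to 32 bits and move the payload into the high half,
    // mirroring _mm256_cvtepu16_epi32 followed by _mm256_slli_epi32(.., 0x10).
    std::uint32_t widened = static_cast<std::uint32_t>(bits) << 16;
    float out;
    std::memcpy(&out, &widened, sizeof(out));  // bit cast without aliasing issues
    return out;
}

// Hypothetical helper: convert a contiguous buffer of n bfloat16 values.
inline void bf16_to_float_array(const std::uint16_t* src, float* dst, std::size_t n) {
    for (std::size_t i = 0; i < n; ++i) {
        dst[i] = bf16_to_float(src[i]);
    }
}
```

For example, the bfloat16 bit pattern 0x3F80 widens to 0x3F800000, which is 1.0f. The scalar loop is only a correctness reference; the slide's point is that the AVX version processes eight values per instruction sequence.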
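
The XLA slide lists "Bucketize + Padding" as one remedy for dynamic shapes. One common way to read that is: round each request's batch size up to a small fixed set of bucket sizes and zero-pad the input, so the compiler only ever sees a handful of shapes, then slice the padded rows off the output. The sketch below illustrates that idea under those assumptions; the bucket values and the names `pick_bucket`, `pad_to_bucket`, and `slice_to_batch` are hypothetical, and the slides do not show how EasyRec implements this step.

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// Return the smallest bucket >= batch. `buckets` must be sorted ascending.
// If batch exceeds every bucket, return batch itself (no padding; assumption).
inline std::size_t pick_bucket(std::size_t batch,
                               const std::vector<std::size_t>& buckets) {
    auto it = std::lower_bound(buckets.begin(), buckets.end(), batch);
    return it == buckets.end() ? batch : *it;
}

// Pad a row-major [batch, dim] matrix with zero rows up to the bucket size.
inline std::vector<float> pad_to_bucket(const std::vector<float>& rows,
                                        std::size_t dim,
                                        const std::vector<std::size_t>& buckets) {
    std::size_t batch = (dim == 0) ? 0 : rows.size() / dim;
    std::size_t padded = pick_bucket(batch, buckets);
    std::vector<float> out(rows);
    out.resize(padded * dim, 0.0f);  // appended rows are all zeros
    return out;
}

// Keep only the first `batch` rows of a row-major [padded, dim] output.
inline std::vector<float> slice_to_batch(const std::vector<float>& padded_out,
                                         std::size_t dim,
                                         std::size_t batch) {
    auto end = padded_out.begin() + static_cast<std::ptrdiff_t>(batch * dim);
    return std::vector<float>(padded_out.begin(), end);
}
```

With buckets of 8, 16, and 32, a batch of 5 rows is padded to 8, so repeated requests of sizes 5, 6, and 7 all reuse one compiled shape. The trade-off is wasted computation on the padded rows, which is why bucket spacing matters.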