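Global Optimization of Tensor Data Formats in GPU Inference
Alibaba DAMO Academy, Machine Intelligence
Yuankai Chen (陈元凯)
11/12/2020

Agenda
01 Problem background
02 Data format optimization algorithm
03 Results / future work

Data format
Data format (memory layout): the elements of a tensor can be arranged in memory in different orders.
N: Batch size
C: Channel size
HW: Feature map size
[Figure: the same tensor elements (numbered 1 to 16) laid out in NCHW order and in NC/4HW4 order]
Mainstream frameworks (TF, ONNX, etc.): NCHW
Why are different data formats needed?
Different computation flows have different data read/write patterns.
New-generation GPU hardware: Tensor Cores.

Impact of the data format
10.4.5. Conversion Between NCHW And NHWC
Tensor Cores require that the tensors be in the NHWC data layout. Conversion between NCHW and NHWC is performed when the user requests Tensor Op math. However, as stated in Basics, a request to use Tensor Cores is just that, a request and Tensor Cores may not be used in some cases. The cuDNN library converts between NCHW and NHWC if and only if Tensor Cores are requested and are actually used. If your input (and output) are NCHW, then expect a layout change.
https:/
[Figure: an inference engine drawing on in-house compute libraries, third-party compute libraries, and possibly TVM. Formats shown: NCHW, INT8 NC4HW4, INT8 NC16HW16, INT8 NHWC, FP16 NCHW, FP16 NHWC]

Choosing a data format
Data format and operator performance: the effect of the workload (device: V100, kernel source: cuDNN):
Float16 convolution, 1x3x224x224 * 64x3x7x7: NHWC 0.29 ms, NCHW 0.058 ms
Float16 convolution, 1x128x28x28 * 128x128x3x3: NHWC 0.045 ms, NCHW 0.091 ms
How should the data format be chosen?
Choosing the fastest format for each individual operator != the fastest model overall.
Converting between data formats incurs extra overhead.

1 Example
Data format conversion can be time-consuming.
The best overall combination is not necessarily the fastest choice for each individual op.
This is an algorithmic problem that requires global coordination.
[Figure: two execution plans for conv1 followed by conv2. Recoverable labels: conv1 0.3 ms (NCHW) in both plans; Convert 0.2 ms; conv2 0.5 ms and 0.6 ms; NHWC; Total = 1.0 ms for one plan; the total of the other plan is not legible]

Problem summary
Problem statement
Input: a deep learning model, and the different implementations available for each operator (cuDNN, TVM, etc.)
Output: the data format of every tensor in the model
Objective: the fastest overall inference performance for the model

2 Algorithm background
Optimizing CNN Model Inference on CPUs
Yizhi Liu, Yao Wang, Ruofei Yu, Mu Li, Vin Sharma, Yida Wang
Amazon Web Services
{yizhiliu, wayao,
Abstract
The popularity of Convolutional Neural Network (CNN) models and the ubiquity of CPUs imply that better performance of CNN model inference on CPUs can deliver significant gain
[...] consisting of operations. In practice, people normally use high-performance kernel libraries (e.g. Intel MKL-DNN [27] and OpenBlas [51]) to obtain high performance for CNN operations. While these libraries tune very ca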