NVIDIA
QUANTIZATION METHODS AND TOOLS FOR BERT
Hongxiao Bai, 2020/12

AGENDA
- INT8 inference and quantization: basic concepts
- Quantization and calibration methods: methods to determine scales and weights
- Workflow and network structure: how to quantize a model (BERT)
- Results and further improvement: about accuracy and performance

INT8 INFERENCE AND QUANTIZATION

INT8 INFERENCE
Why we need INT8 inference
- Nowadays, the latency and throughput of inference are critical.
- The complexity of models is increasing exponentially, as with BERT.
- INT8 can speed up inference greatly:
  - Speed up computation: up to 2X peak performance compared with FP16.
  - Speed up memory access: half the bandwidth requirement compared with FP16.

                          NVIDIA T4    NVIDIA A100
  Peak FP16 Tensor Core   65 TFLOPS    312 TFLOPS
  Peak INT8 Tensor Core   130 TOPS     624 TOPS

INT8 INFERENCE
How to do INT8 inference
If we want to do inference in INT8 precision, we need to:
- Do quantization to get a model that can be run in INT8 precision.
  This talk focuses on this step, especially for BERT.
- Do inference with the quantized model.
  See CNS20306, "The INT8 Quantization of FasterTransformer 3.0 Encoder", for the implementation and inference details of BERT INT8 inference in FasterTransformer 3.0.
QUANTIZATION
Why we need quantization
- Weights and activations are float numbers in a small range.
- Unlike FP32 -> FP16, a model cannot be directly cast from FP32/FP16 to INT8:

          Dynamic range             Min positive value
  FP32    -3.4*10^38 ~ 3.4*10^38    1.4*10^-45
  FP16    -65504 ~ 65504            5.96*10^-8
  INT8    -128 ~ 127                1

- Quantization: convert a float model to INT8 without significant accuracy loss.
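To see why a direct cast fails, here is a quick illustration (my own, not from the deck; the weight values are arbitrary): casting truncates typical sub-1.0 weights to zero, while a scaled mapping spreads them over the INT8 range.

    import numpy as np

    weights = np.array([0.003, -0.74, 0.21, 1.2], dtype=np.float32)

    # Direct cast: every value in (-1, 1) truncates to 0 -- information lost.
    print(weights.astype(np.int8))            # [0 0 0 1]

    # Scaled mapping: choose a scale so the observed range fills INT8.
    scale = np.abs(weights).max() / 127.0     # amax / 127
    q = np.clip(np.round(weights / scale), -127, 127).astype(np.int8)
    print(q)                                  # [  0 -78  22 127]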
QUANTIZATION
What is quantization
- Quantize: map the FLOAT values to discrete INT values using linear/non-linear scaling techniques.
- Dequantize: recover FLOAT values from INT values.
- Quantization objective: convert from high precision to low precision with minimal information loss.
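A minimal sketch of the quantize/dequantize pair, assuming the symmetric linear (scale-only) scheme; the helper names below are mine, not an API from the talk:

    import numpy as np

    def quantize(x: np.ndarray, scale: float) -> np.ndarray:
        """Map float values to INT8 using a symmetric linear scale."""
        return np.clip(np.round(x / scale), -127, 127).astype(np.int8)

    def dequantize(q: np.ndarray, scale: float) -> np.ndarray:
        """Recover approximate float values from the INT8 values."""
        return q.astype(np.float32) * scale

    x = np.random.randn(8).astype(np.float32)
    scale = float(np.abs(x).max()) / 127.0    # one simple ("max") choice of scale
    x_hat = dequantize(quantize(x, scale), scale)
    print(np.abs(x - x_hat).max())            # rounding error is at most scale / 2

The information loss here is the rounding error, which shrinks when the scale fits the actual value distribution; how to choose good scales is the calibration problem covered in the next part of the agenda.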