Hammer: A Unified Framework of Model Compression and NAS
Speakers: Xiufeng Xie, Hongmin Xu
AI Platform, Seattle AI Lab, and FeDA Lab, Kwai (快手)
Key contributors: Hongmin Xu, Xiufeng Xie, Jianchao Tan, Yi Guo, Huan Yuan, Jixiang Li, Xiangru Lian, and Ji Liu

#page#

Key Contributors

#page#

Why Do We Need DNN Model Compression?
Example applications:
- Speech recognition
- Beauty photos
- Face unlock
- Online recommendation (inference)

#page#

Goals of Model Compression
- Accuracy
- Computation efficiency
- Energy efficiency

#page#

Energy & Computing Cost A Lot
Example: run an AI application that costs 100M FLOPs
- On server: $1,000 per day
- On phone: $1.8M per day
Source: [1], assuming 300M DAU and 100 videos per user.

#page#

However, Model Size Is Not The Whole Story
Model size/FLOPs, energy, and latency do not track each other; energy and latency depend on the hardware.

#page#

Model Compression Tools
Comparison axes: unified framework, model pruning, model quantization, NAS for compression, hardware awareness.
- Hammer (Kwai)
- PocketFlow (Tencent)
- NNI (Microsoft)
- Distiller (Intel)
- PaddlePaddle (Baidu)
- TensorFlow Lite (Google)
- AIMET (Qualcomm)
- Condensa (Nvidia)
Hammer: joint pruning + quantization + NAS. Others: pruning and/or quantization only.

#page#

Ubiquitous AI Optimization with an All-in-One Framework
maximize    Accuracy
subject to  Latency ≤ latency tolerance        (application requirement)
            Resource usage ≤ resource budget   (hardware resource budget)
Add as many constraints as the user wants.

#page#

Hammer-Supported Compression Strategies
- Pruning
- Quantization
- Neural architecture search (NAS)

#page#

Hammer Is GPU-Hardware-Aware
Search for a DNN structure that best fits the GPU hardware.
Modular booster: make the number of input channels and the number of output channels a multiple of N (N = 8).
Latency: optimize the DNN's running latency on the target GPU.
  Latency profiling → Latency optimizer → Model
Energy: optimize the DNN's energy cost on the target GPU.
  Energy profiling → Energy optimizer → Model
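The "modular booster" idea above can be sketched in a few lines: round every pruned layer's channel counts up to the nearest multiple of N so the resulting shapes map well onto GPU kernels. This is a minimal illustration, not Hammer's actual API; the function names are hypothetical.

```python
def round_up_to_multiple(channels: int, n: int = 8) -> int:
    """Round a channel count up to the nearest multiple of n (N=8 on the slide)."""
    return ((channels + n - 1) // n) * n

def align_channels(channel_plan, n=8):
    """Align every (in_channels, out_channels) pair in a pruning plan to multiples of n."""
    return [(round_up_to_multiple(c_in, n), round_up_to_multiple(c_out, n))
            for c_in, c_out in channel_plan]

# Example: a pruned plan that left awkward channel counts
plan = [(3, 30), (30, 61), (61, 125)]
print(align_channels(plan))  # [(8, 32), (32, 64), (64, 128)]
```

Slightly widening a pruned layer back to an aligned width costs a few extra parameters but typically pays for itself in kernel efficiency on the target GPU.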
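The all-in-one formulation from the earlier slide (maximize accuracy subject to user-supplied budgets) can be sketched as a feasibility filter over candidate compressed models. Everything here is a hypothetical illustration under assumed names (`Candidate`, `select_best`, the metric keys), not Hammer's real interface.

```python
from dataclasses import dataclass, field

@dataclass
class Candidate:
    name: str
    accuracy: float       # validation accuracy of the compressed model
    metrics: dict = field(default_factory=dict)  # measured costs, e.g. {"latency_ms": 12.0}

def select_best(candidates, budgets):
    """Return the highest-accuracy candidate that meets every budget.

    budgets maps a metric name to its maximum allowed value, e.g.
    {"latency_ms": 15.0, "energy_mj": 40.0}; users can add as many
    constraints as they want, matching the slide's formulation.
    """
    feasible = [c for c in candidates
                if all(c.metrics.get(k, float("inf")) <= v
                       for k, v in budgets.items())]
    return max(feasible, key=lambda c: c.accuracy, default=None)

cands = [
    Candidate("pruned-50%", 0.91, {"latency_ms": 10.0, "energy_mj": 30.0}),
    Candidate("pruned-25%", 0.93, {"latency_ms": 18.0, "energy_mj": 45.0}),
    Candidate("quantized-int8", 0.92, {"latency_ms": 12.0, "energy_mj": 25.0}),
]
best = select_best(cands, {"latency_ms": 15.0, "energy_mj": 40.0})
print(best.name)  # quantized-int8
```

Note how the 0.93-accuracy candidate is rejected because it violates the latency budget; the search maximizes accuracy only within the feasible region.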