Unlocking Heterogeneous AI Infrastructure K8s Cluster: Leveraging the Power of HAMi

About us
MengXuan Li, 4Paradigm, GitHub: archlitchi
Xiao Zhang, software engineer, DaoCloud, GitHub: wawa0210

Challenge 1: Requirement for Computing Power is Growing
Figure 1: Training FLOPs trend
Figure 2: NVIDIA Flagship GPU for ML
AI technology has entered the stage of commercialization and requires more and more computing power.
The demand for computing power from large language models is growing extremely fast (375x/year).
To keep up with this growth, GPU manufacturers have rapidly released new GPUs with more computing power and higher prices.

Challenge 2: Low Resource Utilization on GPU
Two factors lead to low utilization of GPU devices in K8s clusters:
GPU resources can only be requested by a container in an exclusive manner.
Typical GPU utilization of a GPU task in Kubernetes: core utilization can be 0 for ...

Challenge 3: The demand for heterogeneous AI devices continues to grow
In addition to NVIDIA GPUs, there are also Cambricon, Hygon, Iluvatar, and Huawei Ascend AI devices.
As the number of AI devices grows, unified orchestration, scheduling, and management become urgent.
Shipments exceeded 1.4 million. NVIDIA accounts for 85%, Huawei 10%, Baidu 2%, and others 2%.

A K8s cluster provides consistent management of multiple heterogeneous AI device nodes (NVIDIA, Cambricon, Hygon, Iluvatar, Huawei Ascend).
Device sharing (or device virtualization) on Kubernetes.
Scena