The 2021 State of AI Infrastructure Survey

All rights reserved to Run:ai. No part of this content may be used without express permission of Run:ai. www.run.ai

This Guide Covers:

Introduction and Key Findings
Large Teams and Big Budgets
  GPU Farm Size and Server Locations
  Size of Research Teams and Access to On-Demand GPU Compute as Needed
  GPU and AI Hardware Utilization and Resource Allocation Issues
  Companies of All Sizes Struggle with Hardware Utilization
  Tools Used to Optimize GPU Allocation Between Users
  Containers and Kubernetes for AI Workloads
Big Plans for AI and Limited Confidence
  Models Making it to Production
  Main Challenges for AI Development
  Plans to Increase GPU Capacity or Additional AI Infrastructure
  Confidence in AI infrastructure Stack Set-up to Build, Train and Move
Demographics
  Country of Residence
  Company Size, Job Functions, Seniority and Industry
Actionable Steps Based on the Key Findings

Introduction

Most research around the state of the Artificial Intelligence (AI) industry talks about the same few facts: AI is still very immature, models rarely make it to production, and challenges
remain for data scientists and research teams around creating the right infrastructure and setting up AI for success. To discover whether these pervasive ideas are still gospel in 2021, we commissioned a survey of 211 data scientists, AI/Machine Learning/IT practitioners and system architects from 10 countries around the world. We spoke primarily with experts from large enterprise companies with over 5,000 employees, and some with as many as 10,000. We asked these enterprises to open up about the technologies they use, the challenges they face with AI and the size of not only their AI budget, but also
their confidence in bringing AI into production. The survey was completed by independent research company Global Surveyz and took place in July 2021.

The results are a fascinating look at the true state of AI maturity. We are working in a market with enormous potential. Three-quarters of those surveyed are looking to expand their AI infrastructure, and 38% have more than $1 million in annual budget to make that happen. However, big challenges still exist, and many companies face early-stage hurdles with AI infrastructure setup, data preparation, and even goal setting. With so much invested in making AI successful, and companies looking to forge ahead and make progress, it's clear that early adopters of the right technology have a lot to gain.

Key Findings

AI is a cloud-native world

AI was clearly born with the cloud in mind, with 81% of companies working cloud-natively (using containers and cloud technologies) for their AI workloads. Along with the use of containers comes adoption of Kubernetes and other cloud-native tools for management of containers. A sizeable 42% of respondents are already on Kubernetes, another 13% on OpenShift, and 2% on Rancher/SUSE. These numbers are considerably higher than container adoption for non-AI workloads, making AI a leader in cloud-native adoption.

Infrastructure challenges weigh heavily on AI teams

Lack of confidence in AI infrastructure extends to hardware utilization, with more than 80% of surveyed companies not fully utilizing their GPU and AI hardware, and 83% of companies admitting to idle resources or only moderate utilization. Only 27% say that GPUs can be accessed on demand by their research teams as needed, with almost half of those who responded relying on manual requests for allocating compute resources.

Big spenders, but a lack of confidence

Our study shows that 38% of companies have a budget of more than $1M per year for AI infrastructure alone, and 59% have more than $250k per year. These huge budgets should indicate high confidence among the companies surveyed that they can get AI models into production. However, our survey found that for 77% of companies, less than half of models make it to production. Further, 88% of companies say that they are not fully confident in their AI infrastructure set-up and aren't sure that they can move their models to production in the timeline and budget provided.

AI is still a relatively immature market

The top challenges for today's AI teams are data collection (61%), infrastructure/compute (42%) and defining business goals (36%). All three of the biggest challenges are early-stage problems for teams working with AI, which speaks to market immaturity. In addition, the tools used to manage infrastructure for AI teams include home-grown tools (23%) and even Excel spreadsheets (16%), again showing that in many ways, AI still lacks maturity.

Budgets are growing, despite challenges

AI challenges are relevant across all respondents, regardless of company size, industry, AI spend, or infrastructure location (cloud, hybrid, or on-premises). Infrastructure utilization is an issue for 85% to 90% of respondents, even among companies that have $10M or more budgeted for AI each year. Despite this, most companies are not limiting their budgets until their challenges are solved, with 74% planning to increase spend on AI infrastructure in the next year.

AI has enormous potential for those who beat the challenges

There is strong pressure on enterprises to launch AI projects and to see value from Artificial Intelligence. While the challenges may still be early-stage issues like goal setting and infrastructure set-up, the spend is far from immature. The financial support is in place to propel AI projects, but it needs to be channelled to the right places: improving the systems used for AI infrastructure management, solving hardware utilization challenges, and supporting research teams in gaining both confidence and access to resources.

Large Teams and Big Budgets Don't Protect from Hardware Utilization Issues

GPU Farm Size and Server Locations

Over half of surveyed companies (53%) have GPU farms of 10 or more GPUs (figure 1), with a full 20% using over 100 GPUs for their AI research. This speaks to the considerable investment that has already been made in AI. Though the move to cloud from on-premises infrastructure is widely discussed, that trend has yet to be fully realized in practice, with two-thirds (64%) hosting their GPUs in the cloud or hybrid and a third running on-prem (figure 2). Over half (53%) already have all or some of their AI applications and infrastructure in the cloud, with another third (34%) planning to move to the cloud in the coming years (figure 3).

Figure 1: Size of GPU Farm (categories: fewer than 10 GPUs, 10-50, 51-100, 100+)
Figure 2: GPU Servers Location (Cloud 42%, Hybrid 22%, On-prem 36%)
Figure 3: Plans for Moving AI Applications and Infrastructure to the Cloud (categories: already on cloud, this year, next year, in five years, no plans)

Size of Research Teams and Access to On-Demand GPU Compute as Needed

Almost two-thirds (63%) of companies have research teams of 10 or more, and yet only 27% of them have solved the need for fully on-demand access to GPU compute. A larger research team doesn't equate to greater accessibility to compute resources.

Over a third (35%) of research teams access GPU compute only via additional steps or static assignment, and almost half of this group (43%) are subject to waiting for approval of their manual requests (figure 5). Every time they want to run a job, they need to make this manual request, slowing down operations and adding frustration and delay.

Figure 4: Size of Deep Learning Research Team
Figure 5: Do Research Teams Have On-Demand Access to GPU Compute? (Yes, GPUs can be automatically accessed by anyone as needed: 27%. Somewhat, GPUs can be automatically accessed within set rules and limits: 38%. No: 35%.)
How GPUs are assigned when not available on demand: by manual request 43%, assigned statically to specific users/jobs 38%, via ticketing system 19%.

GPU and AI Hardware Utilization and Resource Allocation Issues

Issues with GPU/compute resource allocation were reported by 87% of respondents, with 12% saying issues happen often (figure 7). As a result, 83% of surveyed companies are not fully utilizing their GPU and AI hardware. In fact, almost two-thirds (61%) indicated their GPU and AI hardware are mostly at moderate utilization (figure 6).

Figure 6: GPU and AI Hardware Utilization (Close to 100% usage 17%, Mostly at moderate utilization 61%, Mostly idle or with low utilization 22%)
Figure 7: Frequency of Experiencing GPU/Compute Resource Allocation Issues (Never 13%, Rarely 39%, Sometimes 36%, Often 12%)

Companies of All Sizes Struggle with Hardware Utilization

At the high end, 38% of companies have an annual AI infrastructure budget of over $1 million (figure 8). When comparing budget against level of AI hardware utilization (figure 9), we see that companies with smaller budgets of up to $250k suffer the most from having their hardware sitting mostly idle.

Figure 8: Annual AI Infrastructure Budget (Hardware, Software, Cloud) (categories: up to $250k, $250k-$499k, $500k-$999k, $1M-$9.9M, $10M or more)
Figure 9: Annual AI Infrastructure Budget by AI Hardware Utilization (series: close to 100% usage, moderate utilization, mostly idle)

Tools Used to Optimize GPU Allocation Between Users

The majority (72%) of companies are using one or more tools to optimize their GPU allocation between users. From home-grown solutions (23%) to Excel spreadsheets (16%), these are mostly low-tech solutions, especially considering the investment being made.

Figure 10: Tools Used to Optimize GPU Allocation Between Users (categories: home-grown tools; open-source tools such as Determined, Polyaxon, Volcano; Excel spreadsheets; HPC tools like Slurm or LSF; something else; not using any additional tools)

Containers and Kubernetes for AI Workloads

Containers are being used by 81% of companies for their AI workloads (figure 11), with Kubernetes ranking as the #1 container orchestration system, used by 42% of companies (figure 12). These numbers show that AI is born in cloud-native infrastructure and has far greater adoption of cloud than the broader software world. Kubernetes is also ubiquitous among AI practitioners, with companies either using Kubernetes directly or leveraging managed K8s through a third party. The use of orchestration tools shows that these companies are confident and mature in their use of containers.

Figure 11: Use of Containers for AI Workloads (categories: all, most, or some of our AI workloads are using containers; no)
Figure 12: Container Orchestration Tools Used for AI (tools: Kubernetes, Red Hat OpenShift, VMware Tanzu, SUSE Rancher, HPE Ezmeral; series: using, plan to use)

Big Plans for AI, Despite Multiple Challenges and Limited Confidence

Models Making it to Production

Less than half of AI models make it to production for 77% of surveyed companies. Only 10% said 90% of their AI models make it to production. This aligns with the common AI challenges reported in various media outlets. Getting models to production remains an issue of much debate and stymied innovation.

Figure 13: Models Making it to Production (categories: under 10%, 10%-24%, 25%-39%, 40%-49%, 50%-74%, 75%-90%, over 90%)

Main Challenges for AI Development

An overwhelming 96% of companies admit to challenges when it comes to AI development. The top three challenges are data-related (61%), infrastructure/compute-related (42%), or related to defining business goals (36%). These challenges crop up when getting started with AI and reflect the lack of market maturity. Despite the fact that many companies have a $1M budget in place, they still aren't sure how to measure success or how to collect suitable data. Infrastructure set-up is a significant challenge, as companies struggle with visibility and control. In general, the larger the company, the greater the challenges become.

Challenge categories surveyed: data-related challenges (data collection, cleansing, governance, pipelines); infrastructure/compute-related challenges; defining business goals; moving models to production (MLOps challenges); training-related challenges (model building, training time); expense of doing AI; something else; no AI development challenges.

Country of Residence

Figure 17: Country of Residence

Company Size, Job Functions, Seniority and Industry

Figure 18: Company Size
Job functions represented: IT, Systems Admin/Architect, Software Developer, MLOps, DevOps, ML Platforms, ML Engineer, Other
Figure 20: Job Seniority (Team Member 50%, Manager 22%, Director 13%, Other 6%, C Level 4%, VP 3%, Senior Director 2%)
Figure 21: Industry (categories: Technology; Life Sciences; Professional Services; Oil & Energy; Data Infrastructure, Telecom; Defense & Space; Discrete Manufacturing; Process Manufacturing; Retail/eCommerce; Media, Creative Industries; Agriculture, Forestry, Mining; Supply Chain & Logistics; Hospitality, Food, Leisure; Construction; Financial Services; Healthcare; Education; Other)