Thermal-Safe Operation for GPU under Power Constraints Using Reinforcement Learning

Yu-Han Chiu, Tsung-Kuang Liao, Jia-Han Li
National Taiwan University

Jie-Hong Hou, Chien-Er Lai, Shih-Wen Chen, Chao-Ching Ho
National Taipei University of Technology

Hung-Hsuan Lin
Delta Electronics, Inc.

FUTURE TECHNOLOGIES SYMPOSIUM

Outline
1. Introduction
2. Methodology
3. Results and Discussion
4. Conclusion

Introduction

Background
With the rapid growth of machine learning and generative AI workloads in recent years, demand for GPU throughput from both individuals and enterprises has increased significantly.
Delivering greater computational power requires adding more processing units, which in turn raises the design power consumption of GPUs.

Motivation: Cooling Limit
Air cooling in a 1U (1.75-inch-tall) chassis can dissipate only about 250 W, and even a 2U chassis can dissipate only up to 500 W, which is already approaching the limit.
To increase power further, liquid cooling becomes necessary.
Because an AI server needs a quick response, this decreases the thermal buffer time, narrows the reaction window, and increases the risk of overheating shutdowns.

Motivation: Cooling Failure Has Become One of the Major Risks
According to data center incident reports, cooling system failure is the second leading cause of unexpected downtime.
When airflow is obstructed or external cooling is interrupted, GPU temperatures can rise sharply within a short time.

Figure 1. Ratio of Shutdown Causes
Source: D. Donnellan and A. Lawrence. Annual outage analysis 2024: The causes and impacts of IT and data center outages (executive summary). Technical Report 131, Uptime Institute Intelligence, New York, NY, Mar. 2024.

Motivation: Fan Failure Experiment
Throttling at 8