
Thermal-Safe Operation for GPU under Power Constraints Using Reinforcement Learning.pdf

Uploaded by: 明**** | ID: 1011498 | 2025-12-21 | 26 pages | 1.71 MB

Thermal-Safe Operation for GPU under Power Constraints Using Reinforcement Learning

Yu-Han Chiu, Tsung-Kuang Liao, Jia-Han Li (National Taiwan University)
Jie-Hong Hou, Chien-Er Lai, Shih-Wen Chen, Chao-Ching Ho (National Taipei University of Technology)
Hung-Hsuan Lin (Delta Electronics, Inc.)

FUTURE TECHNOLOGIES SYMPOSIUM

Outline
1. Introduction
2. Methodology
3. Results and Discussion
4. Conclusion

Introduction

Background

With the rapid growth of machine learning and generative AI tasks in recent years, the demand for GPU throughput from both individuals and enterprises has increased significantly. Delivering greater computational power requires adding more processing units, which in turn raises the design power consumption of GPUs.

Motivation: Cooling Limit

Air cooling in a 1U-tall (1.75 inch) chassis can dissipate only about 250 W, and even a 2U chassis can dissipate only up to 500 W, which is already approaching the limit. Increasing power further requires liquid cooling. AI servers also need quick response, which shortens the thermal buffer time, narrows the reaction window, and increases the risk of overheating shutdowns.

Motivation: Cooling Failure Has Become One of the Major Risks

According to data center incident reports, cooling system failure is the second leading cause of unexpected downtime. When airflow is obstructed or external cooling is interrupted, GPU temperatures can rise sharply within a short time.

Figure 1. Ratio of Shutdown Causes. D. Donnellan and A. Lawrence. Annual outage analysis 2024: The causes and impacts of IT and data center outages (executive summary). Technical Report 131, Uptime Institute Intelligence, New York, NY, Mar. 2024.

Motivation: Fan Failure Experiment

Throttling at 8

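Based on the article, the main content can be summarized as follows:

1. **Background and motivation**: With the rapid growth of machine learning and generative AI tasks, demand for GPU throughput has increased significantly. GPU power design faces corresponding challenges, in particular cooling limits and the risk of cooling system failure.
2. **Method**: Reinforcement learning is used to optimize thermal-safe GPU operation under power constraints, balancing temperature safety against performance by controlling the power limit (PL).
3. **Experimental results**:
   - Under both normal and cooling-failure conditions, the reinforcement learning (RL) controller responds quickly while holding the core temperature within ±1°C of the target.
   - Compared with a PID controller, the RL controller shows faster dynamic control in cooling-failure and recovery scenarios, reducing settling time by 56%.
   - The RL controller also controls temperature and power effectively under extreme conditions, reducing power fluctuation from 30 W to below 10 W.
4. **Conclusion**: The RL controller improves GPU performance and stability while maintaining thermal safety.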
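The quantities reported in the summary (a ±1°C band around the target temperature, settling time, and power fluctuation) can be made concrete with a small simulation. The sketch below is not the authors' RL controller and uses none of their data. It assumes a hypothetical first-order thermal model and an incremental PI controller acting on the power limit, standing in for the kind of PID baseline the summary mentions. The names `ThermalModel`, `simulate_cooling_failure`, `settling_time`, and `peak_to_peak`, and every numeric parameter, are illustrative assumptions.

```python
from dataclasses import dataclass


@dataclass
class ThermalModel:
    """Hypothetical lumped model: C * dT/dt = P - (T - T_amb) / R."""
    r_th: float = 0.2    # thermal resistance in K/W (assumed value)
    c_th: float = 50.0   # thermal capacitance in J/K (assumed value)
    t_amb: float = 30.0  # ambient temperature in deg C (assumed value)
    temp: float = 80.0   # current core temperature in deg C

    def step(self, power_w: float, dt: float = 1.0) -> float:
        """Advance the model by dt seconds with explicit Euler integration."""
        heat_out = (self.temp - self.t_amb) / self.r_th
        self.temp += dt * (power_w - heat_out) / self.c_th
        return self.temp


def settling_time(trace, target, band=1.0, dt=1.0):
    """Time after which every sample stays within +/- band of target.

    Returns None if the trace is empty or its last sample is outside the band.
    """
    last_outside = -1
    for i, value in enumerate(trace):
        if abs(value - target) > band:
            last_outside = i
    if last_outside == len(trace) - 1:
        return None
    return (last_outside + 1) * dt


def peak_to_peak(trace):
    """Fluctuation of a trace, measured as max minus min."""
    return max(trace) - min(trace)


def simulate_cooling_failure(steps=300, target=80.0, kp=10.0, ki=1.0,
                             pl_min=100.0, pl_max=300.0, dt=1.0):
    """Incremental PI control of the power limit after cooling degrades.

    Cooling failure is modeled as r_th rising from 0.2 to 0.3 K/W at t = 0,
    while the GPU starts at the target temperature with a 250 W power limit.
    """
    model = ThermalModel(r_th=0.3, temp=target)  # degraded cooling (assumed)
    power_limit, prev_err = 250.0, 0.0
    temps, powers = [], []
    for _ in range(steps):
        err = target - model.temp  # negative when the core is too hot
        power_limit += kp * (err - prev_err) + ki * err * dt
        power_limit = min(max(power_limit, pl_min), pl_max)  # clamp to limits
        prev_err = err
        temps.append(model.step(power_limit, dt))
        powers.append(power_limit)
    return temps, powers


if __name__ == "__main__":
    temps, powers = simulate_cooling_failure()
    print("settling time (s):", settling_time(temps, 80.0))
    print("peak temperature (deg C):", round(max(temps), 2))
    print("steady-state power ripple (W):", round(peak_to_peak(powers[-50:]), 3))
```

Comparing two controllers on these metrics would mean running each on the same disturbance and comparing the `settling_time` and `peak_to_peak` values. On real hardware the power limit would be applied through the vendor's management interface rather than a simulated model.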