Thermal-Safe Operation for GPU under Power Constraints Using Reinforcement Learning.pdf

Uploaded by: 明**** | ID: 1011498 | 2025-12-21 | PDF | 26 pages | 1.71 MB

This report belongs to the collection: 2025 OCP APAC Summit speaker presentation slides



Thermal-Safe Operation for GPU under Power Constraints Using Reinforcement Learning

Yu-Han Chiu, Tsung-Kuang Liao, Jia-Han Li
National Taiwan University
Jie-Hong Hou, Chien-Er Lai, Shih-Wen Chen, Chao-Ching Ho
National Taipei University of Technology
Hung-Hsuan Lin
Delta Electronics, Inc.

FUTURE TECHNOLOGIES SYMPOSIUM

Outline
1. Introduction
2. Methodology
3. Results and Discussion
4. Conclusion

Introduction

Background
With the rapid growth of machine learning and generative AI tasks in recent years, the demand for GPU throughput from both individuals and enterprises has significantly increased.
To deliver greater computational power, it has become necessary to add more processing units, which in turn raises the power consumption design of GPUs.

Motivation: Cooling Limit
Air cooling in a tall 1U (1.75 inches) chassis can dissipate only about 250 W, and even in a 2U chassis, only up to 500 W, already approaching the limit.
To further increase power, liquid cooling becomes necessary.
As it needs quick response for AI server, it decreases the thermal buffer time, narrows the reaction window, and increases the risk of overheating shutdowns.

Motivation: Cooling Failure Has Become One of the Major Risks
According to data center incident reports, cooling system failure is the second leading cause of unexpected downtime.
When airflow is obstructed or external cooling is interrupted, GPU temperatures can rise sharply within a short time.

Figure 1. Ratio of Shutdown Causes
Source: D. Donnellan and A. Lawrence. Annual outage analysis 2024: The causes and impacts of IT and data center outages (executive summary). Technical Report 131, Uptime Institute Intelligence, New York, NY, Mar. 2024.

Motivation: Fan Failure Experiment
Throttling at 8


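Based on the report, its main content can be summarized as follows:

1. **Background and motivation**: With the rapid growth of machine learning and generative AI tasks, demand for GPU throughput has increased significantly, but GPU power design also faces challenges, especially cooling limits and the risk of cooling system failure.
2. **Method**: Reinforcement learning is used to optimize thermal-safe GPU operation under power constraints, controlling the power limit (PL) to balance temperature safety against performance.
3. **Experimental results**:
   - Under both normal and cooling-failure conditions, the reinforcement learning (RL) controller responds quickly while keeping the core temperature within ±1°C of the target.
   - Compared with a PID controller, the RL controller shows faster dynamic control in cooling-failure and recovery scenarios, shortening the settling time by 56%.
   - The RL controller also controls temperature and power effectively under extreme conditions, reducing power fluctuation from 30 W to below 10 W.
4. **Conclusion**: The RL controller improves GPU performance and stability while maintaining thermal safety.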
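The summary reports results as a settling time and a ±1°C band around the target temperature. As a minimal sketch of how such a metric could be computed from a logged temperature trace, the Python below finds the earliest time after which every sample stays inside the band. The function names, the trace format, and the sample values are hypothetical illustrations; the report's preview does not show how the authors measured settling time.

```python
from typing import Optional, Sequence


def settling_time(times: Sequence[float], temps: Sequence[float],
                  target: float, band: float = 1.0) -> Optional[float]:
    """Earliest time after which all samples stay within target +/- band.

    Returns None if the trace ends outside the band (never settles).
    """
    if len(times) != len(temps):
        raise ValueError("times and temps must have the same length")
    settled_at: Optional[float] = None
    for t, temp in zip(times, temps):
        if abs(temp - target) <= band:
            # Start of a candidate settled interval; keep the first entry time.
            if settled_at is None:
                settled_at = t
        else:
            # Left the band again, so any earlier candidate is invalid.
            settled_at = None
    return settled_at


def reduction_percent(baseline: float, improved: float) -> float:
    """Relative reduction of `improved` versus `baseline`, in percent."""
    if baseline <= 0:
        raise ValueError("baseline must be positive")
    return 100.0 * (baseline - improved) / baseline


if __name__ == "__main__":
    # Hypothetical trace (seconds, degrees C); not data from the report.
    times = [0, 1, 2, 3, 4, 5]
    temps = [90.0, 84.0, 81.5, 80.8, 80.3, 80.1]
    print(settling_time(times, temps, target=80.0))
```

Comparing two controllers then reduces to calling `settling_time` on each trace and passing the two values to `reduction_percent`. The check deliberately resets whenever a sample leaves the band, so a brief early pass through the target does not count as settled.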
