Enablement of Liquid Cooled GB200 At Scale
Wenying Zhang, Cheng Chen
Meta Platforms, Inc.

ARTIFICIAL INTELLIGENCE (AI)

This presentation introduces how Meta managed to deploy LC GB200 at scale, and our vision for future LC
hardware. The following topics will be discussed:

Preview
  Goals and Challenges
  Liquid Cooling Design
  Leakage Handling
  Ship Wet or Dry
  Risk Reduction
  Results
  Future Trend
  Need of Community Work

Goals and Challenges
Goals
  Time to production: deploy X racks within a shortened NPI cycle
  Efficiency: minimize the power spent on cooling
  Reliability: limit the disruption caused by leakage events
Challenges
  First time for many things, at such scale
  Product validation, build and process development in parallel
  Everything could break; every leak could be false

Liquid Cooling Design
  Deploy with AALC at 2:1 ratio
    No site constraint
    Better efficiency
  Multiple inlets/outlets
    Lower pressure drop
    Lower erosion risk
  Rope type Leak Sensor
    Low lead time concern
    Lower false alarm risk
    Better serviceability
  AALC: air assisted liquid cooling; L2A side car HX

Liquid Cooling Design
  [Diagram labels: AALC #1, AALC #2, GB200 Rack]
  [Diagram labels: RMC, Compute Trays, Switch Trays, Heat Exchanger, RPU, RMF-Cold, RMF-Hot]
  Dual rack linked
  6x racks bundle
  Air cooled data center

Leakage Handling
Considerations
  Everything could leak, but we don't know where or how frequently
  Every detection and protection mechanism could break, but we could build redundancy
  Prefer hardware mechanisms over network, software, or manual intervention
  Too late to update the BMS, but never too late for the RMC
  BMS: building management system
  RMC: rack management controller

  RMC is the center of leakage handling
  Manages both leak detection and response
  All connections are hard wired
  Multiple tiers of leakage detections
  Operate with