AI Hardware & Systems

Mastering AI Cluster Management
Phil Pokorny, CTO, Penguin Solutions

Consider this
For an organization to make effective use of an AI cluster, it is important to take into consideration the entire process of designing, building, deploying and managing the resource. At each step, a cluster for AI presents new and different challenges that even experienced IT team members may not have encountered before.

Agenda
Explore AI clusters from design to daily management
- Key considerations when designing an AI cluster
  - Cooling
  - Power
- How software complexities factor into cluster operations
  - Cluster start up
  - Day to day management
  - Change control

© 2024 Penguin Solutions. All Rights Reserved.

POWER IS THE FOUNDATION FOR AI
- CPU Rack (40 Nodes): 10 kW
- A100 Rack (2 Nodes): 13 kW
- H100 Rack (2 Nodes): 22 kW
- H100 Rack (4 Nodes): 44 kW
- B100 Rack (GTC): 120 kW

Air cooling limitations and the move to liquid or immersion cooling
- Commonly accepted limit of 30 kW per rack on air
  - Requires careful hot/cold aisle sealing
  - Wider cold and hot aisles for more airflow
- Liquid cooling can handle 10x more power per rack
- Penguin offers a variety of server designs
  - Traditional air cooled
  - Direct to chip liquid cooled
  - "Born to be immersed" designs
- Custom designed servers and cooling
  - 100% capture, 80/20 or 75/25 liquid/air cooling
  - Material compatibility requires choosing a complete solution

Power and the need to move to 240 or 277V
- 120/208V is a problem for high-power servers
  - Existing data center infrastructure
  - Low current connectors (30 or 50 Amp)
- 240/415V is an obvious first consideration
  - Twice the power for the same current rating
  - Compatible with all international AC power inputs
- 277/480V