Unified AIOps for Remote Management of Heterogeneous Open-Source AI Systems
Muthukkumaran Ramalingam, Rami Radi

ARTIFICIAL INTELLIGENCE (AI)

Abstract
Modern AI infrastructure demands intelligent coordination across open-source firmware, diverse hardware platforms, and fragmented telemetry sources. This unified AIOps framework leverages open standards like Redfish and IPMI, along with OpenBMC, to enable remote management of GPU-intensive environments at scale. It provides fine-grained access to system metrics such as GPU utilization, thermal states, and cooling performance. Multi-vendor telemetry is normalized through AIOps pipelines to enable predictive analytics and automated remediation workflows, ranging from firmware patching to adaptive cooling control. An AI-powered chatbot interface simplifies operations through natural language interaction. Built in alignment with OCP's principles, this scalable approach enhances performance, reduces carbon impact, and supports resilient AI infrastructure management.

AI Infrastructure Components
Compute: Servers, Storage, GPUs
Power & Cooling: PDUs, Power Shelves, CDUs
Interconnect: NVLink Switches, Fabrics, Racks
Management & Control: BMCs, Firmware, Monitoring APIs

Challenges
Managing heterogeneous environments
Handling huge amounts of telemetry data
Exposing the data in different ways
Vendor-locked solutions

Unified AIOps Model
Telemetry Collection: multi-vendor data from servers, GPUs, CDUs, and PDUs
AIOps Pipeline: normalize and ingest via Redfish/APIs
Predictive Analytics: detect hot spots, GPU issues, and inefficiencies
Policy-Driven Remediation: patch firmware, migrate workloads, adapt cooling
Closed-Loop Optimizations: optimize performance per watt, reduce carbon impact

Overall Workflow
Component N
Component 3
Component