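AI-Driven Troubleshooting in Multi-Cluster Environments
闫猛 (Meng Yan), Software Engineer, Red Hat

Content
01 Multi-Cluster Management: OCM Overview
02 Introduction to Agents
03 Agent Design for Multi-Cluster Environments
04 Demo

Part 01
Multi-Cluster Management: OCM Overview

Open Cluster Management: a multi-cluster management platform
- Kubernetes Multi-Cluster Orchestration: CNCF Sandbox Project
- Architecture: Hub-Spoke, derived from the Hub-Kubelet pattern in Kubernetes and aligned with its native design
- Scalability: offloads workload to Spoke clusters, whose agents pull from the Hub
- Robustness: Klusterlet and Hub operate independently and autonomously
- Modularity and Extensibility: pluggable design for customization and further development
- Example: Placement enables dynamic cluster selection and can be extended or replaced for advanced orchestration
- More Detail: Open Cluster Management Document

Part 02
Introduction to Agents

Agents: ABM, ML, LLM
[Diagram labels: intelligent simulation; policy learning; deep learning (high-dimensional decision-making); Rule-Based Agent; Heuristic Agent; Deep Reinforcement Learning Agent]

Agents in GenAI: LLM
- ReAct: Synergizing Reasoning and Acting in Language Models (2022)
- MemGPT: Towards LLMs as Operating Systems (2023)
- Retrieval-Augmented Generation (2020)
[Diagram labels: learned experience (Memory); domain expertise (Model); domain expertise (Search); Obs; reasoning (CoT); interaction (Action); Env; Act]

Part 03
Agent Design for Multi-Cluster Environments
Open Cluster Management + Multi-Agent Modeling

Designing Agents for Multi-Cluster Environments

Motivation
- When a failure occurs in production across multiple clusters, specialist engineers may be unable to respond promptly because of time-zone differences and similar constraints.
- Engineers with some background knowledge can use an Agentic Workflow to diagnose and recover from failures in real time, improving operational efficiency and system stability.

Challenges of LLM applications
- Accuracy: hallucination can lead to wrong decisions
- Domain knowledge: requires real-time information and specialist expertise
- Security: operation permissions must be strictly controlled to prevent misuse

Solutions
- Improving accuracy: ReAct (CoT), Multi-Agent System, Model Temperature, Model Type
- Strengthening domain knowledge: Runbook, Search, RAG
- Ensuring security: Action Permission Control; cluster resource information is read from snapshots of production logs

Designing Agents for Multi-Cluster Environments
Question 1: How to Interact with Multiple Kubernetes Environments?
- Engineer: analyzes the user's intent and interacts with the clusters
- Multicluster MCP Server: a bridge between Open Cluster Management and GenAI
- kubectl interpreter: performs create, delete, read, and update operations on resources
- OCM-ManagedServiceAccount: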