Topology-Aware Scheduling for Large-Scale AI Workloads in Diverse Network Clusters Using Volcano
Xiaodong Ye, Moore Threads
Yu Zhou, Moore Threads

Contents:
01 Background
02 Technical Details
03 Demo
04 Future Work

Challenges in Large-Scale GPU Cluster Operations
We built MTT KUAE! (10,000+ GPU cluster)
- Compute & Communication Efficiency
  - Massive parallelism with over 10,000 GPUs
  - Overlapping computation and communication
  - Optimized strategies: data, pipeline, tensor, and sequence parallelism
- Stability & Fault Tolerance
  - High risk of node/GPU/NIC failures during long-running jobs
  - Robust health checks, checkpointing, and quick recovery strategies
- Data Pipeline & I/O Bottlenecks
  - Avoiding redundant data reads
  - Asynchronous preprocessing and shared-memory usage
- Network & Resource Scheduling
  - High-bandwidth network topology design and adaptive routing
  - Balancing large jobs vs. small jobs in a multi-tenant environment

System-Level (Cluster Management) Challenges
- Scheduling & Resource Allocation
  - Gang scheduling for multi-node, multi-GPU jobs
  - Handling diverse workloads (large jobs vs. small jobs)
- Health Monitoring & Fault Isolation
  - Periodic health checks for GPUs, PCIe, network, and storage
  - Automated job requeue and node remediation to prevent cascading failures
- IB/RoCE Network Optimization
  - Designing efficient network topology and communication patterns
  - Adaptive routing and congestion control to maintain low latency
- Scalability & Maintenance
  - Monitoring key metrics (MFU, ETTR, Goodput)
  - Ensuring cluster expansion does not compromise reliability
*From Meta paper

Business-Level (Training/Inference) Challenges
- Training Efficiency & Model Convergence
  - Coordinating large-scale parallelism without compromising stability
  - Mitigating overhead from fault recovery and checkpointing
- Inference Service Performance
  - Real-time, low-latency inference in distributed settings
  - Balancing trainin
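
The gang-scheduling requirement above (all pods of a multi-node, multi-GPU job start together or not at all) maps directly onto a Volcano Job. The sketch below is illustrative, not a manifest from the talk: the job name, queue, image, and resource sizes are placeholder assumptions, and the `networkTopology` field is an assumption based on the network-topology-aware scheduling feature in newer Volcano releases.

```yaml
# Hypothetical Volcano Job illustrating gang scheduling.
# Names, image, and sizes are placeholders, not from the presentation.
apiVersion: batch.volcano.sh/v1alpha1
kind: Job
metadata:
  name: dist-training-demo
spec:
  schedulerName: volcano     # hand the job to the Volcano scheduler
  minAvailable: 4            # gang constraint: schedule all 4 pods together, or none
  queue: default
  plugins:
    ssh: []                  # inject SSH keys for MPI-style multi-node launch
    svc: []                  # headless service for pod-to-pod discovery
  # networkTopology is an assumption: newer Volcano releases with
  # network-topology-aware scheduling use a field of this shape to keep
  # the gang within a low network tier (e.g. one switch block).
  networkTopology:
    mode: hard
    highestTierAllowed: 1
  tasks:
    - replicas: 4
      name: worker
      template:
        spec:
          restartPolicy: Never
          containers:
            - name: trainer
              image: example.com/trainer:latest   # placeholder image
              resources:
                limits:
                  nvidia.com/gpu: 8               # assumed 8 GPUs per node
```

With `minAvailable: 4`, Volcano's gang plugin holds all four worker pods in a PodGroup until the cluster can place every one of them, which avoids the partial-allocation deadlocks that plain per-pod scheduling causes for multi-node training jobs.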