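How to Use k8s Cluster Resources Stably and Efficiently
Li He, Cloud Native Technical Expert, Shopee

Who we are
Leading e-commerce platform in Southeast Asia, Taiwan, and Brazil
Google Play
No. 1 among all shopping apps in total time spent by users
No. 2 among all shopping apps in average monthly active users
No. 5 best brand among all apps
Strong brand awareness, continued growth

About the speaker
kubernetes, karmada member
2016 to present, since v1.4
Cluster management, orchestration and scheduling, resource utilization optimization
GitHub ID: likakuli
Subscription account: 云原生散修
blog: https:/

Agenda
01 Data-driven: waste quantification, risk quantification, Insight Store
02 Capability enhancement: scheduling, rescheduling
03 Differentiated colocation: time-zone-based colocation, differentiated-SLO colocation
04 Elastic scaling: resource prediction, CA, HPA

Kubernetes in Shopee
10+ data centers
200+ clusters
20K+ nodes
500K Pods

Data-driven
Waste quantification, risk quantification, Insight Store

Data-driven: waste quantification
https://blog.betacat.io/post/2023/05/explain-latency-and-utilization-using-queueing-theory/
100%?
[Figure: numbered items 1 to 4; recoverable labels: Fragmentation, Arch, Buffer]

Waste quantification: kluster-capacity
ce: capacity estimation
ss: scheduling simulation
cc: cluster compression
sa: risk analysis

Risk quantification
kluster-capacity sa -thresholds=50,60,75,70,75,80 -snapshot=simulationresult.json -metric-url=https://prometheus.url -range-start=2024-04-04 00:00:00 -range-end=2024-04-04 23:59:59 -step=60 -g=100

Insight Store

Capability enhancement
Scheduling: Orgnizer
Rescheduling

Rescheduling
Safety: rate limiting at the workload, node, ns, and cluster levels; a global blocklist plus a specific Annotation to disable eviction
Real-time: the eunomia agent sets an annotation on hotspot nodes and the descheduler watches for node changes; periodic execution plus real-time triggering
Performance: 3K+ nodes, 50k+ pods; per-cycle P99: 5m+, 5s
Prediction: the eunomia agent predicts short-term changes in node load and smooths out spikes

Colocation
Time-zone-based colocation
for every usage class: sum(usage of all pods) ≤ node.Allocatable * safety threshold

Differentiated-SLO colocation
ProdGuaranteed
  Reserved CPUSet
  CPU and memory NUMA alignment
  Unconditionally suppress Mid, Batch
  Highly critical services, control plane components
ProdBurstable
  Share CPUs with other ProdBurstable
  Unconditionally suppress Mid, Batch
  Stateless web servers
ProdRelaxed
  Suppress Mid, Batch
  DaemonSet services
Mid
  Relatively stable resources
  Suppress Batch
  Internal web services, non-business critical services
Batch
  Dyna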
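
The suppression relationships among the SLO classes listed above can be written down as a small lookup table. This is only a minimal sketch for illustration, not Shopee's implementation: the names `Suppression`, `SUPPRESSION`, `may_suppress`, and `is_unconditional` are hypothetical, and the Batch entry is left empty as an assumption because the Batch row is cut off in the source.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Suppression:
    """Which lower classes a class may suppress, and whether unconditionally."""
    targets: frozenset
    unconditional: bool


# Transcribed from the class list; only the Batch entry is an assumption
# (its row is truncated in the source, so it is modeled as suppressing nothing).
SUPPRESSION = {
    "ProdGuaranteed": Suppression(frozenset({"Mid", "Batch"}), True),
    "ProdBurstable": Suppression(frozenset({"Mid", "Batch"}), True),
    "ProdRelaxed": Suppression(frozenset({"Mid", "Batch"}), False),
    "Mid": Suppression(frozenset({"Batch"}), False),
    "Batch": Suppression(frozenset(), False),
}


def may_suppress(aggressor: str, victim: str) -> bool:
    """Return True if pods of class `aggressor` may suppress pods of class `victim`."""
    return victim in SUPPRESSION[aggressor].targets


def is_unconditional(aggressor: str, victim: str) -> bool:
    """Return True if the suppression is listed as unconditional."""
    rule = SUPPRESSION[aggressor]
    return victim in rule.targets and rule.unconditional


if __name__ == "__main__":
    for cls in SUPPRESSION:
        victims = sorted(c for c in SUPPRESSION if may_suppress(cls, c))
        print(cls, "->", victims or "none")
```

Keeping the rules in one table makes the ordering explicit: every Prod class may suppress Mid and Batch, Mid may suppress only Batch, and no class in the table suppresses a Prod class.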