I'm using AWS EKS with t3.medium instances, so each node has 2 vCPU (2000m of CPU) and 4 GB of RAM.
There are 6 different applications running on the node(s), defined with these CPU requests:
name    request  replicas  total-cpu
app#1   300m     x2        600m
app#2   100m     x4        400m
app#3   150m     x1        150m
app#4   300m     x1        300m
app#5   100m     x1        100m
app#6   150m     x1        150m
Doing the basic math, all of the applications together request 1700m of CPU. On top of that, I have HPAs with a 60% CPU target for app#1 and app#2. So I expected a single node, or maybe two (because of the kube-system pods), but the cluster always runs 3 nodes. It looks like I'm misunderstanding how autoscaling works.
$ kubectl top nodes
NAME CPU(cores) CPU% MEMORY(bytes) MEMORY%
ip-*.eu-central-1.compute.internal 221m 11% 631Mi 18%
ip-*.eu-central-1.compute.internal 197m 10% 718Mi 21%
ip-*.eu-central-1.compute.internal 307m 15% 801Mi 23%
As you can see, only 10-15% of each node is actually used. How can I optimize node scaling? Why are there 3 nodes?
$ kubectl get hpa
NAME REFERENCE TARGETS MINPODS MAXPODS REPLICAS AGE
app#1 Deployment/easyinventory-deployment 37%/60% 1 5 3 5d16h
app#2 Deployment/poolinventory-deployment 64%/60% 1 5 4 4d10h
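For reference, HPAs with the targets shown above could have been created with something like the following (the deployment names and min/max values are read from the kubectl get hpa output; this is only a sketch of how such HPAs are typically defined):

# HPA for app#1: 1-5 replicas, targeting 60% of the pod's CPU request
kubectl autoscale deployment easyinventory-deployment --cpu-percent=60 --min=1 --max=5
# HPA for app#2
kubectl autoscale deployment poolinventory-deployment --cpu-percent=60 --min=1 --max=5

Note that the 60% target is measured against each pod's CPU request, not against the node's capacity.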
Update #1
I have pod disruption budgets for the kube-system pods:
kubectl create poddisruptionbudget pdb-event --namespace=kube-system --selector k8s-app=event-exporter --max-unavailable 1
kubectl create poddisruptionbudget pdb-fluentd --namespace=kube-system --selector k8s-app=fluentd-gcp-scaler --max-unavailable 1
kubectl create poddisruptionbudget pdb-heapster --namespace=kube-system --selector k8s-app=heapster --max-unavailable 1
kubectl create poddisruptionbudget pdb-dns --namespace=kube-system --selector k8s-app=kube-dns --max-unavailable 1
kubectl create poddisruptionbudget pdb-dnsauto --namespace=kube-system --selector k8s-app=kube-dns-autoscaler --max-unavailable 1
kubectl create poddisruptionbudget pdb-glbc --namespace=kube-system --selector k8s-app=glbc --max-unavailable 1
kubectl create poddisruptionbudget pdb-metadata --namespace=kube-system --selector app=metadata-agent-cluster-level --max-unavailable 1
kubectl create poddisruptionbudget pdb-kubeproxy --namespace=kube-system --selector component=kube-proxy --max-unavailable 1
kubectl create poddisruptionbudget pdb-metrics --namespace=kube-system --selector k8s-app=metrics-server --max-unavailable 1
#source: https://gist.github.com/kenthua/fc06c6ea52a25a51bc07e70c8f781f8f
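To confirm the budgets exist (and are not blocking eviction of the kube-system pods), they can be listed with:

kubectl get pdb --namespace=kube-system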
Update #2
I noticed that the 3rd node is not always up: Kubernetes scales down to 2 nodes, but a few minutes later it scales back up to 3, then down to 2 again, over and over. kubectl describe node shows:
# Node 1
Allocated resources:
(Total limits may be over 100 percent, i.e., overcommitted.)
Resource Requests Limits
-------- -------- ------
cpu 1010m (52%) 1300m (67%)
memory 3040Mi (90%) 3940Mi (117%)
ephemeral-storage 0 (0%) 0 (0%)
attachable-volumes-aws-ebs 0 0
# Node 2
Allocated resources:
(Total limits may be over 100 percent, i.e., overcommitted.)
Resource Requests Limits
-------- -------- ------
cpu 1060m (54%) 1850m (95%)
memory 3300Mi (98%) 4200Mi (125%)
ephemeral-storage 0 (0%) 0 (0%)
attachable-volumes-aws-ebs 0 0
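The Requests percentages above count what pods reserve, not what they actually use. To see which pods account for the memory requests on a given node, a listing like this can help (replace the node-name placeholder with one of the nodes from kubectl top nodes):

kubectl get pods --all-namespaces \
  --field-selector spec.nodeName=<node-name> \
  -o custom-columns='NAMESPACE:.metadata.namespace,NAME:.metadata.name,CPU_REQ:.spec.containers[*].resources.requests.cpu,MEM_REQ:.spec.containers[*].resources.requests.memory'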
Update #3
I0608 11:03:21.965642 1 static_autoscaler.go:192] Starting main loop
I0608 11:03:21.965976 1 utils.go:590] No pod using affinity / antiaffinity found in cluster, disabling affinity predicate for this loop
I0608 11:03:21.965996 1 filter_out_schedulable.go:65] Filtering out schedulables
I0608 11:03:21.966120 1 filter_out_schedulable.go:130] 0 other pods marked as unschedulable can be scheduled.
I0608 11:03:21.966164 1 filter_out_schedulable.go:130] 0 other pods marked as unschedulable can be scheduled.
I0608 11:03:21.966175 1 filter_out_schedulable.go:90] No schedulable pods
I0608 11:03:21.966202 1 static_autoscaler.go:334] No unschedulable pods
I0608 11:03:21.966257 1 static_autoscaler.go:381] Calculating unneeded nodes
I0608 11:03:21.966336 1 scale_down.go:437] Scale-down calculation: ignoring 1 nodes unremovable in the last 5m0s
I0608 11:03:21.966359 1 scale_down.go:468] Node ip-*-93.eu-central-1.compute.internal - memory utilization 0.909449
I0608 11:03:21.966411 1 scale_down.go:472] Node ip-*-93.eu-central-1.compute.internal is not suitable for removal - memory utilization too big (0.909449)
I0608 11:03:21.966460 1 scale_down.go:468] Node ip-*-115.eu-central-1.compute.internal - memory utilization 0.987231
I0608 11:03:21.966469 1 scale_down.go:472] Node ip-*-115.eu-central-1.compute.internal is not suitable for removal - memory utilization too big (0.987231)
I0608 11:03:21.966551 1 static_autoscaler.go:440] Scale down status: unneededOnly=false lastScaleUpTime=2020-06-08 09:14:54.619088707 +0000 UTC m=+143849.361988520 lastScaleDownDeleteTime=2020-06-06 17:18:02.104469988 +0000 UTC m=+36.847369765 lastScaleDownFailTime=2020-06-06 17:18:02.104470075 +0000 UTC m=+36.847369849 scaleDownForbidden=false isDeleteInProgress=false scaleDownInCooldown=false
I0608 11:03:21.966578 1 static_autoscaler.go:453] Starting scale down
I0608 11:03:21.966667 1 scale_down.go:785] No candidates for scale down
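The memory utilization figure in these lines is requested memory divided by the node's allocatable memory, not actual usage. Working backwards from the describe output above, each t3.medium node has roughly 3.3Gi allocatable, so 3040Mi / ~3343Mi ≈ 0.91 and 3300Mi / ~3343Mi ≈ 0.99, both well above the cluster autoscaler's default scale-down utilization threshold of 0.5, which is why neither node is a removal candidate.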
Update #4
According to the autoscaler logs, it was skipping ip-*-145.eu-central-1.compute.internal for scale-down for some reason, so I wanted to see what would happen and terminated the instance directly from the EC2 console. These lines showed up in the autoscaler logs:
I0608 11:10:43.747445 1 scale_down.go:517] Finding additional 1 candidates for scale down.
I0608 11:10:43.747477 1 cluster.go:93] Fast evaluation: ip-*-145.eu-central-1.compute.internal for removal
I0608 11:10:43.747540 1 cluster.go:248] Evaluation ip-*-115.eu-central-1.compute.internal for default/app2-848db65964-9nr2m -> PodFitsResources predicate mismatch, reason: Insufficient memory,
I0608 11:10:43.747549 1 cluster.go:248] Evaluation ip-*-93.eu-central-1.compute.internal for default/app2-848db65964-9nr2m -> PodFitsResources predicate mismatch, reason: Insufficient memory,
I0608 11:10:43.747557 1 cluster.go:129] Fast evaluation: node ip-*-145.eu-central-1.compute.internal is not suitable for removal: failed to find place for default/app2-848db65964-9nr2m
I0608 11:10:43.747569 1 scale_down.go:554] 1 nodes found to be unremovable in simulation, will re-check them at 2020-06-08 11:15:43.746773707 +0000 UTC m=+151098.489673532
I0608 11:10:43.747596 1 static_autoscaler.go:440] Scale down status: unneededOnly=false lastScaleUpTime=2020-06-08 09:14:54.619088707 +0000 UTC m=+143849.361988520 lastScaleDownDeleteTime=2020-06-06 17:18:02.104469988 +0000 UTC m=+36.847369765 lastScaleDownFailTime=2020-06-06 17:18:02.104470075 +0000 UTC m=+36.847369849 scaleDownForbidden=false isDeleteInProgress=false scaleDownInCooldown=false
As far as I can tell, the node was not scaled down because there was no other node with room for app2. But app2's memory request is 700Mi, and right now the other nodes have plenty of free memory for it:
$ kubectl top nodes
NAME CPU(cores) CPU% MEMORY(bytes) MEMORY%
ip-10-0-0-93.eu-central-1.compute.internal 386m 20% 920Mi 27%
ip-10-0-1-115.eu-central-1.compute.internal 298m 15% 794Mi 23%
I still don't understand why the autoscaler doesn't move app2 to one of the other available nodes and scale down ip-*-145.
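For completeness, the pending pod's own memory request can be read straight from its spec (the pod name is taken from the autoscaler log above):

kubectl get pod app2-848db65964-9nr2m -o jsonpath='{.spec.containers[*].resources.requests.memory}'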
A request is the amount guaranteed to a container, and the scheduler (and the cluster autoscaler's simulation) works with requests, not with the actual usage that kubectl top reports. It will not place a pod on a node that does not have enough unreserved allocatable capacity. In your case, the two remaining nodes already have almost all of their allocatable memory requested (0.90 and 0.98), so ip-*-145 cannot be scaled down: app2 would have nowhere to go.
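If the goal is to fit everything onto two nodes, the memory requests have to come down (or the instance type has to grow). As a rough sketch, assuming the deployment is named app2 and that a smaller request is actually safe for the workload, lowering the request gives the autoscaler room to bin-pack onto two nodes:

# 400Mi is only an example value; size the request from observed usage
kubectl set resources deployment app2 --requests=memory=400Mi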