Kar*_*arl 4 kubernetes prometheus-operator
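I can see from the pod description that my pod is "Failed" because it was "Evicted" due to memory pressure. But how can I test for too many "Failed && Evicted" pods, using a Prometheus alert rule or some other way?
I have the Prometheus Operator installed and I can see metrics for Failed pods, but not for pods that are both Failed and Evicted.
kubectl describe pod gives: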
Name: besteffort-evictme-001
Namespace: skyfii
Priority: 0
Node: ip-172-17-2-169.ap-southeast-2.compute.internal/
Start Time: Fri, 24 Sep 2021 15:28:53 +1000
Labels: <none>
Annotations: kubernetes.io/psp: eks.privileged
Status: Failed
Reason: Evicted
Message: The node was low on resource: memory. Container termination-demo-container was using 17165108Ki, which exceeds its request of 0.
IP:
IPs: <none>
Containers:
The Prometheus rule:
kube_pod_status_phase{phase="Failed"} > 0
shows the failed pod:
kube_pod_status_phase{endpoint="http",instance="172.17.3.141:8080",job="kube-state-metrics",namespace="skyfii",phase="Failed",pod="besteffort-evictme-001",service="prometheus-kube-state-metrics"}
but this one returns nothing:
kube_pod_container_status_terminated_reason{reason="Evicted"} > 0
Any ideas?
Thanks, Karl
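So it looks like I need to update my kube-prometheus-stack Helm chart version.
The "Evicted" Reason we see in the pod description is set on the podStatus, not on a container status, which is why kube_pod_container_status_terminated_reason does not show it.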

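Newer kube-prometheus-stack versions pull in a later version (v2) of kube-state-metrics, which exposes kube_pod_status_reason.
I will upgrade, then refactor my Prometheus query to use this new metric, and post back an answer when it is working.
Cheers, Karl
Upgrading to kube-prometheus-stack v18.1.0 allowed me to do this: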

so I can now build the query I need.
I added it to my Prometheus alerting rules by putting the following in the additionalPrometheusRulesMap section of the kube-prometheus-stack values.yaml:
- name: kubernetes-container-evictions
  rules:
    # Mem pressure evicted pods are left in a Failed state, alert if we see too many failed pods
    # NB you will need to delete the failed pods after investigating
    - alert: FailedEvictedPods
      expr: sum by(namespace, pod) (kube_pod_status_phase{phase="Failed"} > 0 and on(namespace, pod) kube_pod_status_reason{reason="Evicted"} > 0) > 0
      for: 10m
      labels:
        severity: warning
      annotations:
        message: 'Failed Evicted pod:{{ $labels.pod }} namespace:{{ $labels.namespace }}'
    - alert: TooManyEvictedPods
      expr: sum(kube_pod_status_reason{reason="Evicted"}) >= 2
      labels:
        severity: high
      annotations:
        message: 'Too many Failed Evicted Pods: {{ $value }}'
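For context, the rule group above is only the inner part of the Helm value. A minimal sketch of how it might nest in values.yaml, assuming the chart's additionalPrometheusRulesMap value is the section meant here; the map key eviction-rules is a hypothetical name chosen for illustration, and only one of the two alerts is repeated to keep it short:

```yaml
# values.yaml for kube-prometheus-stack (sketch, not the full file)
additionalPrometheusRulesMap:
  eviction-rules:            # hypothetical key, any name should do
    groups:
      - name: kubernetes-container-evictions
        rules:
          - alert: TooManyEvictedPods
            expr: sum(kube_pod_status_reason{reason="Evicted"}) >= 2
            labels:
              severity: high
            annotations:
              message: 'Too many Failed Evicted Pods: {{ $value }}'
```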
Now I get the alerts I wanted :-)
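Since the original question was how to test this, here is a rough sketch of a promtool unit test for the rules. It assumes the rule group has been saved on its own under a top-level groups: key in a file called evictions.rules.yaml (a hypothetical file name), and it reuses the pod and namespace names from the question as synthetic input series.

```yaml
# evictions.test.yaml (hypothetical name); run with: promtool test rules evictions.test.yaml
rule_files:
  - evictions.rules.yaml     # assumed to contain "groups:" with the group above

evaluation_interval: 1m

tests:
  - interval: 1m
    # One synthetic pod that is both Failed and Evicted for 15 minutes
    input_series:
      - series: 'kube_pod_status_phase{namespace="skyfii",pod="besteffort-evictme-001",phase="Failed"}'
        values: '1+0x15'
      - series: 'kube_pod_status_reason{namespace="skyfii",pod="besteffort-evictme-001",reason="Evicted"}'
        values: '1+0x15'
    alert_rule_test:
      # After the 10m "for" window the per-pod alert should be firing
      - eval_time: 11m
        alertname: FailedEvictedPods
        exp_alerts:
          - exp_labels:
              severity: warning
              namespace: skyfii
              pod: besteffort-evictme-001
            exp_annotations:
              message: 'Failed Evicted pod:besteffort-evictme-001 namespace:skyfii'
      # Only one evicted pod exists, so the ">= 2" alert should stay silent
      - eval_time: 11m
        alertname: TooManyEvictedPods
        exp_alerts: []
```

This only checks the rule logic against made-up samples; it does not confirm that kube-state-metrics in a given cluster actually exports kube_pod_status_reason.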