mon*_*mon 4 monitoring automation kubernetes kubernetes-health-check kubernetes-helm
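The K8s "Monitor Node Health" documentation mentions Node Problem Detector. How do we use it if we are not on GCE? Does it feed information to the Dashboard, or expose metrics through an API?
"This tool aims to make various node problems visible to the upstream layers in cluster management stack. It is a daemon which runs on each node, detects node problems and reports them to apiserver."
Err, OK, but... what does that actually mean? How can I tell whether it reached the API server?
What do the before and after look like? Knowing that would help me understand what it is doing.
Before installing Node Problem Detector I see: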
Bash# kubectl describe node ip-10-40-22-166.ec2.internal | grep -i condition -A 20 | grep Ready -B 20
Conditions:
Type Status LastHeartbeatTime LastTransitionTime Reason Message
---- ------ ----------------- ------------------ ------ -------
NetworkUnavailable False Thu, 20 Jun 2019 12:30:05 -0400 Thu, 20 Jun 2019 12:30:05 -0400 WeaveIsUp Weave pod has set this
OutOfDisk False Thu, 20 Jun 2019 18:27:39 -0400 Thu, 20 Jun 2019 12:29:44 -0400 KubeletHasSufficientDisk kubelet has sufficient disk space available
MemoryPressure False Thu, 20 Jun 2019 18:27:39 -0400 Thu, 20 Jun 2019 12:29:44 -0400 KubeletHasSufficientMemory kubelet has sufficient memory available
DiskPressure False Thu, 20 Jun 2019 18:27:39 -0400 Thu, 20 Jun 2019 12:29:44 -0400 KubeletHasNoDiskPressure kubelet has no disk pressure
PIDPressure False Thu, 20 Jun 2019 18:27:39 -0400 Thu, 20 Jun 2019 12:29:44 -0400 KubeletHasSufficientPID kubelet has sufficient PID available
Ready True Thu, 20 Jun 2019 18:27:39 -0400 Thu, 20 Jun 2019 12:30:14 -0400 KubeletReady kubelet is posting ready status
After installing Node Problem Detector I see:
Bash# helm upgrade --install npd stable/node-problem-detector -f node-problem-detector.values.yaml
Bash# kubectl rollout status daemonset npd-node-problem-detector #(wait for up)
Bash# kubectl describe node ip-10-40-22-166.ec2.internal | grep -i condition -A 20 | grep Ready -B 20
Conditions:
Type Status LastHeartbeatTime LastTransitionTime Reason Message
---- ------ ----------------- ------------------ ------ -------
DockerDaemon False Thu, 20 Jun 2019 22:06:17 -0400 Thu, 20 Jun 2019 22:04:14 -0400 DockerDaemonHealthy Docker daemon is healthy
EBSHealth False Thu, 20 Jun 2019 22:06:17 -0400 Thu, 20 Jun 2019 22:04:14 -0400 NoVolumeErrors Volumes are attaching successfully
KernelDeadlock False Thu, 20 Jun 2019 22:06:17 -0400 Thu, 20 Jun 2019 22:04:14 -0400 KernelHasNoDeadlock kernel has no deadlock
ReadonlyFilesystem False Thu, 20 Jun 2019 22:06:17 -0400 Thu, 20 Jun 2019 22:04:14 -0400 FilesystemIsNotReadOnly Filesystem is not read-only
NetworkUnavailable False Thu, 20 Jun 2019 12:30:05 -0400 Thu, 20 Jun 2019 12:30:05 -0400 WeaveIsUp Weave pod has set this
OutOfDisk False Thu, 20 Jun 2019 22:07:10 -0400 Thu, 20 Jun 2019 12:29:44 -0400 KubeletHasSufficientDisk kubelet has sufficient disk space available
MemoryPressure False Thu, 20 Jun 2019 22:07:10 -0400 Thu, 20 Jun 2019 12:29:44 -0400 KubeletHasSufficientMemory kubelet has sufficient memory available
DiskPressure False Thu, 20 Jun 2019 22:07:10 -0400 Thu, 20 Jun 2019 12:29:44 -0400 KubeletHasNoDiskPressure kubelet has no disk pressure
PIDPressure False Thu, 20 Jun 2019 22:07:10 -0400 Thu, 20 Jun 2019 12:29:44 -0400 KubeletHasSufficientPID kubelet has sufficient PID available
Ready True Thu, 20 Jun 2019 22:07:10 -0400 Thu, 20 Jun 2019 12:30:14 -0400 KubeletReady kubelet is posting ready status
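One way to see the before/after difference without eyeballing two tables is to snapshot the condition types the API server holds for a node and compare the snapshots. The sketch below assumes `kubectl` already points at the cluster; the helper names `condition_types` and `added_conditions` and the snapshot file names are made up for illustration.

```shell
#!/usr/bin/env bash
# List the condition types the API server currently reports for one node,
# one per line, sorted so the snapshots can be compared with comm.
# $1: node name as shown by "kubectl get nodes".
condition_types() {
  kubectl get node "$1" \
    -o jsonpath='{range .status.conditions[*]}{.type}{"\n"}{end}' | sort
}

# Print the condition types that appear in the second snapshot file
# but not in the first one.
# $1: "before" snapshot file, $2: "after" snapshot file.
added_conditions() {
  comm -13 "$1" "$2"
}

# Usage (hypothetical file names):
#   condition_types ip-10-40-22-166.ec2.internal > before.txt
#   ...install node-problem-detector...
#   condition_types ip-10-40-22-166.ec2.internal > after.txt
#   added_conditions before.txt after.txt
```

If the new condition types show up in the second snapshot, the detector's reports reached the API server, because `kubectl get node` reads the node object from there.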
Note: I asked for help coming up with a way to see this for all nodes, and Kenna Ofoegbu came up with this super useful and readable gem (shown here in a form that runs in both zsh and Bash):
zsh/Bash# for node in $(kubectl get nodes --no-headers | awk '{print $1}'); do kubectl describe node "$node" | sed -n '/Conditions/,/Ready/p'; done
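When there are many nodes, it can be easier to print only the conditions that look wrong. The following is a rough sketch, not part of Node Problem Detector: it assumes that `Ready` is healthy when `True` and that every other condition type is healthy when `False`, which matches the tables above. The function name `unhealthy_conditions` is hypothetical.

```shell
#!/usr/bin/env bash
# Print "node <TAB> condition <TAB> status" for every condition that is not
# in its healthy state. Assumption: Ready should be True, and every other
# condition type (MemoryPressure, KernelDeadlock, ...) should be False.
unhealthy_conditions() {
  # "-o name" yields entries such as "node/ip-10-40-22-166.ec2.internal",
  # which "kubectl get" accepts as-is.
  for node in $(kubectl get nodes -o name); do
    kubectl get "$node" \
      -o jsonpath='{range .status.conditions[*]}{.type}{"\t"}{.status}{"\n"}{end}' |
      awk -v node="$node" '
        ($1 == "Ready" && $2 != "True") || ($1 != "Ready" && $2 != "False") {
          print node "\t" $1 "\t" $2
        }'
  done
}

# Usage:
#   unhealthy_conditions
```

No output means every node reports only healthy conditions under that assumption.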
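OK, so now I know what Node Problem Detector does, but... what good is adding a condition to a node, and how do I use that condition to do something useful?
Question: How do I use Kubernetes Node Problem Detector?
Use Case #1: Auto-heal broken nodes
Step 1.) Install Node Problem Detector so it can attach the new condition metadata to nodes.
Step 2.) Leverage Planetlabs/draino to cordon and drain nodes that have bad conditions.
Step 3.) Leverage https://github.com/kubernetes/autoscaler/tree/master/cluster-autoscaler to auto-heal. (When a node is cordoned and drained it is marked unschedulable, which triggers a new node to be provisioned; the bad node's resource utilization will then be very low, which causes the bad node to be deprovisioned.)
Source: https://github.com/kubernetes/node-problem-detector#remedy-systems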
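To make step 2 concrete, here is a manual sketch of the same idea: read one condition from the node object and cordon and drain the node only when that condition is `True`. This is not how draino is implemented, only an illustration of what it automates; the function names are hypothetical, and `KernelDeadlock` is used because it appears in the output above.

```shell
#!/usr/bin/env bash
# Print the status ("True", "False" or "Unknown") of one condition on a node.
# $1: node name, $2: condition type, for example KernelDeadlock.
condition_status() {
  kubectl get node "$1" \
    -o jsonpath="{.status.conditions[?(@.type==\"$2\")].status}"
}

# Cordon and drain a node only when the given problem condition is "True".
# For problem conditions such as KernelDeadlock, "True" means the problem
# is present. The drain can still be refused, for example when pods use
# local storage, in which case kubectl prints the reason.
drain_if_bad() {
  if [ "$(condition_status "$1" "$2")" = "True" ]; then
    kubectl cordon "$1" &&
      kubectl drain "$1" --ignore-daemonsets
  else
    echo "$1: $2 is not True, leaving the node alone"
  fi
}

# Usage:
#   drain_if_bad ip-10-40-22-166.ec2.internal KernelDeadlock
```

Once the node is drained, the cluster autoscaler behaviour described in step 3 is what eventually removes it.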
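Use Case #2: Surface unhealthy node events so that Kubernetes can detect them, then ingest them into your monitoring stack so you have an auditable historical record of what happened and when.
These unhealthy node events are logged somewhere on the host node, but the host node usually generates so much noisy or useless log data that these events are not collected by default.
Node Problem Detector knows where to look for these events on the host node and filters out the noise. When it sees the signal of a negative outcome, it posts it to its own pod log, which is not noisy.
The pod log is likely already being ingested into an ELK or Prometheus Operator stack, where the event can be detected, alerted on, stored, and graphed.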
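Besides the pod log, problems that the detector reports as node events can be read straight from the API server. The sketch below assumes the detector emits them as `Warning` events attached to the Node object; the function name `node_warning_events` is made up for illustration.

```shell
#!/usr/bin/env bash
# Show Warning events attached to one node, sorted by last occurrence.
# $1: node name as shown by "kubectl get nodes".
# Assumption: the problems of interest are recorded as Warning events
# whose involved object is the Node itself.
node_warning_events() {
  kubectl get events --all-namespaces \
    --field-selector "involvedObject.kind=Node,involvedObject.name=$1,type=Warning" \
    --sort-by=.lastTimestamp
}

# Usage:
#   node_warning_events ip-10-40-22-166.ec2.internal
```

Events expire after a while, which is one reason to ship them to a log or metrics stack if an audit trail is needed.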
Also note that nothing stops you from implementing both use cases.
Update: added a snippet of my node-problem-detector.helm-values.yaml file, per a request in the comments:
log_monitors:
#https://github.com/kubernetes/node-problem-detector/tree/master/config contains the full list, you can exec into the pod and ls /config/ to see these as well.
- /config/abrt-adaptor.json #Adds ABRT Node Events (ABRT: automatic bug reporting tool), exceptions will show up under "kubectl describe node $NODENAME | grep Events -A 20"
- /config/kernel-monitor.json #Adds 2 new Node Health Condition Checks "KernelDeadlock" and "ReadonlyFilesystem"
- /config/docker-monitor.json #Adds new Node Health Condition Check "DockerDaemon" (Checks if Docker is unhealthy as a result of corrupt image)
# - /config/docker-monitor-filelog.json #Error: "/var/log/docker.log: no such file or directory", doesn't exist on pod, I think you'd have to mount node hostpath to get it to work, the gain doesn't sound worth the effort.
# - /config/kernel-monitor-filelog.json #Should add to existing Node Health Check "KernelDeadlock", more thorough detection, but silently fails in NPD pod logs for me.
custom_plugin_monitors: #[]
# Someone said all *-counter plugins are custom plugins, if you put them under log_monitors, you'll get #Error: "Failed to unmarshal configuration file "/config/kernel-monitor-counter.json""
- /config/kernel-monitor-counter.json #Adds new Node Health Condition Check "FrequentUnregisteredNetDevice"
- /config/docker-monitor-counter.json #Adds new Node Health Condition Check "CorruptDockerOverlay2"
- /config/systemd-monitor-counter.json #Adds 3 new Node Health Condition Checks "FrequentKubeletRestart", "FrequentDockerRestart", and "FrequentContainerdRestart"
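Before adding a monitor to the values file, it can help to confirm that the config file actually ships in the image, using the `ls /config/` trick mentioned in the snippet's comment. This is a small sketch; `check_monitor_configs` is a hypothetical helper, and the pod name has to be taken from `kubectl get pods`.

```shell
#!/usr/bin/env bash
# Report whether each monitor config path exists in a running NPD pod.
# $1: name of a node-problem-detector pod; remaining arguments: paths such
# as /config/kernel-monitor.json taken from the helm values file.
check_monitor_configs() {
  pod=$1
  shift
  # List the config files shipped in the pod, one file name per line.
  available=$(kubectl exec "$pod" -- ls /config/) || return 1
  for path in "$@"; do
    name=${path##*/}   # strip the directory part, keep the file name
    if printf '%s\n' "$available" | grep -qxF -- "$name"; then
      echo "ok      $path"
    else
      echo "missing $path"
    fi
  done
}

# Usage (the pod name is a placeholder):
#   check_monitor_configs <npd-pod-name> /config/kernel-monitor.json /config/docker-monitor.json
```

This only checks that the file exists. A file that is present can still fail to load if it is listed under the wrong key, as the comment about the *-counter plugins in the snippet above points out.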