Pods are not moved when a host fails

Question

the*_*ill 4 kubernetes

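Following the book "Kubernetes Up & Running" together with the official documentation, I set up a simple cluster for myself with 1 master and 3 worker nodes, running on Ubuntu.

It basically works until I shut down one of the worker nodes. After a few seconds, the node status switches to unknown. The pods keep reporting the status running, even though they are on the offline node.

Shouldn't k8s move those pods to a different, healthy host? Am I missing something?

Thanks for any advice!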

Answer 1

Sha*_*k V 6

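For Kubernetes 1.13 and later, pod eviction on node failure or not-ready conditions is actually controlled by taints and tolerations. The --pod-eviction-timeout flag is no longer used.

When a node goes down or becomes not ready, the node-controller/kubelet adds the following taints to the node: node.kubernetes.io/unreachable and node.kubernetes.io/not-ready. By default, every pod tolerates these taints for 300 seconds. You can control this toleration time cluster-wide for all pods with kube-apiserver flags, and per pod with the tolerations object in the pod spec.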
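
To make the default concrete: assuming the DefaultTolerationSeconds admission plugin is active (it normally is), a pod created without its own tolerations for these two taints should end up with entries roughly like the following in its spec. This is a sketch of what `kubectl get pod <pod-name> -o yaml` is expected to show, not output captured from the asker's cluster.

```yaml
# Tolerations typically injected into a pod that defines none of its own
# for these taints. 300 is the default value of the kube-apiserver flags.
tolerations:
  - key: "node.kubernetes.io/not-ready"
    operator: "Exists"
    effect: "NoExecute"
    tolerationSeconds: 300
  - key: "node.kubernetes.io/unreachable"
    operator: "Exists"
    effect: "NoExecute"
    tolerationSeconds: 300
```

With these values, a pod on a failed node stays bound to it for about 5 minutes after the taint is applied, and only then is it evicted.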

Cluster-wide configuration:

You can change the toleration time cluster-wide by passing the --default-not-ready-toleration-seconds and --default-unreachable-toleration-seconds flags to kube-apiserver.

From the documentation:

--default-not-ready-toleration-seconds int     Default: 300
Indicates the tolerationSeconds of the toleration for notReady:NoExecute that is added by default to every pod that does not already have such a toleration.
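--default-unreachable-toleration-seconds int     Default: 300
Indicates the tolerationSeconds of the toleration for unreachable:NoExecute that is added by default to every pod that does not already have such a toleration.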
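
As a rough illustration of where these flags go: on a kubeadm-style cluster the API server usually runs as a static pod, and its flags live in a manifest such as /etc/kubernetes/manifests/kube-apiserver.yaml. That layout is an assumption about the setup, and the value 120 is only an example. The fragment below shows just the relevant part; a real manifest contains many other flags that must be left in place.

```yaml
# Fragment of a kube-apiserver static pod manifest (assumed kubeadm layout).
# Only the two toleration flags are shown; keep all existing flags as they are.
spec:
  containers:
    - name: kube-apiserver
      command:
        - kube-apiserver
        - --default-not-ready-toleration-seconds=120
        - --default-unreachable-toleration-seconds=120
```

The new defaults apply only to pods created after the API server restarts with these flags; existing pods keep the tolerations they were admitted with.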

Per-pod configuration:

You can also change the toleration time for an individual pod with the following configuration.

tolerations:
  - key: "node.kubernetes.io/unreachable"
    operator: "Exists"
    effect: "NoExecute"
    tolerationSeconds: 120
  - key: "node.kubernetes.io/not-ready"
    operator: "Exists"
    effect: "NoExecute"
    tolerationSeconds: 120
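
For context, here is a minimal sketch of where that block sits in a complete manifest. The Deployment name, labels, and image are hypothetical placeholders, not taken from the question; the point is only that tolerations belong under the pod template's spec.

```yaml
# Hypothetical Deployment; "web" and "nginx" are placeholder names.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: web
spec:
  replicas: 2
  selector:
    matchLabels:
      app: web
  template:
    metadata:
      labels:
        app: web
    spec:
      containers:
        - name: web
          image: nginx
      # Evict these pods 120 seconds after the node is tainted,
      # instead of the default 300 seconds.
      tolerations:
        - key: "node.kubernetes.io/unreachable"
          operator: "Exists"
          effect: "NoExecute"
          tolerationSeconds: 120
        - key: "node.kubernetes.io/not-ready"
          operator: "Exists"
          effect: "NoExecute"
          tolerationSeconds: 120
```

Because these pods are owned by a Deployment, replacements are created on healthy nodes once the pods on the failed node are evicted. A bare pod with no controller would be evicted but not recreated.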

https://kubernetes.io/docs/concepts/configuration/taint-and-toleration/#taint-based-evictions

Archived: 5 years, 10 months ago
Views: 1308
Last activity: 5 years, 10 months ago