calico-kube-controllers and calico-node not ready (CrashLoopBackOff)

Hos*_*ari 5 calico kubernetes kubespray

I deployed a fresh k8s cluster using kubespray and everything works, except that none of the calico-related pods become ready. After several hours of debugging I could not find out why the calico pods keep crashing. I even disabled/stopped the entire firewall service, but nothing changed.

Another important detail is that the output of calicoctl node status is unstable; each invocation shows something different:

Calico process is not running.
Calico process is running.

None of the BGP backend processes (BIRD or GoBGP) are running.
Calico process is running.

IPv4 BGP status
+----------------+-------------------+-------+----------+---------+
|  PEER ADDRESS  |     PEER TYPE     | STATE |  SINCE   |  INFO   |
+----------------+-------------------+-------+----------+---------+
| 192.168.231.42 | node-to-node mesh | start | 06:23:41 | Passive |
+----------------+-------------------+-------+----------+---------+

IPv6 BGP status
No IPv6 peers found.

Another message that shows up frequently in the logs is:

bird: Unable to open configuration file /etc/calico/confd/config/bird.cfg: No such file or directory
bird: Unable to open configuration file /etc/calico/confd/config/bird6.cfg: No such file or directory

I also tried changing IP_AUTODETECTION_METHOD with each of the following, but nothing changed:

kubectl set env daemonset/calico-node -n kube-system IP_AUTODETECTION_METHOD=can-reach=www.google.com
kubectl set env daemonset/calico-node -n kube-system IP_AUTODETECTION_METHOD=can-reach=8.8.8.8
kubectl set env daemonset/calico-node -n kube-system IP_AUTODETECTION_METHOD=interface=eth1
kubectl set env daemonset/calico-node -n kube-system IP_AUTODETECTION_METHOD=interface=eth.*
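Since kubespray owns the calico manifests, any IP_AUTODETECTION_METHOD change made with kubectl set env can be overwritten on the next ansible run. A sketch of making it persistent instead, assuming kubespray's calico_ip_auto_method variable (check your kubespray version's group_vars for the exact name):

```yaml
# inventory/mycluster/group_vars/k8s_cluster/k8s-net-calico.yml
# Renders IP_AUTODETECTION_METHOD into the calico-node DaemonSet
# on the next kubespray run; the value mirrors one of the attempts above.
calico_ip_auto_method: "interface=eth.*"
```

Re-running the cluster playbook after this change keeps the detection method consistent across redeploys.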

Expected behavior

All calico-related pods, daemonsets, deployments, and replicasets should be in the READY state.

Current behavior

All calico-related pods, daemonsets, deployments, and replicasets are in the NOT READY state.

Possible solution

None yet; I am looking for help on how to debug/overcome this issue.

Steps to reproduce (for bugs)

It is the latest version of kubespray, with the following context and environment.

git reflog

7e4b176 HEAD@{0}: clone: from https://github.com/kubernetes-sigs/kubespray.git

Context

I am trying to deploy a k8s cluster with one master node and one worker node. Also note that the servers in this cluster sit in an almost air-gapped/offline environment with restricted access to the global internet. The ansible run that deploys the cluster with kubespray does succeed, but then I hit this problem with the calico pods.

Your environment

cat inventory/mycluster/hosts.yaml

all:
  hosts:
    node1:
      ansible_host: 192.168.231.41
      ansible_port: 32244
      ip: 192.168.231.41
      access_ip: 192.168.231.41
    node2:
      ansible_host: 192.168.231.42
      ansible_port: 32244
      ip: 192.168.231.42
      access_ip: 192.168.231.42
  children:
    kube_control_plane:
      hosts:
        node1:
    kube_node:
      hosts:
        node1:
        node2:
    etcd:
      hosts:
        node1:
    k8s_cluster:
      children:
        kube_control_plane:
        kube_node:
    calico_rr:
      hosts: {}
calicoctl version

Client Version:    v3.19.2
Git commit:        6f3d4900
Cluster Version:   v3.19.2
Cluster Type:      kubespray,bgp,kubeadm,kdd,k8s
cat /etc/centos-release

CentOS Linux release 7.9.2009 (Core)
uname -r

3.10.0-1160.42.2.el7.x86_64
kubectl version

Client Version: version.Info{Major:"1", Minor:"21", GitVersion:"v1.21.4", GitCommit:"3cce4a82b44f032d0cd1a1790e6d2f5a55d20aae", GitTreeState:"clean", BuildDate:"2021-08-11T18:16:05Z", GoVersion:"go1.16.7", Compiler:"gc", Platform:"linux/amd64"}

Server Version: version.Info{Major:"1", Minor:"21", GitVersion:"v1.21.4", GitCommit:"3cce4a82b44f032d0cd1a1790e6d2f5a55d20aae", GitTreeState:"clean", BuildDate:"2021-08-11T18:10:22Z", GoVersion:"go1.16.7", Compiler:"gc", Platform:"linux/amd64"}
 kubectl get nodes -o wide

NAME    STATUS   ROLES                  AGE   VERSION   INTERNAL-IP      EXTERNAL-IP   OS-IMAGE                KERNEL-VERSION                CONTAINER-RUNTIME
node1   Ready    control-plane,master   19h   v1.21.4   192.168.231.41   <none>        CentOS Linux 7 (Core)   3.10.0-1160.42.2.el7.x86_64   docker://20.10.8
node2   Ready    <none>                 19h   v1.21.4   192.168.231.42   <none>        CentOS Linux 7 (Core)   3.10.0-1160.42.2.el7.x86_64   docker://20.10.8
kubectl get all --all-namespaces -o wide

NAMESPACE     NAME                                           READY   STATUS             RESTARTS   AGE   IP               NODE    NOMINATED NODE   READINESS GATES
kube-system   pod/calico-kube-controllers-8575b76f66-57zw4   0/1     CrashLoopBackOff   327        19h   192.168.231.42   node2   <none>           <none>
kube-system   pod/calico-node-4hkzb                          0/1     Running            245        14h   192.168.231.42   node2   <none>           <none>
kube-system   pod/calico-node-hznhc                          0/1     Running            245        14h   192.168.231.41   node1   <none>           <none>
kube-system   pod/coredns-8474476ff8-b6lqz                   1/1     Running            0          19h   10.233.96.1      node2   <none>           <none>
kube-system   pod/coredns-8474476ff8-gdkml                   1/1     Running            0          19h   10.233.90.1      node1   <none>           <none>
kube-system   pod/dns-autoscaler-7df78bfcfb-xnn4r            1/1     Running            0          19h   10.233.90.2      node1   <none>           <none>
kube-system   pod/kube-apiserver-node1                       1/1     Running            0          19h   192.168.231.41   node1   <none>           <none>
kube-system   pod/kube-controller-manager-node1              1/1     Running            0          19h   192.168.231.41   node1   <none>           <none>
kube-system   pod/kube-proxy-dmw22                           1/1     Running            0          19h   192.168.231.41   node1   <none>           <none>
kube-system   pod/kube-proxy-wzpnv                           1/1     Running            0          19h   192.168.231.42   node2   <none>           <none>
kube-system   pod/kube-scheduler-node1                       1/1     Running            0          19h   192.168.231.41   node1   <none>           <none>
kube-system   pod/nginx-proxy-node2                          1/1     Running            0          19h   192.168.231.42   node2   <none>           <none>
kube-system   pod/nodelocaldns-6h5q2                         1/1     Running            0          19h   192.168.231.42   node2   <none>           <none>
kube-system   pod/nodelocaldns-7fwbd                         1/1     Running            0          19h   192.168.231.41   node1   <none>           <none>

NAMESPACE     NAME                 TYPE        CLUSTER-IP   EXTERNAL-IP   PORT(S)                  AGE   SELECTOR
default       service/kubernetes   ClusterIP   10.233.0.1   <none>        443/TCP                  19h   <none>
kube-system   service/coredns      ClusterIP   10.233.0.3   <none>        53/UDP,53/TCP,9153/TCP   19h   k8s-app=kube-dns

NAMESPACE     NAME                          DESIRED   CURRENT   READY   UP-TO-DATE   AVAILABLE   NODE SELECTOR            AGE   CONTAINERS    IMAGES                                     SELECTOR
kube-system   daemonset.apps/calico-node    2         2         0       2            0           kubernetes.io/os=linux   19h   calico-node   quay.io/calico/node:v3.19.2                k8s-app=calico-node
kube-system   daemonset.apps/kube-proxy     2         2         2       2            2           kubernetes.io/os=linux   19h   kube-proxy    k8s.gcr.io/kube-proxy:v1.21.4              k8s-app=kube-proxy
kube-system   daemonset.apps/nodelocaldns   2         2         2       2            2           kubernetes.io/os=linux   19h   node-cache    k8s.gcr.io/dns/k8s-dns-node-cache:1.17.1   k8s-app=nodelocaldns

NAMESPACE     NAME                                      READY   UP-TO-DATE   AVAILABLE   AGE   CONTAINERS                IMAGES                                                       SELECTOR
kube-system   deployment.apps/calico-kube-controllers   0/1     1            0           19h   calico-kube-controllers   quay.io/calico/kube-controllers:v3.19.2                      k8s-app=calico-kube-controllers
kube-system   deployment.apps/coredns                   2/2     2            2           19h   coredns                   k8s.gcr.io/coredns/coredns:v1.8.0                            k8s-app=kube-dns
kube-system   deployment.apps/dns-autoscaler            1/1     1            1           19h   autoscaler                k8s.gcr.io/cpa/cluster-proportional-autoscaler-amd64:1.8.3   k8s-app=dns-autoscaler

NAMESPACE     NAME                                                 DESIRED   CURRENT   READY   AGE   CONTAINERS                IMAGES                                                       SELECTOR
kube-system   replicaset.apps/calico-kube-controllers-8575b76f66   1         1         0       19h   calico-kube-controllers   quay.io/calico/kube-controllers:v3.19.2                      k8s-app=calico-kube-controllers,pod-template-hash=8575b76f66
kube-system   replicaset.apps/coredns-8474476ff8                   2         2         2       19h   coredns                   k8s.gcr.io/coredns/coredns:v1.8.0                            k8s-app=kube-dns,pod-template-hash=8474476ff8
kube-system   replicaset.apps/dns-autoscaler-7df78bfcfb            1         1         1       19h   autoscaler                k8s.gcr.io/cpa/cluster-proportional-autoscaler-amd64:1.8.3   k8s-app=dns-autoscaler,pod-template-hash=7df78bfcfb

Hos*_*ari 5

Fortunately, increasing timeoutSeconds of the livenessProbe & readinessProbe from 1 to 60 resolved the issue:

kubectl edit -n kube-system daemonset.apps/calico-node
kubectl edit -n kube-system deployment.apps/calico-kube-controllers
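Instead of editing interactively, the same probe change can be applied as a strategic-merge patch. A sketch, assuming the container is named calico-node as in the kubespray-generated DaemonSet (field paths follow the standard pod spec; only timeoutSeconds is overridden, other probe fields are left as deployed):

```yaml
# probe-timeout-patch.yaml
# Raises both probe timeouts from the default 1s to 60s on calico-node.
spec:
  template:
    spec:
      containers:
        - name: calico-node
          livenessProbe:
            timeoutSeconds: 60
          readinessProbe:
            timeoutSeconds: 60
```

Applied with, e.g., kubectl patch daemonset calico-node -n kube-system --patch-file probe-timeout-patch.yaml (and analogously for the calico-kube-controllers deployment, with its own container name). Note that a kubectl-side patch, like the kubectl edit above, can be reverted by a later kubespray run.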

https://github.com/projectcalico/calico/issues/4935