Kubernetes pods occasionally throw "ImagePullBackOff" or "ErrImagePull"

Question

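I understand that ImagePullBackOff or ErrImagePull happens when Kubernetes cannot pull a container image, but I do not think that is the case here. I say this because the error is thrown randomly by only some of the pods as my service scales, while the other pods come up perfectly fine with an OK status.

For example, see the replica set here.

I retrieved the events from one such failed pod.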

Events:
  Type     Reason     Age                   From                                                          Message
  ----     ------     ----                  ----                                                          -------
  Normal   Scheduled  3m45s                 default-scheduler                                             Successfully assigned default/storefront-jtonline-prod-6dfbbd6bd8-jp5k5 to gke-square1-prod-clu-nap-n1-highcpu-2-82b95c00-p5gl
  Normal   Pulling    2m8s (x4 over 3m44s)  kubelet, gke-square1-prod-clu-nap-n1-highcpu-2-82b95c00-p5gl  pulling image "gcr.io/square1-2019/storefront-jtonline-prod:latest"
  Warning  Failed     2m7s (x4 over 3m43s)  kubelet, gke-square1-prod-clu-nap-n1-highcpu-2-82b95c00-p5gl  Failed to pull image "gcr.io/square1-2019/storefront-jtonline-prod:latest": rpc error: code = Unknown desc = Error response from daemon: unauthorized: You don't have the needed permissions to perform this operation, and you may have invalid credentials. To authenticate your request, follow the steps in: https://cloud.google.com/container-registry/docs/advanced-authentication
  Warning  Failed     2m7s (x4 over 3m43s)  kubelet, gke-square1-prod-clu-nap-n1-highcpu-2-82b95c00-p5gl  Error: ErrImagePull
  Normal   BackOff    113s (x6 over 3m42s)  kubelet, gke-square1-prod-clu-nap-n1-highcpu-2-82b95c00-p5gl  Back-off pulling image "gcr.io/square1-2019/storefront-jtonline-prod:latest"
  Warning  Failed     99s (x7 over 3m42s)   kubelet, gke-square1-prod-clu-nap-n1-highcpu-2-82b95c00-p5gl  Error: ImagePullBackOff

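The events tell me the image could not be pulled because of incorrect credentials, which seems... confusing? This pod was created automatically during autoscaling and is identical to the other pods.

I have a feeling this might be related to resourcing. I see a much higher rate of these errors when the cluster spins up new nodes very quickly because of a traffic spike, or when I set lower resource requests in my deployment configuration.

How do I go about debugging this error, and what could be the reason it is happening?

Here is my configuration: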

apiVersion: "extensions/v1beta1"
kind: "Deployment"
metadata:
  name: "storefront-_STOREFRONT-_ENV"
  namespace: "default"
  labels:
    app: "storefront-_STOREFRONT-_ENV"
spec:
  replicas: 10
  selector:
    matchLabels:
      app: "storefront-_STOREFRONT-_ENV"
  template:
    metadata:
      labels:
        app: "storefront-_STOREFRONT-_ENV"
    spec:
      containers:
      - name: "storefront-_STOREFRONT-_ENV"
        image: "gcr.io/square1-2019/storefront-_STOREFRONT-_ENV"
        ports:
        - containerPort: 8080
        readinessProbe:
          httpGet:
            path: /?healthz
            port: 8080
          initialDelaySeconds: 5
          periodSeconds: 1
        imagePullPolicy: Always

apiVersion: "autoscaling/v2beta1"
kind: "HorizontalPodAutoscaler"
metadata:
  name: "storefront-_STOREFRONT-hpa"
  namespace: "default"
  labels:
    app: "storefront-_STOREFRONT-_ENV"
spec:
  scaleTargetRef:
    kind: "Deployment"
    name: "storefront-_STOREFRONT-_ENV"
    apiVersion: "apps/v1beta1"
  minReplicas: 10
  maxReplicas: 1000
  metrics:
  - type: "Resource"
    resource:
      name: "cpu"
      targetAverageUtilization: 75

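EDIT: I have been able to verify that this is in fact an authentication issue. It only happens for "some" pods because it only occurs for pods scheduled on nodes that were created automatically as a result of vertical scaling. I do not know how to fix it yet, though.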

Answer 1

Cro*_*rou 1

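As we can read in the Kubernetes documentation on images, there is nothing you need to do if you are running your cluster on GKE.

Note: If you are running on Google Kubernetes Engine, there will already be a .dockercfg on each node with credentials for Google Container Registry. You cannot use this approach.

But it also states:

Note: This approach is suitable if you can control node configuration. It will not work reliably on GCE, or on any other cloud provider that does automatic node replacement.

And in the section "Specifying ImagePullSecrets on a Pod":

Note: This approach is currently the recommended approach for Google Kubernetes Engine, GCE, and any cloud provider where node creation is automated.

The recommendation is to create a Secret with a Docker config.

This can be done as follows: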

kubectl create secret docker-registry <name> --docker-server=DOCKER_REGISTRY_SERVER --docker-username=DOCKER_USER --docker-password=DOCKER_PASSWORD --docker-email=DOCKER_EMAIL
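
Creating the Secret is not enough on its own: the pod template also has to reference it through imagePullSecrets. Below is a minimal sketch of that step. It assumes the Secret was created under the hypothetical name gcr-pull-secret and that the Deployment is called storefront-jtonline-prod (inferred from the pod name in the question's events). The script only writes a patch file; the kubectl call is left as a comment because it needs a live cluster.

```shell
#!/bin/sh
# Hypothetical names, not taken from the original answer: adjust to your cluster.
SECRET_NAME="gcr-pull-secret"            # the <name> passed to "kubectl create secret docker-registry"
DEPLOYMENT="storefront-jtonline-prod"    # assumed from the pod name in the events

# Write a strategic-merge patch that adds the pull secret to the pod template.
cat > pull-secret-patch.yaml <<EOF
spec:
  template:
    spec:
      imagePullSecrets:
      - name: ${SECRET_NAME}
EOF

# Apply it against a live cluster (commented out so the script runs anywhere):
# kubectl patch deployment "${DEPLOYMENT}" --patch "$(cat pull-secret-patch.yaml)"
```

Because the patch changes the pod template, applying it should trigger a rollout, so new pods reference the Secret regardless of which node they are scheduled on.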
