Kubernetes job keeps spinning up pods which end up with the "Error" status

Kur*_*eek 1 kubernetes

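I'm working with a Kubernetes cron job that represents an integration test. It is a Go test binary, compiled with go test -c and copied into a Docker container that the cron job runs. The Kubernetes YAML starts out similar to the following: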

apiVersion: batch/v1beta1
kind: CronJob
spec:
  schedule: "*/15 * * * *"
  concurrencyPolicy: Forbid
  successfulJobsHistoryLimit: 7
  failedJobsHistoryLimit: 7
  jobTemplate:
    spec:
      template:
        spec:
          restartPolicy: Never

At some point, the integration test started failing (exiting with code 1). I can see that the job's duration is the same as its age:

$ kubectl get jobs -l app=integration-test
NAME                          COMPLETIONS   DURATION   AGE
integration-test-1592457300   0/1           7m20s      7m20s

The kubectl get pods command shows that pods are being created more often than I would expect from the cron schedule, which is once every 15 minutes:

$ kubectl get pods -l app=integration-test
NAME                                READY   STATUS   RESTARTS   AGE
integration-test-1592457300-224x8   0/1     Error    0          92s
integration-test-1592457300-5f8sz   0/1     Error    0          7m33s
integration-test-1592457300-9zvjq   0/1     Error    0          3m57s
integration-test-1592457300-th7sf   0/1     Error    0          6m26s
integration-test-1592457300-vhbr2   0/1     Error    0          5m17s

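This behavior of starting new pods is problematic because it adds to the number of pods running on the node; essentially, it consumes resources.

How can I make the cron job stop spinning up new pods, run just one every 15 minutes, and not keep consuming resources when the job fails?

Update

A simplified example uses Kubernetes YAML adapted from https://kubernetes.io/docs/tasks/job/automated-tasks-with-cron-jobs/: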

$ cat cronjob.yaml
apiVersion: batch/v1beta1
kind: CronJob
metadata:
  name: hello
spec:
  schedule: "*/1 * * * *"
  jobTemplate:
    spec:
      template:
        spec:
          containers:
          - name: hello
            image: busybox
            args:
            - /bin/sh
            - -c
            - date; echo Hello from the Kubernetes cluster; exit 1
          restartPolicy: Never

Note that it exits with code 1. If I run it using kubectl apply -f cronjob.yaml and then check the pods, I see

$ kubectl get pods
NAME                                                    READY   STATUS      RESTARTS   AGE
hello-1592459760-fnvcw                                  0/1     Error       0          30s
hello-1592459760-w75lt                                  0/1     Error       0          31s
hello-1592459760-xzhwn                                  0/1     Error       0          20s

The pods' ages are less than a minute apart; in other words, pods are being started before the cron interval has elapsed. How can I prevent this?

Pjo*_*erS 5

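This is a very specific scenario, and it is hard to guess what you want to achieve and whether it will work for you.

concurrencyPolicy: Forbid prevents another job from being created if the previous one has not completed. But I don't think that is the case here.

restartPolicy applies to the pod (although in a Job template you can only use OnFailure and Never). If you set restartPolicy to Never, the job will automatically create new pods until it completes.

A Job creates one or more Pods and ensures that a specified number of them successfully terminate. As pods successfully complete, the Job tracks the successful completions.

If you set restartPolicy: Never, the job will keep creating pods until it reaches backoffLimit, and these pods will remain visible in your cluster with Error status, because each pod exits with status 1. You would need to remove them manually. If you set restartPolicy: OnFailure, the job will restart a single pod and will not create more.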
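
As a minimal sketch of where these two settings live, the fragment below shows the jobTemplate part of a CronJob, reusing the container from the question's simplified example. It is only an illustration: backoffLimit is written out explicitly with its default value, and the commented-out line marks the alternative restart policy.

```yaml
# Fragment of a CronJob's spec.jobTemplate; only the two fields discussed above matter.
jobTemplate:
  spec:
    backoffLimit: 6                # default value, written out here for clarity
    template:
      spec:
        restartPolicy: OnFailure   # the failed container is restarted inside the same pod
        # restartPolicy: Never     # alternative: a new pod is created for every retry
        containers:
        - name: hello
          image: busybox
          args:
          - /bin/sh
          - -c
          - date; echo Hello from the Kubernetes cluster; exit 1
```

Note that backoffLimit is a field of the Job spec (next to template), while restartPolicy belongs to the pod spec one level further down.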

But there is another way to look at it: what is considered a completed job?

Examples:

1. restartPolicy: OnFailure

$ kubectl get po,jobs,cronjob
NAME                         READY   STATUS             RESTARTS   AGE
pod/hello-1592495280-w27mt   0/1     CrashLoopBackOff   5          5m21s
pod/hello-1592495340-tzc64   0/1     CrashLoopBackOff   5          4m21s
pod/hello-1592495400-w8cm6   0/1     CrashLoopBackOff   5          3m21s
pod/hello-1592495460-jjlx5   0/1     CrashLoopBackOff   4          2m21s
pod/hello-1592495520-c59tm   0/1     CrashLoopBackOff   3          80s
pod/hello-1592495580-rrdzw   0/1     Error              2          20s
NAME                         COMPLETIONS   DURATION   AGE
job.batch/hello-1592495220   0/1           6m22s      6m22s
job.batch/hello-1592495280   0/1           5m22s      5m22s
job.batch/hello-1592495340   0/1           4m22s      4m22s
job.batch/hello-1592495400   0/1           3m22s      3m22s
job.batch/hello-1592495460   0/1           2m22s      2m22s
job.batch/hello-1592495520   0/1           81s        81s
job.batch/hello-1592495580   0/1           21s        21s
NAME                  SCHEDULE      SUSPEND   ACTIVE   LAST SCHEDULE   AGE
cronjob.batch/hello   */1 * * * *   False     6        25s             15m

Each job creates only 1 pod, which is restarted until the job finishes or is considered completed by the CronJob.

If you describe the CronJob, you can find this in the Events section.

Events:
  Type    Reason            Age                  From                Message
  ----    ------            ----                 ----                -------
  Normal  SuccessfulCreate  18m                  cronjob-controller  Created job hello-1592494740
  Normal  SuccessfulCreate  17m                  cronjob-controller  Created job hello-1592494800
  Normal  SuccessfulCreate  16m                  cronjob-controller  Created job hello-1592494860
  Normal  SuccessfulCreate  15m                  cronjob-controller  Created job hello-1592494920
  Normal  SuccessfulCreate  14m                  cronjob-controller  Created job hello-1592494980
  Normal  SuccessfulCreate  13m                  cronjob-controller  Created job hello-1592495040
  Normal  SawCompletedJob   12m                  cronjob-controller  Saw completed job: hello-1592494740
  Normal  SuccessfulCreate  12m                  cronjob-controller  Created job hello-1592495100
  Normal  SawCompletedJob   11m                  cronjob-controller  Saw completed job: hello-1592494800
  Normal  SuccessfulDelete  11m                  cronjob-controller  Deleted job hello-1592494740
  Normal  SuccessfulCreate  11m                  cronjob-controller  Created job hello-1592495160
  Normal  SawCompletedJob   10m                  cronjob-controller  Saw completed job: hello-1592494860

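Why was job hello-1592494740 considered Completed? The default value of .spec.backoffLimit is 6 (this information can be found in the docs). If the job fails 6 times (the pod fails after being restarted 6 times), the CronJob considers this job Completed and removes it. When the job is removed, its pod is removed as well.

However, in your example the pod was created, executed the date and echo commands, and then exited with code 1. Even though the pod is crashing, it still wrote its output. Because the last command is exit 1, it keeps crashing until the limit is reached, as in the example below: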

$ kubectl get pods
NAME                     READY   STATUS             RESTARTS   AGE
hello-1592495400-w8cm6   0/1     Terminating        6          5m51s
hello-1592495460-jjlx5   0/1     CrashLoopBackOff   5          4m51s
hello-1592495520-c59tm   0/1     CrashLoopBackOff   5          3m50s
hello-1592495580-rrdzw   0/1     CrashLoopBackOff   4          2m50s
hello-1592495640-nbq59   0/1     CrashLoopBackOff   4          110s
hello-1592495700-p6pcx   0/1     Error              3          50s
user@cloudshell:~ (project)$ kubectl logs hello-1592495520-c59tm
Thu Jun 18 15:55:13 UTC 2020
Hello from the Kubernetes cluster

2. restartPolicy: Never and backoffLimit: 0

The following YAML was used:

apiVersion: batch/v1beta1
kind: CronJob
metadata:
  name: hello
spec:
  schedule: "*/1 * * * *"
  jobTemplate:
    spec:
      template:
        spec:
          containers:
          - name: hello
            image: busybox
            args:
            - /bin/sh
            - -c
            - date; echo Hello from the Kubernetes cluster; exit 1
          restartPolicy: Never
      backoffLimit: 0

Output:

$ kubectl get po,jobs,cronjob
NAME                         READY   STATUS   RESTARTS   AGE
pod/hello-1592497320-svd6k   0/1     Error    0          44s
NAME                         COMPLETIONS   DURATION   AGE
job.batch/hello-1592497320   0/1           44s        44s
NAME                  SCHEDULE      SUSPEND   ACTIVE   LAST SCHEDULE   AGE
cronjob.batch/hello   */1 * * * *   False     0        51s             11m

$ kubectl describe cronjob
...
Events:
  Type    Reason            Age                  From                Message
  ----    ------            ----                 ----                -------
  Normal  SuccessfulCreate  12m                  cronjob-controller  Created job hello-1592496720
  Normal  SawCompletedJob   11m                  cronjob-controller  Saw completed job: hello-1592496720
  Normal  SuccessfulCreate  11m                  cronjob-controller  Created job hello-1592496780
  Normal  SawCompletedJob   10m                  cronjob-controller  Saw completed job: hello-1592496780
  Normal  SuccessfulDelete  10m                  cronjob-controller  Deleted job hello-1592496720
  Normal  SuccessfulCreate  10m                  cronjob-controller  Created job hello-1592496840
  Normal  SuccessfulDelete  9m55s                cronjob-controller  Deleted job hello-1592496780
  Normal  SawCompletedJob   9m55s                cronjob-controller  Saw completed job: hello-1592496840
  Normal  SuccessfulCreate  9m5s                 cronjob-controller  Created job hello-1592496900
  Normal  SawCompletedJob   8m55s                cronjob-controller  Saw completed job: hello-1592496900
  Normal  SuccessfulDelete  8m55s                cronjob-controller  Deleted job hello-1592496840
  Normal  SuccessfulCreate  8m5s                 cronjob-controller  Created job hello-1592496960
  Normal  SawCompletedJob   7m55s                cronjob-controller  Saw completed job: hello-1592496960
  Normal  SuccessfulDelete  7m55s                cronjob-controller  Deleted job hello-1592496900
  Normal  SuccessfulCreate  7m4s                 cronjob-controller  Created job hello-1592497020

This way only one job and one pod exist at a time (there may be a window of about 10 seconds in which there are 2 jobs and 2 pods).

$ kubectl get po,job
NAME                         READY   STATUS   RESTARTS   AGE
pod/hello-1592497440-twzlf   0/1     Error    0          70s
pod/hello-1592497500-2q7fq   0/1     Error    0          10s

NAME                         COMPLETIONS   DURATION   AGE
job.batch/hello-1592497440   0/1           70s        70s
job.batch/hello-1592497500   0/1           10s        10s
user@cloudshell:~ (project)$ kubectl get po,job
NAME                         READY   STATUS   RESTARTS   AGE
pod/hello-1592497500-2q7fq   0/1     Error    0          11s

NAME                         COMPLETIONS   DURATION   AGE
job.batch/hello-1592497500   0/1           11s        11s
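
Applied to the 15-minute integration test from the question, the same idea might look roughly like the manifest below. This is a sketch under assumptions, not a tested configuration: the metadata name, the container name and the image reference are hypothetical placeholders (the question does not show them), and the app label is inferred from the -l app=integration-test selector used in the question.

```yaml
apiVersion: batch/v1beta1            # as in the question; Kubernetes 1.21+ serves CronJob from batch/v1
kind: CronJob
metadata:
  name: integration-test             # assumed from the pod names in the question
spec:
  schedule: "*/15 * * * *"
  concurrencyPolicy: Forbid
  successfulJobsHistoryLimit: 7
  failedJobsHistoryLimit: 7
  jobTemplate:
    spec:
      backoffLimit: 0                # no retries: a single failed pod fails the job
      template:
        metadata:
          labels:
            app: integration-test    # inferred from the label selector in the question
        spec:
          restartPolicy: Never
          containers:
          - name: integration-test   # hypothetical container name
            image: example.com/integration-test:latest   # hypothetical image reference
```

With backoffLimit: 0 each scheduled run should produce at most one pod, and failedJobsHistoryLimit then bounds how many failed jobs (and their pods) are kept around.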

I hope this clears things up a bit. If you want a more precise answer, please provide more information about your scenario.