Understanding backoffLimit in Kubernetes Jobs

Question

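I created a CronJob in Kubernetes with schedule (8 * * * *). The Job's backoffLimit is left at its default of 6, the pod's restartPolicy is Never, and the pods are deliberately configured to FAIL. As I understand it (for a podSpec with restartPolicy: Never), the Job controller will try to create backoffLimit number of pods and then mark the Job as Failed, so I expected to see 6 pods in the Error state.

This is the Job's actual status: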

status:
  conditions:
  - lastProbeTime: 2019-02-20T05:11:58Z
    lastTransitionTime: 2019-02-20T05:11:58Z
    message: Job has reached the specified backoff limit
    reason: BackoffLimitExceeded
    status: "True"
    type: Failed
  failed: 5

Why were there only 5 failed pods instead of 6? Or is my understanding of backoffLimit incorrect?

Answer 1

MWZ*_*MWZ 13

In short: you might not be seeing all the created pods because the schedule period in the CronJob is too short.

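As described in the documentation:

Failed Pods associated with the Job are recreated by the Job controller with an exponential back-off delay (10s, 20s, 40s ...) capped at six minutes. The back-off count is reset if no new failed Pods appear before the Job's next status check.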

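If a new Job is scheduled before the Job controller has had a chance to recreate a pod (bearing in mind the delay after the previous failure), the Job controller starts counting from one again.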
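The back-off schedule quoted above can be sketched in a few lines of Python. This is only an idealized illustration of the documented sequence (10s doubling up to a six-minute cap), not the Job controller's actual code; the function name backoff_delays and its parameters are hypothetical.

```python
def backoff_delays(retries, base=10, cap=360):
    """Idealized delay (in seconds) before each retry.

    Assumption: the delay starts at `base` seconds, doubles after every
    failure, and is capped at `cap` seconds (six minutes), as the quoted
    documentation describes. Real timings also depend on pod runtime
    and controller sync intervals.
    """
    return [min(base * 2 ** i, cap) for i in range(retries)]


if __name__ == "__main__":
    # Delays before retries 1..7; the seventh one hits the cap.
    print(backoff_delays(7))  # [10, 20, 40, 80, 160, 320, 360]
```

With a short CronJob period, the later and longer delays are the ones most likely to overlap with the next scheduled run.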

I reproduced your issue in GKE using the following .yaml:

apiVersion: batch/v1beta1 # On Kubernetes 1.21+ use batch/v1 (v1beta1 was removed in 1.25)
kind: CronJob
metadata:
  name: hellocron
spec:
  schedule: "*/3 * * * *" #Runs every 3 minutes
  jobTemplate:
    spec:
      template:
        spec:
          containers:
          - name: hellocron
            image: busybox
            args:
            - /bin/cat
            - /etc/os
          restartPolicy: Never
      backoffLimit: 6
  suspend: false

This Job will fail because the file /etc/os doesn't exist.

Here is the output of kubectl describe for one of the Jobs:

Name:           hellocron-1551194280
Namespace:      default
Selector:       controller-uid=b81cdfb8-39d9-11e9-9eb7-42010a9c00d0
Labels:         controller-uid=b81cdfb8-39d9-11e9-9eb7-42010a9c00d0
                job-name=hellocron-1551194280
Annotations:    <none>
Controlled By:  CronJob/hellocron
Parallelism:    1
Completions:    1
Start Time:     Tue, 26 Feb 2019 16:18:07 +0100
Pods Statuses:  0 Running / 0 Succeeded / 6 Failed
Pod Template:
  Labels:  controller-uid=b81cdfb8-39d9-11e9-9eb7-42010a9c00d0
           job-name=hellocron-1551194280
  Containers:
   hellocron:
    Image:      busybox
    Port:       <none>
    Host Port:  <none>
    Args:
      /bin/cat
      /etc/os
    Environment:  <none>
    Mounts:       <none>
  Volumes:        <none>
Events:
  Type     Reason                Age   From            Message
  ----     ------                ----  ----            -------
  Normal   SuccessfulCreate      26m   job-controller  Created pod: hellocron-1551194280-4lf6h
  Normal   SuccessfulCreate      26m   job-controller  Created pod: hellocron-1551194280-85khk
  Normal   SuccessfulCreate      26m   job-controller  Created pod: hellocron-1551194280-wrktb
  Normal   SuccessfulCreate      26m   job-controller  Created pod: hellocron-1551194280-6942s
  Normal   SuccessfulCreate      25m   job-controller  Created pod: hellocron-1551194280-662zv
  Normal   SuccessfulCreate      22m   job-controller  Created pod: hellocron-1551194280-6c6rh
  Warning  BackoffLimitExceeded  17m   job-controller  Job has reached the specified backoff limit

Note the delay between the creation of pods hellocron-1551194280-662zv (25m ago) and hellocron-1551194280-6c6rh (22m ago).

`backoffLimit` specifies the number of retries before the Job controller gives up. (3 upvotes)

Answer 2

小智 5

Use spec.backoffLimit to specify the number of retries before a Job is considered failed. The back-off limit is set to 6 by default.

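What is the interval between these 6 retries? (4 upvotes)
Is there any value that sets the limit to unlimited? (2 upvotes)
It is an exponential back-off delay starting at 10 seconds: 10s, 20s, 40s, and so on. It is capped at 6 minutes, so even setting a very high number of retries isn't that crazy. (2 upvotes)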
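To put a rough number on the last comment, here is a small Python sketch that sums the idealized back-off delays for a given number of retries. It assumes the 10-second start, doubling, and six-minute cap described above, and it ignores pod runtime and scheduling latency; total_backoff_seconds is a hypothetical helper, not part of any Kubernetes tooling.

```python
def total_backoff_seconds(retries, base=10, cap=360):
    """Sum of idealized back-off delays (seconds) across `retries` retries.

    Assumption: each delay is min(base * 2**i, cap); time spent actually
    running the failing pods is not included.
    """
    return sum(min(base * 2 ** i, cap) for i in range(retries))


if __name__ == "__main__":
    for n in (6, 20, 100):
        seconds = total_backoff_seconds(n)
        print(f"{n} retries: {seconds} s (about {seconds / 3600:.1f} h)")
```

Because of the cap, the total waiting time grows linearly (about six minutes per extra retry) once the delay reaches 360 seconds, rather than continuing to double.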
