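I'm working on a Kubernetes CronJob that represents an integration test. It is a Go test binary that is compiled with go test -c and copied into a Docker container that the cron job runs. The Kubernetes YAML starts like this: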
apiVersion: batch/v1beta1
kind: CronJob
spec:
  schedule: "*/15 * * * *"
  concurrencyPolicy: Forbid
  successfulJobsHistoryLimit: 7
  failedJobsHistoryLimit: 7
  jobTemplate:
    spec:
      template:
        spec:
          restartPolicy: Never
At some point the integration test started failing (exiting with code 1). I can see that the job's duration is the same as its age:
$ kubectl get jobs -l app=integration-test
NAME                          COMPLETIONS   DURATION   AGE
integration-test-1592457300   0/1           7m20s      7m20s
The kubectl get pods command shows that pods are created more often than I would expect from the cron schedule, which runs every 15 minutes:
$ kubectl get pods -l app=integration-test
NAME                                READY   STATUS   RESTARTS   AGE
integration-test-1592457300-224x8   0/1     Error    0          92s
integration-test-1592457300-5f8sz   0/1     Error    0          7m33s
integration-test-1592457300-9zvjq   0/1     Error    0          3m57s
integration-test-1592457300-th7sf   0/1     Error    0          6m26s
integration-test-1592457300-vhbr2   0/1     Error    0          5m17s
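This behavior of starting new pods is problematic because it adds to the number of pods running on the node; in essence, it consumes resources.
How can I make the cron job stop starting new pods and instead run only one every 15 minutes, so that it does not keep consuming resources when the job fails?
Here is a simplified example, using Kubernetes YAML adapted from https://kubernetes.io/docs/tasks/job/automated-tasks-with-cron-jobs/: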
$ cat cronjob.yaml
apiVersion: batch/v1beta1
kind: CronJob
metadata:
  name: hello
spec:
  schedule: "*/1 * * * *"
  jobTemplate:
    spec:
      template:
        spec:
          containers:
          - name: hello
            image: busybox
            args:
            - /bin/sh
            - -c
            - date; echo Hello from the Kubernetes cluster; exit 1
          restartPolicy: Never
Note that it exits with code 1. If I run it with kubectl apply -f cronjob.yaml and then check the pods, I see
$ kubectl get pods
NAME                     READY   STATUS   RESTARTS   AGE
hello-1592459760-fnvcw   0/1     Error    0          30s
hello-1592459760-w75lt   0/1     Error    0          31s
hello-1592459760-xzhwn   0/1     Error    0          20s
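The pods' ages are less than a minute apart; in other words, new pods are started before the cron interval has elapsed. How can I prevent this?
This is a very specific scenario, and it is hard to guess what you want to achieve and whether this will work for you.
concurrencyPolicy: Forbid prevents another job from being created while the previous one has not completed. But I don't think that is the case here.
restartPolicy applies to the pod (in a Job template you can only use OnFailure and Never). If you set restartPolicy to Never, the job automatically creates new pods until one succeeds or the job gives up.
A Job creates one or more Pods and ensures that a specified number of them successfully terminate. As pods successfully complete, the Job tracks the successful completions.
If you set restartPolicy: Never, pods are created until backoffLimit is reached, but those pods remain visible in your cluster in Error status, each with exit status 1. You need to delete them manually. If you set restartPolicy: OnFailure, the job restarts its one pod and does not create more.
But there is another way to look at it. What is considered a completed job?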
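To make the difference between the two fields concrete, here is a minimal sketch of the jobTemplate from the question's hello example with restartPolicy switched to OnFailure. It assumes nothing else in the manifest changes; the explicit backoffLimit: 6 only spells out the default and is not required.

```yaml
# Fragment of a CronJob spec (the part under .spec.jobTemplate).
jobTemplate:
  spec:
    backoffLimit: 6                # Job level: retries before the Job is marked failed (6 is the default)
    template:
      spec:
        restartPolicy: OnFailure   # Pod level: the container is restarted inside the same pod
        containers:
        - name: hello
          image: busybox
          args:
          - /bin/sh
          - -c
          - date; echo Hello from the Kubernetes cluster; exit 1
```

With Never instead of OnFailure, each retry shows up as a separate pod in Error status rather than as a higher RESTARTS count on a single pod.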
Examples:
1. restartPolicy: OnFailure
$ kubectl get po,jobs,cronjob
NAME                         READY   STATUS             RESTARTS   AGE
pod/hello-1592495280-w27mt   0/1     CrashLoopBackOff   5          5m21s
pod/hello-1592495340-tzc64   0/1     CrashLoopBackOff   5          4m21s
pod/hello-1592495400-w8cm6   0/1     CrashLoopBackOff   5          3m21s
pod/hello-1592495460-jjlx5   0/1     CrashLoopBackOff   4          2m21s
pod/hello-1592495520-c59tm   0/1     CrashLoopBackOff   3          80s
pod/hello-1592495580-rrdzw   0/1     Error              2          20s

NAME                         COMPLETIONS   DURATION   AGE
job.batch/hello-1592495220   0/1           6m22s      6m22s
job.batch/hello-1592495280   0/1           5m22s      5m22s
job.batch/hello-1592495340   0/1           4m22s      4m22s
job.batch/hello-1592495400   0/1           3m22s      3m22s
job.batch/hello-1592495460   0/1           2m22s      2m22s
job.batch/hello-1592495520   0/1           81s        81s
job.batch/hello-1592495580   0/1           21s        21s

NAME                  SCHEDULE      SUSPEND   ACTIVE   LAST SCHEDULE   AGE
cronjob.batch/hello   */1 * * * *   False     6        25s             15m
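Each job creates only 1 pod, which is restarted until the job finishes or is considered completed by the CronJob.
If you describe the CronJob, you can find the following in the Events section.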
Events:
  Type    Reason            Age   From                Message
  ----    ------            ----  ----                -------
  Normal  SuccessfulCreate  18m   cronjob-controller  Created job hello-1592494740
  Normal  SuccessfulCreate  17m   cronjob-controller  Created job hello-1592494800
  Normal  SuccessfulCreate  16m   cronjob-controller  Created job hello-1592494860
  Normal  SuccessfulCreate  15m   cronjob-controller  Created job hello-1592494920
  Normal  SuccessfulCreate  14m   cronjob-controller  Created job hello-1592494980
  Normal  SuccessfulCreate  13m   cronjob-controller  Created job hello-1592495040
  Normal  SawCompletedJob   12m   cronjob-controller  Saw completed job: hello-1592494740
  Normal  SuccessfulCreate  12m   cronjob-controller  Created job hello-1592495100
  Normal  SawCompletedJob   11m   cronjob-controller  Saw completed job: hello-1592494800
  Normal  SuccessfulDelete  11m   cronjob-controller  Deleted job hello-1592494740
  Normal  SuccessfulCreate  11m   cronjob-controller  Created job hello-1592495160
  Normal  SawCompletedJob   10m   cronjob-controller  Saw completed job: hello-1592494860
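Why is job hello-1592494740 considered Completed? The default value of a Job's .spec.backoffLimit is 6 (this information can be found in the docs). If the job fails 6 times (the pod is restarted 6 times without succeeding), the job stops retrying, and the CronJob treats it as finished (the "Saw completed job" event) and then deletes it once it falls outside failedJobsHistoryLimit (1 by default). When the job is removed, its pod is removed too.
However, in your example, once the pod was created it ran the date and echo commands and then exited with code 1. Even though the pod is crashing, it still wrote its output. Because the last command is exit 1, it keeps crashing until the limit is reached. See the example below: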
$ kubectl get pods
NAME                     READY   STATUS             RESTARTS   AGE
hello-1592495400-w8cm6   0/1     Terminating        6          5m51s
hello-1592495460-jjlx5   0/1     CrashLoopBackOff   5          4m51s
hello-1592495520-c59tm   0/1     CrashLoopBackOff   5          3m50s
hello-1592495580-rrdzw   0/1     CrashLoopBackOff   4          2m50s
hello-1592495640-nbq59   0/1     CrashLoopBackOff   4          110s
hello-1592495700-p6pcx   0/1     Error              3          50s
user@cloudshell:~ (project)$ kubectl logs hello-1592495520-c59tm
Thu Jun 18 15:55:13 UTC 2020
Hello from the Kubernetes cluster
2. restartPolicy: Never and backoffLimit: 0
The following YAML was used:
apiVersion: batch/v1beta1
kind: CronJob
metadata:
  name: hello
spec:
  schedule: "*/1 * * * *"
  jobTemplate:
    spec:
      template:
        spec:
          containers:
          - name: hello
            image: busybox
            args:
            - /bin/sh
            - -c
            - date; echo Hello from the Kubernetes cluster; exit 1
          restartPolicy: Never
      backoffLimit: 0
Output:
$ kubectl get po,jobs,cronjob
NAME                         READY   STATUS   RESTARTS   AGE
pod/hello-1592497320-svd6k   0/1     Error    0          44s

NAME                         COMPLETIONS   DURATION   AGE
job.batch/hello-1592497320   0/1           44s        44s

NAME                  SCHEDULE      SUSPEND   ACTIVE   LAST SCHEDULE   AGE
cronjob.batch/hello   */1 * * * *   False     0        51s             11m

$ kubectl describe cronjob
...
Events:
  Type    Reason            Age    From                Message
  ----    ------            ----   ----                -------
  Normal  SuccessfulCreate  12m    cronjob-controller  Created job hello-1592496720
  Normal  SawCompletedJob   11m    cronjob-controller  Saw completed job: hello-1592496720
  Normal  SuccessfulCreate  11m    cronjob-controller  Created job hello-1592496780
  Normal  SawCompletedJob   10m    cronjob-controller  Saw completed job: hello-1592496780
  Normal  SuccessfulDelete  10m    cronjob-controller  Deleted job hello-1592496720
  Normal  SuccessfulCreate  10m    cronjob-controller  Created job hello-1592496840
  Normal  SuccessfulDelete  9m55s  cronjob-controller  Deleted job hello-1592496780
  Normal  SawCompletedJob   9m55s  cronjob-controller  Saw completed job: hello-1592496840
  Normal  SuccessfulCreate  9m5s   cronjob-controller  Created job hello-1592496900
  Normal  SawCompletedJob   8m55s  cronjob-controller  Saw completed job: hello-1592496900
  Normal  SuccessfulDelete  8m55s  cronjob-controller  Deleted job hello-1592496840
  Normal  SuccessfulCreate  8m5s   cronjob-controller  Created job hello-1592496960
  Normal  SawCompletedJob   7m55s  cronjob-controller  Saw completed job: hello-1592496960
  Normal  SuccessfulDelete  7m55s  cronjob-controller  Deleted job hello-1592496900
  Normal  SuccessfulCreate  7m4s   cronjob-controller  Created job hello-1592497020
This way only one job and one pod exist at a time (there may be a window of about 10 seconds during which there are 2 jobs and 2 pods).
$ kubectl get po,job
NAME                         READY   STATUS   RESTARTS   AGE
pod/hello-1592497440-twzlf   0/1     Error    0          70s
pod/hello-1592497500-2q7fq   0/1     Error    0          10s

NAME                         COMPLETIONS   DURATION   AGE
job.batch/hello-1592497440   0/1           70s        70s
job.batch/hello-1592497500   0/1           10s        10s

user@cloudshell:~ (project)$ kk get po,job
NAME                         READY   STATUS   RESTARTS   AGE
pod/hello-1592497500-2q7fq   0/1     Error    0          11s

NAME                         COMPLETIONS   DURATION   AGE
job.batch/hello-1592497500   0/1           11s        11s
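Applied to the 15-minute CronJob from the question, the second option might look like the sketch below. The metadata name, container name, and image are placeholders, since the question does not show them; everything else is taken from the question, plus backoffLimit: 0.

```yaml
apiVersion: batch/v1beta1
kind: CronJob
metadata:
  name: integration-test            # placeholder name
spec:
  schedule: "*/15 * * * *"
  concurrencyPolicy: Forbid
  successfulJobsHistoryLimit: 7
  failedJobsHistoryLimit: 7         # up to 7 failed jobs (and their pods) stay listed
  jobTemplate:
    spec:
      backoffLimit: 0               # no retries: at most one pod per scheduled run
      template:
        spec:
          restartPolicy: Never
          containers:
          - name: integration-test                # placeholder
            image: example.com/integration-test   # placeholder image
```

With failedJobsHistoryLimit: 7, up to seven failed jobs, each with its single Error pod, remain listed until they rotate out of the history. Lower the value if fewer leftover pods are wanted.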
I hope this clears things up a bit. If you want a more precise answer, please provide more information about your scenario.