use*_*892 9 kubernetes prometheus
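Is there a way to monitor a Kubernetes CronJob?
I have a Kubernetes CronJob that runs on my cluster every 10 minutes. Is there a way to collect metrics each time the CronJob fails because of an error, or to be notified when it has not finished after a certain amount of time?
I am using these rules together with kube-state-metrics: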
groups:
- name: job.rules
  rules:
  - alert: CronJobRunning
    expr: time() - kube_cronjob_next_schedule_time > 3600
    for: 1h
    labels:
      severity: warning
    annotations:
      description: CronJob {{$labels.namespace}}/{{$labels.cronjob}} is taking more than 1h to complete
      summary: CronJob didn't finish after 1h
  - alert: JobCompletion
    expr: kube_job_spec_completions - kube_job_status_succeeded > 0
    for: 1h
    labels:
      severity: warning
    annotations:
      description: Job completion is taking more than 1h to complete
        cronjob {{$labels.namespace}}/{{$labels.job}}
      summary: Job {{$labels.job}} didn't finish to complete after 1h
  - alert: JobFailed
    expr: kube_job_status_failed > 0
    for: 1h
    labels:
      severity: warning
    annotations:
      description: Job {{$labels.namespace}}/{{$labels.job}} failed to complete
      summary: Job failed
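The way to monitor cronjobs with Prometheus is to have them push a metric to the Pushgateway indicating the last time they succeeded. You can then alert if the cronjob has not succeeded recently.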
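As a rough sketch of that approach, the rule below alerts when the last pushed success timestamp is too old. The metric name `my_cronjob_last_success_timestamp_seconds`, the Pushgateway address, the job name `my-cronjob`, and the 30-minute threshold (three missed runs of a 10-minute schedule) are all assumptions made for illustration; none of them come from the answer itself.

```yaml
# Assumption: on success, the job pushes a timestamp to the Pushgateway, e.g.
#   echo "my_cronjob_last_success_timestamp_seconds $(date +%s)" \
#     | curl --data-binary @- http://pushgateway.example.org:9091/metrics/job/my-cronjob
# The metric name, host, and job name above are hypothetical.
groups:
- name: cronjob-pushgateway.rules
  rules:
  - alert: CronJobNoRecentSuccess
    # Fires when the last successful run is more than 30 minutes old.
    expr: time() - my_cronjob_last_success_timestamp_seconds > 1800
    labels:
      severity: warning
    annotations:
      summary: CronJob has not reported a successful run in the last 30 minutes
```

Because the timestamp is only pushed on success, the same rule covers both a failing job and a job that never finishes.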
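All answers so far are not namespace-aware and rely on the Job carrying a label.
The latter can be fixed, because kube-state-metrics 1.6.0 introduced a new metric, kube_job_owner, which solves the problem of matching Jobs to CronJobs.
Note: in kube-state-metrics 1.4.0 the job label was renamed to job_name to avoid label collisions with Prometheus.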
clamp_max(
  max by (namespace, owner_name, job_name) (
    max by (namespace, owner_name, job_name) (
      kube_job_status_start_time
      *
      on (job_name) group_left(owner_name) max by (namespace, owner_name, job_name) (kube_job_owner{owner_kind="CronJob"})
    )
    ==
    on (namespace, owner_name) group_left max by (namespace, owner_name) (
      kube_job_status_start_time
      *
      on (job_name) group_left(owner_name) max by (namespace, owner_name, job_name) (kube_job_owner{owner_kind="CronJob"})
    )
  ),
  1
)
*
on (namespace, job_name) group_left kube_job_status_failed
The output can be improved further by renaming the owner_name label to cronjob, which is done by wrapping the expression in:
max without (owner_name) (
  label_replace(
    <expression from above>,
    "cronjob", "$1", "owner_name", "(.+)"
  )
)
(The label_replace() function adds the new cronjob label, while max() drops the owner_name label.)
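To show how the two pieces might fit together, here is a sketch of a rule file that substitutes the query into the label_replace() wrapper and alerts when the result is greater than zero. The group name, the alert name `CronJobLastRunFailed`, the `for: 1m` duration, and the annotation text are assumptions, not part of the answer; the PromQL itself is only the composition of the two expressions given above.

```yaml
groups:
- name: cronjob-owner.rules          # hypothetical group name
  rules:
  - alert: CronJobLastRunFailed      # hypothetical alert name
    expr: |
      max without (owner_name) (
        label_replace(
          clamp_max(
            max by (namespace, owner_name, job_name) (
              max by (namespace, owner_name, job_name) (
                kube_job_status_start_time
                * on (job_name) group_left(owner_name)
                max by (namespace, owner_name, job_name) (kube_job_owner{owner_kind="CronJob"})
              )
              == on (namespace, owner_name) group_left
              max by (namespace, owner_name) (
                kube_job_status_start_time
                * on (job_name) group_left(owner_name)
                max by (namespace, owner_name, job_name) (kube_job_owner{owner_kind="CronJob"})
              )
            ),
            1
          )
          * on (namespace, job_name) group_left
          kube_job_status_failed,
          "cronjob", "$1", "owner_name", "(.+)"
        )
      ) > 0
    for: 1m                          # assumed; tune to your schedule
    labels:
      severity: warning
    annotations:
      summary: Last run of CronJob {{ $labels.namespace }}/{{ $labels.cronjob }} failed
```

The inner comparison keeps only the most recently started Job per CronJob, so older failed Jobs that are still retained should not keep the alert firing once a newer run exists.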
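The tricky part here is that CronJobs themselves expose no useful status; you have to match them to the Jobs they create. I wrote an article on how to achieve this:
https://medium.com/@tristan_96324/prometheus-k8s-cronjob-alerts-94bee7b90511
The article explains in detail how it works, but the alert configuration is as follows: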
groups:
- name: kube-cron
  rules:
  - record: job_cronjob:kube_job_status_start_time:max
    expr: |
      label_replace(
        label_replace(
          max(
            kube_job_status_start_time
            * ON(exported_job) GROUP_RIGHT()
            kube_job_labels{label_cronjob!=""}
          ) BY (exported_job, label_cronjob)
          == ON(label_cronjob) GROUP_LEFT()
          max(
            kube_job_status_start_time
            * ON(exported_job) GROUP_RIGHT()
            kube_job_labels{label_cronjob!=""}
          ) BY (label_cronjob),
          "job", "$1", "exported_job", "(.+)"),
        "cronjob", "$1", "label_cronjob", "(.+)")
  - record: job_cronjob:kube_job_status_failed:sum
    expr: |
      clamp_max(
        job_cronjob:kube_job_status_start_time:max,
        1)
      * ON(job) GROUP_LEFT()
      label_replace(
        label_replace(
          (kube_job_status_failed != 0),
          "job", "$1", "exported_job", "(.+)"),
        "cronjob", "$1", "label_cronjob", "(.+)")
  - alert: CronJobStatusFailed
    expr: |
      job_cronjob:kube_job_status_failed:sum
      * ON(cronjob) GROUP_RIGHT()
      kube_cronjob_labels
      > 0
    for: 1m
    annotations:
      description: '{{ $labels.cronjob }} last run has failed {{ $value }} times.'
The jobTemplate must include a label named cronjob whose value matches the name of the CronJob object.
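A minimal manifest illustrating that requirement might look like the sketch below. The name `my-cronjob`, the image, and the command are placeholders, and `batch/v1` assumes a reasonably recent cluster (older clusters expose CronJob under `batch/v1beta1`).

```yaml
apiVersion: batch/v1
kind: CronJob
metadata:
  name: my-cronjob                 # hypothetical name
spec:
  schedule: "*/10 * * * *"         # every 10 minutes, as in the question
  jobTemplate:
    metadata:
      labels:
        cronjob: my-cronjob        # must equal metadata.name above
    spec:
      template:
        spec:
          restartPolicy: Never
          containers:
          - name: task
            image: busybox:1.36    # placeholder image
            command: ["sh", "-c", "echo running"]
```

With this label in place, kube-state-metrics can expose it as `label_cronjob` on `kube_job_labels`, which is what the recording rules above join on. Depending on the kube-state-metrics version, the label may first have to be allowed explicitly (for example through `--metric-labels-allowlist`), so check the deployment if the label does not show up.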