Is there a way to monitor Kubernetes CronJobs with Prometheus?

use*_*892 9 kubernetes prometheus

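Is there a way to monitor a Kubernetes CronJob?

I have a Kubernetes CronJob that runs every 10 minutes on my cluster. Is there a way to collect metrics each time the CronJob fails because of an error, or to get notified when it has not finished after a certain amount of time?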

Cam*_*mil 8

I am using these rules together with kube-state-metrics:

groups:
- name: job.rules
  rules:
  - alert: CronJobRunning
    expr: time() - kube_cronjob_next_schedule_time > 3600
    for: 1h
    labels:
      severity: warning
    annotations:
      description: CronJob {{$labels.namespace}}/{{$labels.cronjob}} is taking more than 1h to complete
      summary: CronJob didn't finish after 1h

  - alert: JobCompletion
    expr: kube_job_spec_completions - kube_job_status_succeeded  > 0
    for: 1h
    labels:
      severity: warning
    annotations:
      description: Job {{$labels.namespace}}/{{$labels.job}} is taking more than 1h to complete
      summary: Job {{$labels.job}} didn't finish after 1h

  - alert: JobFailed
    expr: kube_job_status_failed  > 0
    for: 1h
    labels:
      severity: warning
    annotations:
      description: Job {{$labels.namespace}}/{{$labels.job}} failed to complete
      summary: Job failed

  • This has the problem of alerting indefinitely, because a CronJob keeps its Jobs around until the history limit is reached. (9 upvotes)

bri*_*zil 6

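The way to monitor cronjobs with Prometheus is to have them push a metric to the Pushgateway indicating the last time they succeeded. You can then alert if the cronjob has not succeeded recently.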
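
A minimal sketch of that approach, under stated assumptions: the image, the `/app/run-task` command, the metric name `my_cronjob_last_success_timestamp_seconds`, and the Pushgateway address are all hypothetical placeholders, and the image is assumed to ship `curl`. The container pushes the current Unix time only when the task exits successfully:

```yaml
# Sketch only: names, image, and the Pushgateway URL are hypothetical.
apiVersion: batch/v1
kind: CronJob
metadata:
  name: my-cronjob
spec:
  schedule: "*/10 * * * *"   # every 10 minutes, as in the question
  jobTemplate:
    spec:
      template:
        spec:
          restartPolicy: Never
          containers:
          - name: task
            image: example.com/my-task:latest   # assumed to contain the task and curl
            command:
            - /bin/sh
            - -c
            - |
              # Push the timestamp only if the task itself succeeded.
              /app/run-task &&
              echo "my_cronjob_last_success_timestamp_seconds $(date +%s)" |
                curl --fail --silent --data-binary @- \
                  http://pushgateway.monitoring.svc:9091/metrics/job/my-cronjob
```

An alerting rule on the Prometheus side could then look roughly like this. The 1800 second threshold (three missed runs) is an arbitrary choice for illustration:

```yaml
groups:
- name: cronjob-pushgateway   # hypothetical group name
  rules:
  - alert: CronJobNotSucceededRecently
    # Fires when the last pushed success timestamp is older than 30 minutes.
    expr: time() - my_cronjob_last_success_timestamp_seconds > 1800
    labels:
      severity: warning
    annotations:
      summary: my-cronjob has not succeeded in the last 30 minutes
```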

  • Attaching timestamps to metric samples is discouraged. Having a metric whose value is a timestamp (e.g. `process_start_time_seconds`) is fine. (2 upvotes)

Per*_*erC 6

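All the answers so far are not namespace-aware and rely on a label on the Job.

The latter can be fixed, because kube-state-metrics version 1.6.0 introduced a new metric, kube_job_owner, which solves the problem of matching Jobs to CronJobs.

Note: in kube-state-metrics 1.4.0 the job label was renamed to job_name to avoid a label collision with Prometheus.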

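# Returns kube_job_status_failed for the most recently started Job of each CronJob.
# Inner joins: attach owner_name from kube_job_owner to each Job's start time.
# The == comparison keeps only the Job whose start time is the latest per CronJob.
# clamp_max(..., 1) turns that start time into 1 so it can be multiplied with the failure count.
# Joins match on namespace as well as job_name, so equal Job names in different namespaces do not collide.
clamp_max(
  max by (namespace, owner_name, job_name) (
    max by (namespace, owner_name, job_name) (
      kube_job_status_start_time
      *
      on (namespace, job_name) group_left(owner_name) max by (namespace, owner_name, job_name) (kube_job_owner{owner_kind="CronJob"})
    )
    ==
    on (namespace, owner_name) group_left max by (namespace, owner_name) (
      kube_job_status_start_time
      *
      on (namespace, job_name) group_left(owner_name) max by (namespace, owner_name, job_name) (kube_job_owner{owner_kind="CronJob"})
    )
  ),
  1
)
*
on (namespace, job_name) group_left kube_job_status_failed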

The output can be improved further by renaming the owner_name label to cronjob. To do that, wrap the expression like this:

max without (owner_name) (
  label_replace(
    <expression from above>
  ,
  "cronjob", "$1", "owner_name", "(.+)"
  )
)


(The label_replace() function adds the cronjob label, while max() removes the owner_name label.)
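
For illustration, here is one way the wrapped query might be used in an alerting rule. This is a sketch, not part of the original answer: the group name, alert name, `for` duration, and severity are hypothetical, the redundant outer aggregation is dropped, and every join is assumed to match on namespace as well as job_name. Because only the latest Job of each CronJob is selected, the alert should clear once a later run succeeds.

```yaml
groups:
- name: cronjob-owner.rules        # hypothetical group name
  rules:
  - alert: CronJobLastRunFailed    # hypothetical alert name
    expr: |
      max without (owner_name) (
        label_replace(
          clamp_max(
            max by (namespace, owner_name, job_name) (
              kube_job_status_start_time
              * on (namespace, job_name) group_left(owner_name)
              max by (namespace, owner_name, job_name) (kube_job_owner{owner_kind="CronJob"})
            )
            == on (namespace, owner_name) group_left
            max by (namespace, owner_name) (
              kube_job_status_start_time
              * on (namespace, job_name) group_left(owner_name)
              max by (namespace, owner_name, job_name) (kube_job_owner{owner_kind="CronJob"})
            ),
            1
          )
          * on (namespace, job_name) group_left
          kube_job_status_failed,
          "cronjob", "$1", "owner_name", "(.+)"
        )
      ) > 0
    for: 1m
    labels:
      severity: warning
    annotations:
      summary: 'Last run of CronJob {{ $labels.namespace }}/{{ $labels.cronjob }} failed'
```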

  • Could you explain this query? (4 upvotes)

Tri*_*ate 5

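The trickiest part here is that CronJobs themselves have no useful status; you have to match them to the Jobs they create. I wrote an article on how to achieve this: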

https://medium.com/@tristan_96324/prometheus-k8s-cronjob-alerts-94bee7b90511

The article explains in detail how it works, but the alert configuration is as follows:

groups:
- name: kube-cron
  rules:
  - record: job_cronjob:kube_job_status_start_time:max
    expr: |
      label_replace(
        label_replace(
          max(
            kube_job_status_start_time
            * ON(exported_job) GROUP_RIGHT()
            kube_job_labels{label_cronjob!=""}
          ) BY (exported_job, label_cronjob)
          == ON(label_cronjob) GROUP_LEFT()
          max(
            kube_job_status_start_time
            * ON(exported_job) GROUP_RIGHT()
            kube_job_labels{label_cronjob!=""}
          ) BY (label_cronjob),
          "job", "$1", "exported_job", "(.+)"),
        "cronjob", "$1", "label_cronjob", "(.+)")

  - record: job_cronjob:kube_job_status_failed:sum
    expr: |
      clamp_max(
        job_cronjob:kube_job_status_start_time:max,
      1)
      * ON(job) GROUP_LEFT()
      label_replace(
        label_replace(
          (kube_job_status_failed != 0),
          "job", "$1", "exported_job", "(.+)"),
        "cronjob", "$1", "label_cronjob", "(.+)")


  - alert: CronJobStatusFailed
    expr: |
      job_cronjob:kube_job_status_failed:sum
      * ON(cronjob) GROUP_RIGHT()
      kube_cronjob_labels
      > 0
    for: 1m
    annotations:
      description: '{{ $labels.cronjob }} last run has failed {{$value }} times.'

The jobTemplate must contain a label named cronjob whose value matches the name of the CronJob object.
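
A sketch of what that labelling could look like, assuming a CronJob called `my-cronjob`; the image and schedule are hypothetical placeholders. The label goes on the Job template's metadata, so every Job created by the CronJob carries it and kube-state-metrics can report it as label_cronjob in kube_job_labels:

```yaml
# Sketch only: name, schedule, and image are hypothetical.
apiVersion: batch/v1
kind: CronJob
metadata:
  name: my-cronjob
  labels:
    cronjob: my-cronjob        # lets kube_cronjob_labels expose label_cronjob too
spec:
  schedule: "*/10 * * * *"
  jobTemplate:
    metadata:
      labels:
        cronjob: my-cronjob    # must equal metadata.name of the CronJob
    spec:
      template:
        spec:
          restartPolicy: Never
          containers:
          - name: task
            image: example.com/my-task:latest
```

Newer kube-state-metrics releases (2.x) may only expose the Kubernetes labels listed in `--metric-labels-allowlist`, so it is worth checking that label_cronjob actually shows up on kube_job_labels before relying on these rules.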