Alerting on missing series/data

Question


I am trying to work out how to get Grafana to alert me when a metric is no longer being scraped.

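The metric I am using in this example is `mongodb_instance_uptime_seconds`. When an instance goes down, the metric is no longer generated, so it is simply missing from Prometheus. The alert currently fires on `when last() query(A, 1m, now) < 600`. As you can see, the goal is to alert when the uptime is below 5 minutes. This means I want to be alerted on both restarts and stops, but Grafana does not alert when an instance is down, because the `last()` value does not actually exist, and once an instance has been down for more than 5 minutes it is no longer reported at all.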

Any pointers on how to move forward?

Answer 1

val*_*ala 6

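If you know all the labels of the monitored time series in advance, you can alert with the `absent_over_time` function. For example, the following query returns a non-empty result (i.e. triggers an alert) when the metric `mongodb_instance_uptime_seconds{instance="foo",job="bar"}` has received no new samples in the last 5 minutes: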

absent_over_time(mongodb_instance_uptime_seconds{instance="foo",job="bar"}[5m])
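
As a rough sketch of how this query could be wired into a Prometheus alerting rule, the fragment below wraps it in a rule file. The group name, alert name, severity label, and annotation text are illustrative assumptions; only the expression comes from the answer.

```yaml
groups:
- name: mongodb-missing-series        # hypothetical group name
  rules:
  - alert: MongoUptimeMetricAbsent    # hypothetical alert name
    # Non-empty only when the selector matched no samples in the last 5 minutes.
    expr: absent_over_time(mongodb_instance_uptime_seconds{instance="foo",job="bar"}[5m])
    labels:
      severity: critical              # assumed severity
    annotations:
      # absent_over_time() carries over the labels from equality matchers,
      # so instance and job are available in the template.
      summary: "No samples for mongodb_instance_uptime_seconds on {{ $labels.instance }} (job {{ $labels.job }}) in the last 5 minutes"
```

Because the `[5m]` window already encodes the waiting period, this sketch leaves out a separate `for:` clause.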


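Unfortunately, neither `absent` nor `absent_over_time` can return multiple results when only some of the matching time series disappear. For example, suppose there are two time series: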

mongodb_instance_uptime_seconds{instance="foo"}
mongodb_instance_uptime_seconds{instance="bar"}


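If only one of them stops receiving new samples (say `mongodb_instance_uptime_seconds{instance="foo"}` gets no new samples while `mongodb_instance_uptime_seconds{instance="bar"}` continues to receive them), then neither of the following queries returns the expected alert for `mongodb_instance_uptime_seconds{instance="foo"}`, because the selector still matches the `bar` series and the result stays empty: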

absent(mongodb_instance_uptime_seconds)


absent_over_time(mongodb_instance_uptime_seconds[5m])


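Prometheus does not yet provide a solution for this case, whereas VictoriaMetrics provides the `lag()` function, which can be used for alerting here. For example, the following MetricsQL query alerts (i.e. returns a non-empty result) when at least one time series named `mongodb_instance_uptime_seconds` stopped receiving new samples for more than 5 minutes during the last hour: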

lag(mongodb_instance_uptime_seconds[1h]) > 5m


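This alert stays active for one hour after the time series stops receiving new samples. The duration of the active alert can be tuned by changing the value in square brackets.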
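
The `lag()` query might be packaged as an alerting rule along the following lines. This is a sketch that assumes the rule is evaluated by vmalert against VictoriaMetrics (`lag()` is a MetricsQL function and is not available in Prometheus itself) and that the series carry an `instance` label; the group name, alert name, and severity are made up for illustration.

```yaml
groups:
- name: mongodb-stale-series          # hypothetical group name
  rules:
  - alert: MongoUptimeMetricStale     # hypothetical alert name
    # lag() gives the seconds elapsed since the last sample seen in the 1h window;
    # MetricsQL accepts the duration literal 5m as the number 300.
    expr: lag(mongodb_instance_uptime_seconds[1h]) > 5m
    labels:
      severity: warning               # assumed severity
    annotations:
      summary: "mongodb_instance_uptime_seconds on {{ $labels.instance }} has not received a new sample for more than 5 minutes"
```

Unlike the `absent_over_time` approach, this form needs no per-instance selector, since each stale series produces its own result.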

Answer 2

wbh*_*bh1 5

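The metric usually used to determine whether an instance was scraped successfully is `up`. It is generated automatically for every scrape job, so if you want an alert for any scrape endpoint that is down, simply use the query `up == 0`, which shows all endpoints whose last scrape failed. If you only want to alert on this specific endpoint, add labels, for example `up{instance="mongodb.foo.com",job="mongo"} == 0`.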

If you are interested in doing this with Alertmanager rather than Grafana, the rule would look like this:

groups:
- name: General
  rules:
  - alert: Endpoint_Down
    expr: up == 0
    for: 5m
    labels:
      severity: critical
    annotations:
      summary: "Exporter is down: {{ $labels.instance }}"
      description: "The endpoint {{ $labels.instance }} is not able to be scraped by Prometheus."
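
The question also asks about restarts, which `up == 0` can miss when the instance comes back between scrapes. A second rule on the uptime metric could cover that case. The sketch below assumes a threshold of 300 seconds, matching the five-minute goal stated in the question, and uses hypothetical group and alert names.

```yaml
groups:
- name: mongodb-restarts              # hypothetical group name
  rules:
  - alert: Mongo_Recently_Restarted   # hypothetical alert name
    # Fires while the reported uptime is under 5 minutes (300 seconds, assumed threshold).
    expr: mongodb_instance_uptime_seconds < 300
    labels:
      severity: warning               # assumed severity
    annotations:
      summary: "MongoDB recently restarted: {{ $labels.instance }}"
      description: "Reported uptime on {{ $labels.instance }} is {{ $value }} seconds."
```

This rule only fires while the metric is still being scraped, so it complements the `up == 0` rule rather than replacing it.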


Rather than alerting on `up` alone, consider also alerting on `absent`. For example: `absent(up{pod=~"deployment-name.+"})==1` (2 upvotes)
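
A sketch of how this suggestion might look as a rule, reusing the `job="mongo"` label from the answer as an assumed selector (the comment's `pod` matcher would work the same way where that label exists). It covers the case where the target vanishes from the scrape configuration or service discovery altogether, so that no `up` series exists and `up == 0` returns nothing. The alert name and annotation text are hypothetical.

```yaml
groups:
- name: General
  rules:
  - alert: Endpoint_Missing           # hypothetical alert name
    # absent() returns a single series with value 1 when the selector matches nothing.
    expr: absent(up{job="mongo"})
    for: 5m
    labels:
      severity: critical
    annotations:
      # The job label is carried over from the equality matcher in the selector.
      summary: "No targets found for job {{ $labels.job }}"
      description: "Prometheus has no up series for job {{ $labels.job }}, so the target may have dropped out of service discovery."
```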

Archived: 7 years, 4 months ago
Views: 5,351
Last activity: 5 years, 5 months ago