Java Micrometer - What to do with metrics of type *_bucket

Pat*_*Pat 3 java grafana spring-boot-actuator spring-micrometer

A quick question about metrics of type *_bucket.

My application generates metrics such as the following:


# HELP http_server_requests_seconds  
# TYPE http_server_requests_seconds histogram
http_server_requests_seconds_bucket{exception="None",method="GET",outcome="SUCCESS",status="200",uri="/health",le="0.005592405",} 273.0
http_server_requests_seconds_bucket{exception="None",method="GET",outcome="SUCCESS",status="200",uri="/health",le="0.006990506",} 797.0
http_server_requests_seconds_bucket{exception="None",method="GET",outcome="SUCCESS",status="200",uri="/health",le="0.008388607",} 2638.0
http_server_requests_seconds_bucket{exception="None",method="GET",outcome="SUCCESS",status="200",uri="/health",le="0.009786708",} 3543.0
http_server_requests_seconds_bucket{exception="None",method="GET",outcome="SUCCESS",status="200",uri="/health",le="0.011184809",} 3932.0
http_server_requests_seconds_bucket{exception="None",method="GET",outcome="SUCCESS",status="200",uri="/health",le="0.01258291",} 4154.0
http_server_requests_seconds_bucket{exception="None",method="GET",outcome="SUCCESS",status="200",uri="/health",le="0.013981011",} 4279.0
http_server_requests_seconds_bucket{exception="None",method="GET",outcome="SUCCESS",status="200",uri="/health",le="0.015379112",} 4380.0


# HELP resilience4j_circuitbreaker_calls_seconds Total number of successful calls
# TYPE resilience4j_circuitbreaker_calls_seconds histogram
resilience4j_circuitbreaker_calls_seconds_bucket{kind="successful",name="someName",le="0.001",} 0.0
resilience4j_circuitbreaker_calls_seconds_bucket{kind="successful",name="someName",le="0.001048576",} 0.0
resilience4j_circuitbreaker_calls_seconds_bucket{kind="successful",name="someName",le="0.001398101",} 0.0
resilience4j_circuitbreaker_calls_seconds_bucket{kind="successful",name="someName",le="0.001747626",} 0.0
resilience4j_circuitbreaker_calls_seconds_bucket{kind="successful",name="someName",le="0.002097151",} 0.0
resilience4j_circuitbreaker_calls_seconds_bucket{kind="successful",name="someName",le="0.002446676",} 0.0
resilience4j_circuitbreaker_calls_seconds_bucket{kind="successful",name="someName",le="0.002796201",} 0.0

I believe they are definitely useful, but unfortunately I have no idea what to do with them.

I tried some queries, such as rate(http_server_requests_seconds{_bucket_=\"+Inf\", status=~\"2..\"}[5m]), but they do not seem to return anything valuable.

What is the proper way to use metrics of type *_bucket? For instance, how do I build Grafana dashboards and visuals that best fit these *_bucket metrics?

Thank you

小智 6

You can use this metric to find the 99th or 95th percentile of the latency for a given endpoint, using the histogram_quantile function. For example, for the 99th percentile:

histogram_quantile(
  0.99, 
  sum(
    rate(
      http_server_requests_seconds_bucket{exception="None", uri = "/your-uri"}[5m])
  ) by (le)
)

For the 95th percentile:

histogram_quantile(
  0.95, 
  sum(
    rate(http_server_requests_seconds_bucket{exception="None", uri = "/your-uri"}[5m])
  ) by (le)
)
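As a rough illustration of what histogram_quantile does with the le buckets, the Python sketch below estimates a quantile by linear interpolation inside the bucket that contains the target rank. It is a simplified approximation of the idea, not Prometheus's actual implementation: it works on raw cumulative counts rather than on rate() output, and the function name estimate_quantile and the sample bucket values are made up for this example.

```python
import math


def estimate_quantile(q, buckets):
    """Estimate the q-quantile from cumulative histogram buckets.

    buckets is a list of (upper_bound, cumulative_count) pairs and must
    end with a +Inf bucket, like the le="+Inf" series in the exposition.
    This function and its name are hypothetical, for illustration only.
    """
    if not 0 < q <= 1:
        raise ValueError("q must be in (0, 1]")
    buckets = sorted(buckets)
    if len(buckets) < 2 or not math.isinf(buckets[-1][0]):
        raise ValueError("need at least one finite bucket plus a +Inf bucket")

    total = buckets[-1][1]          # the +Inf bucket holds the full count
    if total == 0:
        return math.nan             # no observations, no estimate

    rank = q * total                # how many observations lie at or below the quantile
    for i, (upper, count) in enumerate(buckets):
        if count >= rank:
            break                   # first bucket whose cumulative count reaches the rank

    if math.isinf(upper):
        # The quantile falls past the last finite bound; the best available
        # answer is that highest finite bound.
        return buckets[-2][0]

    lower = buckets[i - 1][0] if i > 0 else 0.0   # lower edge of the bucket
    below = buckets[i - 1][1] if i > 0 else 0.0   # observations below that edge
    # Assume observations are spread evenly inside the bucket.
    return lower + (upper - lower) * (rank - below) / (count - below)


# Made-up sample data: (le, cumulative count)
sample = [(0.1, 50.0), (0.2, 90.0), (0.4, 100.0), (math.inf, 100.0)]
print(estimate_quantile(0.95, sample))
```

With the sample data, the 95th percentile lands in the 0.2 to 0.4 bucket and is interpolated to about 0.3 seconds. The estimate can only be as precise as the bucket boundaries, which is why the number and spacing of buckets matters for percentile panels.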

More info: a nice excerpt from the reference: https://idanlupinsky.com/blog/application-monitoring-with-micrometer-prometheus-grafana-and-cloudwatch/

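A histogram is a collection of buckets (or counters), each maintaining the number of events observed that took up to the duration specified by the le tag. Let's look at part of the histogram published by our demo application: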

http_server_requests_seconds_bucket{exception="None",method="GET",outcome="SUCCESS",status="200",uri="/demo",le="0.067108864",} 0.0
http_server_requests_seconds_bucket{exception="None",method="GET",outcome="SUCCESS",status="200",uri="/demo",le="0.089478485",} 0.0
http_server_requests_seconds_bucket{exception="None",method="GET",outcome="SUCCESS",status="200",uri="/demo",le="0.111848106",} 92382.0
http_server_requests_seconds_bucket{exception="None",method="GET",outcome="SUCCESS",status="200",uri="/demo",le="0.134217727",} 99050.0
http_server_requests_seconds_bucket{exception="None",method="GET",outcome="SUCCESS",status="200",uri="/demo",le="0.156587348",} 99703.0
...
http_server_requests_seconds_bucket{exception="None",method="GET",outcome="SUCCESS",status="200",uri="/demo",le="0.984263336",} 99987.0
http_server_requests_seconds_bucket{exception="None",method="GET",outcome="SUCCESS",status="200",uri="/demo",le="1.0",} 99987.0
http_server_requests_seconds_bucket{exception="None",method="GET",outcome="SUCCESS",status="200",uri="/demo",le="+Inf",} 100000.0

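The second line in the listing above indicates that no requests were observed that took up to about 89 ms (as specified by the le label). This is expected, given the 100 ms sleep when handling a request. The third line shows that 92,382 requests were observed with a duration of up to about 111 ms. Note that the histogram is cumulative, and that the full count of requests falls in the last bucket, which has no upper limit: le="+Inf".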
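To make the cumulative behaviour concrete, here is a small Python sketch that turns cumulative bucket counts into per-bucket counts by subtracting each bucket's predecessor. It uses only the first five consecutive rows of the listing above plus the le="1.0" and le="+Inf" rows; the rows elided by "..." are not guessed at. The helper name per_bucket_counts is hypothetical.

```python
def per_bucket_counts(cumulative):
    """Convert (le, cumulative_count) pairs, sorted by le, into
    (le, count_in_this_bucket_only) pairs.

    Hypothetical helper for illustration; it assumes the input rows are
    consecutive buckets of the same histogram.
    """
    result = []
    previous = 0.0
    for le, count in cumulative:
        result.append((le, count - previous))
        previous = count
    return result


# First five consecutive rows of the /demo listing: (le, cumulative count)
demo = [
    (0.067108864, 0.0),
    (0.089478485, 0.0),
    (0.111848106, 92382.0),
    (0.134217727, 99050.0),
    (0.156587348, 99703.0),
]

for le, count in per_bucket_counts(demo):
    print(f"le={le}: {count:.0f} requests in this bucket alone")

# The last two rows of the listing are also consecutive, so their
# difference is the number of requests that took longer than 1 second.
slower_than_one_second = 100000.0 - 99987.0
print(f"slower than 1 s: {slower_than_one_second:.0f}")
```

Read this way, most requests (92,382) finished in the roughly 89 ms to 111 ms bucket, which matches the 100 ms sleep, and 13 requests took longer than one second.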