Grafana 从 Loki 查询大量日志时超时

Tan*_*may 12 grafana grafana-loki

我有一个在 AWS Graviton(arm、4 个 vCPU、8 GiB)上运行的 Loki 服务器,配置如下:

common:
  replication_factor: 1
  ring:
    kvstore:
      store: etcd
      etcd:
        endpoints: ['127.0.0.1:2379']

storage_config:
  boltdb_shipper:
   active_index_directory: /opt/loki/index
   cache_location: /opt/loki/index_cache
   shared_store: s3

  aws:
    s3: s3://ap-south-1/bucket-name

limits_config:
  enforce_metric_name: false
  reject_old_samples: true
  reject_old_samples_max_age: 168h # 7d
  ingestion_rate_mb: 10
  ingestion_burst_size_mb: 20
  per_stream_rate_limit: 8MB
  
ingester:
  lifecycler:
    join_after: 30s
  chunk_block_size: 10485760

compactor:
  working_directory: /opt/loki/compactor
  shared_store: s3
  compaction_interval: 5m

schema_config:
  configs:
    - from: 2022-01-01
      store: boltdb-shipper
      object_store: s3
      schema: v11
      index:
        prefix: loki_
        period: 24h

table_manager:
  retention_period: 360h #15d
  retention_deletes_enabled: true
  index_tables_provisioning: # unused
    provisioned_write_throughput: 500
    provisioned_read_throughput: 100
    inactive_write_throughput: 1
    inactive_read_throughput: 100
Run Code Online (Sandbox Code Playgroud)

摄取工作正常,我能够从数据大小较小的流中查询长时间的日志。我还可以查询具有 TiB 数据流的短持续时间日志。

当我尝试从大数据流中查询 24 小时的数据并且 Grafana 在 5 分钟后超时时,我在 Loki 中看到以下错误:

Feb 11 08:27:32 loki-01 loki[19490]: level=error ts=2022-02-11T08:27:32.186137309Z caller=retry.go:73 org_id=fake msg="error processing request" try=2 err="context canceled"
Feb 11 08:27:32 loki-01 loki[19490]: level=info ts=2022-02-11T08:27:32.186304708Z caller=metrics.go:92 org_id=fake latency=fast query="{filename=\"/var/log/server.log\",host=\"web-199\",ip=\"192.168.20.239\",name=\"web\"} |= \"attachDriver\"" query_type=filter range_type=range length=24h0m0s step=1m0s duration=0s status=499 limit=1000 returned_lines=0 throughput=0B total_bytes=0B
Feb 11 08:27:32 loki-01 loki[19490]: level=info ts=2022-02-11T08:27:32.23882892Z caller=metrics.go:92 org_id=fake latency=slow query="{filename=\"/var/log/server.log\",host=\"web-199\",ip=\"192.168.20.239\",name=\"web\"} |= \"attachDriver\"" query_type=filter range_type=range length=24h0m0s step=1m0s duration=59.813829694s status=400 limit=1000 returned_lines=153 throughput=326MB total_bytes=20GB
Feb 11 08:27:32 loki-01 loki[19490]: level=error ts=2022-02-11T08:27:32.238959314Z caller=scheduler_processor.go:199 org_id=fake msg="error notifying frontend about finished query" err="rpc error: code = Canceled desc = context canceled" frontend=192.168.5.138:9095
Feb 11 08:27:32 loki-01 loki[19490]: level=error ts=2022-02-11T08:27:32.23898877Z caller=scheduler_processor.go:154 org_id=fake msg="error notifying scheduler about finished query" err=EOF addr=192.168.5.138:9095
Run Code Online (Sandbox Code Playgroud)

询问:{filename="/var/log/server.log",host="web-199",ip="192.168.20.239",name="web"} |= "attachDriver"

有没有办法传输结果而不是等待响应?我可以优化 Loki 以更好地处理此类查询吗?

gau*_*tam 1

Grafana Loki 超时增加配置中的超时时间。