PostgreSQL: get the latest row for each time interval

Question


mhv*_*vis 3 sql postgresql datetime timescaledb

I have the following table. It is stored as a TimescaleDB hypertable. The data rate is 1 row per second.

CREATE TABLE electricity_data
(
    "time" timestamptz NOT NULL,
    meter_id integer REFERENCES meters NOT NULL,
    import_low double precision,
    import_normal double precision,
    export_low double precision,
    export_normal double precision,
    PRIMARY KEY ("time", meter_id)
)


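I want to get the latest row in each interval of a given length over some period, for example the latest record of each month in the previous year. The following query works, but it is slow: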

EXPLAIN ANALYZE
SELECT
DISTINCT ON (bucket)
time_bucket('1 month', "time", 'Europe/Amsterdam') AS bucket,
import_low,
import_normal,
export_low,
export_normal
FROM electricity_data
WHERE meter_id = 1
AND "time" BETWEEN '2022-01-01T00:00:00 Europe/Amsterdam' AND '2023-01-01T00:00:00 Europe/Amsterdam'
ORDER BY bucket DESC


Unique  (cost=0.42..542380.99 rows=200 width=40) (actual time=3654.263..59130.398 rows=12 loops=1)
  ->  Custom Scan (ChunkAppend) on electricity_data  (cost=0.42..514045.41 rows=11334231 width=40) (actual time=3654.260..58255.396 rows=11161474 loops=1)
        Order: time_bucket('1 mon'::interval, electricity_data."time", 'Europe/Amsterdam'::text, NULL::timestamp with time zone, NULL::interval) DESC
        ->  Index Scan using _hyper_12_1533_chunk_electricity_data_time_idx on _hyper_12_1533_chunk  (cost=0.42..11530.51 rows=255951 width=40) (actual time=3654.253..3986.885 rows=255582 loops=1)
              Index Cond: (("time" >= '2021-12-31 23:00:00+00'::timestamp with time zone) AND ("time" <= '2022-12-31 23:00:00+00'::timestamp with time zone))
              Filter: (meter_id = 1)
              Rows Removed by Filter: 24330
        ->  Index Scan Backward using "1529_1849_electricity_data_pkey" on _hyper_12_1529_chunk  (cost=0.42..25777.81 rows=604553 width=40) (actual time=1.468..1810.493 rows=603808 loops=1)
              Index Cond: (("time" >= '2021-12-31 23:00:00+00'::timestamp with time zone) AND ("time" <= '2022-12-31 23:00:00+00'::timestamp with time zone) AND (meter_id = 1))
(...)
Planning Time: 57.424 ms
JIT:
  Functions: 217
  Options: Inlining true, Optimization true, Expressions true, Deforming true
  Timing: Generation 43.496 ms, Inlining 18.805 ms, Optimization 2348.206 ms, Emission 1288.087 ms, Total 3698.594 ms
Execution Time: 59176.016 ms


Getting the latest row for a single month is instant:

EXPLAIN ANALYZE
SELECT
"time",
import_low,
import_normal,
export_low,
export_normal
FROM electricity_data
WHERE meter_id = 1
AND "time" BETWEEN '2022-12-01T00:00:00 Europe/Amsterdam' AND '2023-01-01T00:00:00 Europe/Amsterdam'
ORDER BY "time" DESC
LIMIT 1


Limit  (cost=0.42..0.47 rows=1 width=40) (actual time=0.048..0.050 rows=1 loops=1)
  ->  Custom Scan (ChunkAppend) on electricity_data  (cost=0.42..11530.51 rows=255951 width=40) (actual time=0.047..0.048 rows=1 loops=1)
        Order: electricity_data."time" DESC
        ->  Index Scan using _hyper_12_1533_chunk_electricity_data_time_idx on _hyper_12_1533_chunk  (cost=0.42..11530.51 rows=255951 width=40) (actual time=0.046..0.046 rows=1 loops=1)
              Index Cond: (("time" >= '2022-11-30 23:00:00+00'::timestamp with time zone) AND ("time" <= '2022-12-31 23:00:00+00'::timestamp with time zone))
              Filter: (meter_id = 1)
        ->  Index Scan Backward using "1529_1849_electricity_data_pkey" on _hyper_12_1529_chunk  (cost=0.42..25777.81 rows=604553 width=40) (never executed)
              Index Cond: (("time" >= '2022-11-30 23:00:00+00'::timestamp with time zone) AND ("time" <= '2022-12-31 23:00:00+00'::timestamp with time zone) AND (meter_id = 1))
(...)
        ->  Index Scan using _hyper_12_1512_chunk_electricity_data_time_idx on _hyper_12_1512_chunk  (cost=0.42..8.94 rows=174 width=40) (never executed)
              Index Cond: (("time" >= '2022-11-30 23:00:00+00'::timestamp with time zone) AND ("time" <= '2022-12-31 23:00:00+00'::timestamp with time zone))
              Filter: (meter_id = 1)
Planning Time: 2.162 ms
Execution Time: 0.152 ms


Is there a way to run the query above for each month, or for a custom interval? Or is there another way to speed up the first query?

Edit

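The query below takes 10 seconds, which is much better, but still slow compared with the manual approach. An index does not seem to make any difference.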

EXPLAIN ANALYZE
SELECT MAX("time") AS "time"
FROM electricity_data
WHERE meter_id = 1
    AND "time" >= '2022-01-01T00:00:00 Europe/Amsterdam'
    AND "time" < '2023-01-01T00:00:00 Europe/Amsterdam'
GROUP BY time_bucket('1 month', "time", 'Europe/Amsterdam');


(... plan removed)
Planning Time: 50.463 ms
JIT:
  Functions: 451
  Options: Inlining false, Optimization false, Expressions true, Deforming true
  Timing: Generation 76.476 ms, Inlining 0.000 ms, Optimization 13.849 ms, Emission 416.718 ms, Total 507.043 ms
Execution Time: 9910.058 ms


Answer 1

dav*_*idk 7

I would suggest using the last aggregate together with a continuous aggregate to solve this.

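Like the previous poster, I would also recommend an index on (meter_id, time) rather than the other way around. You can do this in the table definition by changing the order of the keys in the primary key: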

CREATE TABLE electricity_data
(
    "time" timestamptz NOT NULL,
    meter_id integer REFERENCES meters NOT NULL,
    import_low double precision,
    import_normal double precision,
    export_low double precision,
    export_normal double precision,
    PRIMARY KEY ( meter_id, "time")
);
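
If changing the primary key of an existing hypertable is not practical, a secondary index can provide the same column order. This is only a sketch: the index name is made up for illustration, and whether the planner actually picks the index for a given query is something to confirm with EXPLAIN.

```sql
-- Hypothetical index name; what matters is the column order:
-- meter_id first, then "time" (descending, to match "latest row" lookups).
CREATE INDEX electricity_data_meter_time_idx
    ON electricity_data (meter_id, "time" DESC);
```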


But that is a bit off topic. The basic query you want looks something like this:

SELECT time_bucket('1 day', "time", 'Europe/Amsterdam'), 
    meter_id, 
    last(electricity_data, "time") 
FROM electricity_data 
GROUP BY 1, 2;


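This is a bit confusing until you realize that a table is itself also a type in PostgreSQL. That means you can pass the table to the last aggregate and get back a composite type holding the latest row for the month, the day, or whatever bucket you want.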

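You then need to be able to treat it as a row again, which you do by expanding it with parentheses and .*, the way composite types are expanded in PostgreSQL: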

SELECT time_bucket('1 month', "time", 'Europe/Amsterdam'),
    meter_id, 
    (last(electricity_data, "time")).*
FROM electricity_data 
GROUP BY 1,2;


Now, to speed things up, you can turn this into a continuous aggregate, which will make it much faster.

CREATE MATERIALIZED VIEW last_meter_month WITH (timescaledb.continuous) AS
SELECT time_bucket('1 month', "time", 'Europe/Amsterdam'),
    (last(electricity_data, "time")).*
FROM electricity_data 
GROUP BY 1, meter_id;


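You'll notice that I took meter_id out of the initial select list, because it will come from the composite type. I don't need the redundant column, and a view can't have two columns with the same name. I did keep meter_id in the GROUP BY, though.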

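That will speed things up nicely. However, if I were you, I would probably do this at daily granularity and build a hierarchical continuous aggregate on top of it for this kind of thing.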

CREATE MATERIALIZED VIEW last_meter_day WITH (timescaledb.continuous) AS
SELECT time_bucket('1 day', "time", 'Europe/Amsterdam'),
    (last(electricity_data, "time")).*
FROM electricity_data 
GROUP BY 1, meter_id;

CREATE MATERIALIZED VIEW last_meter_month WITH (timescaledb.continuous) AS
SELECT time_bucket('1 month',time_bucket, 'Europe/Amsterdam') as month_bucket,
    (last(last_meter_day, time_bucket)).*
FROM last_meter_day 
GROUP BY 1, meter_id;


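The reason is that we can't really refresh a monthly continuous aggregate very frequently; it is much easier to refresh the daily aggregate and then roll it up into the monthly aggregate more often. You could also keep only the daily aggregate and roll up to months on the fly in the query, since that is only 30 or so days per meter, but of course that won't be as efficient.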
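
As a rough sketch of the on-the-fly variant, the query below rolls the daily aggregate up to months at query time. It assumes last_meter_day was created as shown above, so its bucket column carries the default name time_bucket; the meter and date range are simply the ones from the question.

```sql
-- Roll daily "latest row" records up to one row per month and meter.
-- Assumes the last_meter_day view defined earlier, whose unaliased
-- bucket column is named time_bucket.
SELECT time_bucket('1 month', time_bucket, 'Europe/Amsterdam') AS month_bucket,
       meter_id,
       max("time")                 AS "time",
       last(import_low, "time")    AS import_low,
       last(import_normal, "time") AS import_normal,
       last(export_low, "time")    AS export_low,
       last(export_normal, "time") AS export_normal
FROM last_meter_day
WHERE meter_id = 1
  AND time_bucket >= '2022-01-01T00:00:00 Europe/Amsterdam'
  AND time_bucket <  '2023-01-01T00:00:00 Europe/Amsterdam'
GROUP BY 1, 2
ORDER BY 1 DESC;
```

This reads a few hundred daily rows per meter and year rather than millions of raw rows, at the cost of redoing the monthly rollup on every query.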

You then need to create continuous aggregate policies for these, depending on what you want to happen on refresh.
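
For illustration, policies for the two views might look like the following. The offsets and schedule intervals are placeholder values chosen for this sketch, not recommendations; pick them based on how late data can arrive and how fresh the results need to be.

```sql
-- Refresh the daily aggregate every hour, covering the last 3 days.
-- All intervals here are illustrative assumptions.
SELECT add_continuous_aggregate_policy('last_meter_day',
    start_offset      => INTERVAL '3 days',
    end_offset        => INTERVAL '1 hour',
    schedule_interval => INTERVAL '1 hour');

-- Refresh the monthly rollup once a day, covering the last 3 months.
SELECT add_continuous_aggregate_policy('last_meter_month',
    start_offset      => INTERVAL '3 months',
    end_offset        => INTERVAL '1 day',
    schedule_interval => INTERVAL '1 day');
```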

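Depending on what you want to do with this, I would also suggest taking a look at counter_agg, as it may be useful to you. I also recently wrote a post on our forum about how to use it with electricity meters, which may help depending on how you are processing this data.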
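
As a hedged sketch of what that could look like: if import_normal is a cumulative meter reading (an assumption, the question does not say), counter_agg from the TimescaleDB Toolkit extension can compute the consumption per month directly. Note that delta here only covers readings inside each bucket, not the gap between the last reading of one month and the first of the next.

```sql
-- counter_agg and delta come from the TimescaleDB Toolkit extension.
CREATE EXTENSION IF NOT EXISTS timescaledb_toolkit;

-- Assumes import_normal is a monotonically increasing counter.
SELECT time_bucket('1 month', "time", 'Europe/Amsterdam') AS bucket,
       meter_id,
       delta(counter_agg("time", import_normal)) AS import_normal_delta
FROM electricity_data
WHERE meter_id = 1
  AND "time" >= '2022-01-01T00:00:00 Europe/Amsterdam'
  AND "time" <  '2023-01-01T00:00:00 Europe/Amsterdam'
GROUP BY 1, 2
ORDER BY 1;
```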
