PostgreSQL - Working with an array of thousands of elements

Juk*_*rpa 9 postgresql performance postgresql-performance

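I want to select rows based on whether a column's value is contained in a large list of values that I pass as an integer array.

Here is the query I currently use: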

SELECT item_id, other_stuff, ...
FROM (
    SELECT
        -- Partitioned row number as we only want N rows per id
        ROW_NUMBER() OVER (PARTITION BY item_id ORDER BY start_date) AS r,
        item_id, other_stuff, ...
    FROM mytable
    WHERE
        item_id = ANY ($1) -- Integer array
        AND end_date > $2
    ORDER BY item_id ASC, start_date ASC, allowed ASC
) x
WHERE x.r <= 12

The table is structured like this:

    Column     |            Type             | Collation | Nullable | Default 
---------------+-----------------------------+-----------+----------+---------
 item_id       | integer                     |           | not null | 
 allowed       | boolean                     |           | not null | 
 start_date    | timestamp without time zone |           | not null | 
 end_date      | timestamp without time zone |           | not null | 
 ...


 Indexes:
    "idx_dtr_query" btree (item_id, start_date, allowed, end_date)
    ...
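
I arrived at this index after trying different indexes and running EXPLAIN on the query. It was the most efficient one for both the filtering and the sorting. Here is the EXPLAIN ANALYZE output for the query: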

Subquery Scan on x  (cost=0.56..368945.41 rows=302230 width=73) (actual time=0.021..276.476 rows=168395 loops=1)
  Filter: (x.r <= 12)
  Rows Removed by Filter: 90275
  ->  WindowAgg  (cost=0.56..357611.80 rows=906689 width=73) (actual time=0.019..248.267 rows=258670 loops=1)
        ->  Index Scan using idx_dtr_query on mytable  (cost=0.56..339478.02 rows=906689 width=73) (actual time=0.013..130.362 rows=258670 loops=1)
              Index Cond: ((item_id = ANY ('{/* 15,000 integers */}'::integer[])) AND (end_date > '2018-03-30 12:08:00'::timestamp without time zone))
Planning time: 30.349 ms
Execution time: 284.619 ms
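
The problem is that the int array can contain up to 15,000 elements or so, and in that case the query becomes quite slow (about 800 ms on my laptop, a recent Dell XPS).

I thought that passing the int array as a parameter might be slow, and since the list of ids can be stored in the database beforehand, I tried doing that. I stored them as an array in another table and used item_id = ANY (SELECT UNNEST(item_ids) FROM ...), which was slower than my current approach. I also tried storing them one per row and using item_id IN (SELECT item_id FROM ...), which was even slower, even with only the rows relevant to my test case in the table.

Is there a better way of doing this?

Update: following Evan's comments, I tried another approach. Each item is part of several groups, so instead of passing the group's item ids, I tried adding the group ids to mytable: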

    Column     |            Type             | Collation | Nullable | Default 
---------------+-----------------------------+-----------+----------+---------
 item_id       | integer                     |           | not null | 
 allowed       | boolean                     |           | not null | 
 start_date    | timestamp without time zone |           | not null | 
 end_date      | timestamp without time zone |           | not null | 
 group_ids     | integer[]                   |           | not null | 
 ...

 Indexes:
    "idx_dtr_query" btree (item_id, start_date, allowed, end_date)
    "idx_dtr_group_ids" gin (group_ids)
    ...

The new query ($1 is the targeted group id):

SELECT item_id, other_stuff, ...
FROM (
    SELECT
        -- Partitioned row number as we only want N rows per id
        ROW_NUMBER() OVER (PARTITION BY item_id ORDER BY start_date) AS r,
        item_id, other_stuff, ...
    FROM mytable
    WHERE
        $1 = ANY (group_ids)
        AND end_date > $2
    ORDER BY item_id ASC, start_date ASC, allowed ASC
) x
WHERE x.r <= 12

EXPLAIN ANALYZE:

Subquery Scan on x  (cost=123356.60..137112.58 rows=131009 width=74) (actual time=811.337..1087.880 rows=172023 loops=1)
  Filter: (x.r <= 12)
  Rows Removed by Filter: 219726
  ->  WindowAgg  (cost=123356.60..132199.73 rows=393028 width=74) (actual time=811.330..1040.121 rows=391749 loops=1)
        ->  Sort  (cost=123356.60..124339.17 rows=393028 width=74) (actual time=811.311..868.127 rows=391749 loops=1)
              Sort Key: item_id, start_date, allowed
              Sort Method: external sort  Disk: 29176kB
              ->  Seq Scan on mytable (cost=0.00..69370.90 rows=393028 width=74) (actual time=0.105..464.126 rows=391749 loops=1)
                    Filter: ((end_date > '2018-04-06 12:00:00'::timestamp without time zone) AND (2928 = ANY (group_ids)))
                    Rows Removed by Filter: 1482567
Planning time: 0.756 ms
Execution time: 1098.348 ms

There might be room for improvement with the indexes, but I have a hard time understanding how postgres uses them, so I'm not sure what to change.
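
Two details in the plan above may be worth checking. The filter is written as `2928 = ANY (group_ids)`, which is not one of the operator forms the built-in GIN array operator class can serve (`@>`, `<@`, `&&`, `=`), so idx_dtr_group_ids cannot be used for it. The sort also spills to disk (`external sort  Disk: 29176kB`), which points at work_mem. The following is a minimal sketch, not a measured fix: the group id and timestamp are the example values from the plan, and the work_mem value is a guess.

```sql
-- Sketch only: 2928 and the timestamp are copied from the plan above;
-- '128MB' is an assumed value, tune it for your own server.
BEGIN;

-- SET LOCAL lasts until the end of this transaction only.
SET LOCAL work_mem = '128MB';

EXPLAIN (ANALYZE, BUFFERS)
SELECT item_id, start_date, end_date, allowed
FROM mytable
WHERE group_ids @> ARRAY[2928]   -- containment form, indexable by a GIN index
  AND end_date > timestamp '2018-04-06 12:00:00'
ORDER BY item_id, start_date, allowed;

COMMIT;
```

Even with the containment operator the planner may keep the sequential scan: in the plan above roughly 390,000 of about 1.87 million rows match the group, and at that selectivity an index scan is not guaranteed to be cheaper.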

Eva*_*oll 1

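Is there a better way of doing this?

Yes, use a temp table. There is nothing wrong with creating an indexed temp table when your query is that insane.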

BEGIN;
  -- The PRIMARY KEY already creates a unique index on item_id,
  -- so no separate CREATE INDEX is needed.
  CREATE TEMP TABLE myitems ( item_id int PRIMARY KEY );
  INSERT INTO myitems(item_id) VALUES (1), (2); -- and on and on
COMMIT;

ANALYZE myitems;

SELECT item_id, other_stuff, ...
FROM (
  SELECT
      -- Partitioned row number as we only want N rows per id
      ROW_NUMBER() OVER (PARTITION BY item_id ORDER BY start_date) AS r,
      item_id, other_stuff, ...
  FROM mytable
  INNER JOIN myitems USING (item_id)
  WHERE end_date > $2
  ORDER BY item_id ASC, start_date ASC, allowed ASC
) x
WHERE x.r <= 12;
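
As a variation on the query above (not something the question or this answer measured), the per-item limit can also be expressed with a LATERAL subquery instead of ROW_NUMBER(). The sketch below assumes the myitems temp table from above and the index idx_dtr_query from the question; the timestamp is the example value from the question's first plan, and the selected columns are limited to the ones the question shows.

```sql
-- Sketch only: top 12 rows per item_id via LATERAL.
-- For each id in myitems, the inner query can walk idx_dtr_query
-- (item_id, start_date, allowed, end_date) in order and stop after 12 matches.
SELECT t.item_id, t.start_date, t.end_date, t.allowed
FROM myitems AS i
CROSS JOIN LATERAL (
    SELECT m.item_id, m.start_date, m.end_date, m.allowed
    FROM mytable AS m
    WHERE m.item_id = i.item_id
      AND m.end_date > timestamp '2018-03-30 12:08:00'
    ORDER BY m.start_date ASC, m.allowed ASC
    LIMIT 12
) AS t
ORDER BY t.item_id, t.start_date, t.allowed;
```

How much this helps depends on how many rows per item fall beyond the first 12. In the question's first plan only about a third of the scanned rows were discarded by the `r <= 12` filter, so the gain may be modest; compare both forms with EXPLAIN ANALYZE.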

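But even better than that...

"500k different item_id" ... "int array can contain up to 15,000 elements"

You're selecting 3% of the database individually. I have to wonder whether it would be better to create groups/tags etc. in the schema itself. I have never personally had to send 15,000 different IDs into a query.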
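
One way to read "groups in the schema" is a membership table that is joined instead of passing ids. This is a minimal sketch under assumptions: item_groups and its columns are hypothetical names, and the group id and timestamp are the example values from the question's second plan.

```sql
-- Hypothetical membership table: one row per (group, item) pair.
-- The primary key gives an index that leads with group_id.
CREATE TABLE item_groups (
    group_id integer NOT NULL,
    item_id  integer NOT NULL,
    PRIMARY KEY (group_id, item_id)
);

SELECT item_id, start_date, end_date, allowed
FROM (
    SELECT
        -- Partitioned row number as we only want N rows per id
        ROW_NUMBER() OVER (PARTITION BY m.item_id ORDER BY m.start_date) AS r,
        m.item_id, m.start_date, m.end_date, m.allowed
    FROM item_groups AS g
    JOIN mytable AS m ON m.item_id = g.item_id
    WHERE g.group_id = 2928
      AND m.end_date > timestamp '2018-04-06 12:00:00'
) x
WHERE x.r <= 12
ORDER BY item_id, start_date, allowed;
```

The question reports that a similar IN (SELECT ...) variant was slower than the array parameter, so this layout should be measured rather than assumed to be faster. Running ANALYZE on item_groups after loading it gives the planner row estimates to work with.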