Juk*_*rpa 9 postgresql performance postgresql-performance
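I want to select rows based on whether a column's value is contained in a large list of values that I pass as an integer array.
Here is the query I currently use: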
SELECT item_id, other_stuff, ...
FROM (
SELECT
-- Partitioned row number as we only want N rows per id
ROW_NUMBER() OVER (PARTITION BY item_id ORDER BY start_date) AS r,
item_id, other_stuff, ...
FROM mytable
WHERE
item_id = ANY ($1) -- Integer array
AND end_date > $2
ORDER BY item_id ASC, start_date ASC, allowed ASC
) x
WHERE x.r <= 12
The table is structured as follows:
Column | Type | Collation | Nullable | Default
---------------+-----------------------------+-----------+----------+---------
item_id | integer | | not null |
allowed | boolean | | not null |
start_date | timestamp without time zone | | not null |
end_date | timestamp without time zone | | not null |
...
Indexes:
"idx_dtr_query" btree (item_id, start_date, allowed, end_date)
...
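I came up with this index after trying different indexes and running EXPLAIN on the query. It was the most efficient one for both querying and sorting. Here is the EXPLAIN ANALYZE output for the query: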
Subquery Scan on x (cost=0.56..368945.41 rows=302230 width=73) (actual time=0.021..276.476 rows=168395 loops=1)
Filter: (x.r <= 12)
Rows Removed by Filter: 90275
-> WindowAgg (cost=0.56..357611.80 rows=906689 width=73) (actual time=0.019..248.267 rows=258670 loops=1)
-> Index Scan using idx_dtr_query on mytable (cost=0.56..339478.02 rows=906689 width=73) (actual time=0.013..130.362 rows=258670 loops=1)
Index Cond: ((item_id = ANY ('{/* 15,000 integers */}'::integer[])) AND (end_date > '2018-03-30 12:08:00'::timestamp without time zone))
Planning time: 30.349 ms
Execution time: 284.619 ms
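The problem is that the int array can contain up to 15,000 elements or so, and the query gets quite slow in that case (about 800 ms on my laptop, a recent Dell XPS).
I thought that passing the int array as a parameter could be slow, and since the list of ids can be stored beforehand in the database, I tried doing that. I stored them as an array in another table and used item_id = ANY (SELECT UNNEST(item_ids) FROM ...), which was slower than my current approach. I also tried storing them row by row and using item_id IN (SELECT item_id FROM ...), which was even slower, even with only the rows relevant to my test case in the table.
Is there a better way of doing this?
Update: following Evan's comments, I tried another approach. Each item is part of several groups, so instead of passing the group's item ids, I tried adding the group ids to mytable: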
Column | Type | Collation | Nullable | Default
---------------+-----------------------------+-----------+----------+---------
item_id | integer | | not null |
allowed | boolean | | not null |
start_date | timestamp without time zone | | not null |
end_date | timestamp without time zone | | not null |
group_ids | integer[] | | not null |
...
Indexes:
"idx_dtr_query" btree (item_id, start_date, allowed, end_date)
"idx_dtr_group_ids" gin (group_ids)
...
The new query ($1 is the targeted group id):
SELECT item_id, other_stuff, ...
FROM (
SELECT
-- Partitioned row number as we only want N rows per id
ROW_NUMBER() OVER (PARTITION BY item_id ORDER BY start_date) AS r,
item_id, other_stuff, ...
FROM mytable
WHERE
$1 = ANY (group_ids)
AND end_date > $2
ORDER BY item_id ASC, start_date ASC, allowed ASC
) x
WHERE x.r <= 12
EXPLAIN ANALYZE:
Subquery Scan on x (cost=123356.60..137112.58 rows=131009 width=74) (actual time=811.337..1087.880 rows=172023 loops=1)
Filter: (x.r <= 12)
Rows Removed by Filter: 219726
-> WindowAgg (cost=123356.60..132199.73 rows=393028 width=74) (actual time=811.330..1040.121 rows=391749 loops=1)
-> Sort (cost=123356.60..124339.17 rows=393028 width=74) (actual time=811.311..868.127 rows=391749 loops=1)
Sort Key: item_id, start_date, allowed
Sort Method: external sort Disk: 29176kB
-> Seq Scan on mytable (cost=0.00..69370.90 rows=393028 width=74) (actual time=0.105..464.126 rows=391749 loops=1)
Filter: ((end_date > '2018-04-06 12:00:00'::timestamp without time zone) AND (2928 = ANY (group_ids)))
Rows Removed by Filter: 1482567
Planning time: 0.756 ms
Execution time: 1098.348 ms
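There may be room for improvement in the indexes, but I have a hard time understanding how postgres uses them, so I'm not sure what to change.
Is there a better way of doing this?
Yes, use a temp table. There is nothing wrong with creating an indexed temp table when your query is that insane.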
BEGIN;
CREATE TEMP TABLE myitems ( item_id int PRIMARY KEY );
INSERT INTO myitems(item_id) VALUES (1), (2); -- and on and on
-- No separate CREATE INDEX is needed: the PRIMARY KEY above already creates a unique index on item_id
COMMIT;
ANALYZE myitems;
SELECT item_id, other_stuff, ...
FROM (
SELECT
-- Partitioned row number as we only want N rows per id
ROW_NUMBER() OVER (PARTITION BY item_id ORDER BY start_date) AS r,
item_id, other_stuff, ...
FROM mytable
INNER JOIN myitems USING (item_id)
WHERE end_date > $2
ORDER BY item_id ASC, start_date ASC, allowed ASC
) x
WHERE x.r <= 12;
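If the ids already arrive as the integer array parameter from the question, the temp table does not have to be filled with a long VALUES list. The following is only a sketch under that assumption: $1 is the same integer[] parameter as in the original query, and ON COMMIT DROP is an optional choice that requires the main query to run inside the same transaction.

```sql
BEGIN;

-- ON COMMIT DROP removes the temp table automatically when the transaction ends
CREATE TEMP TABLE myitems (item_id int PRIMARY KEY) ON COMMIT DROP;

-- $1 is assumed to be the integer[] parameter from the question;
-- DISTINCT guards the primary key against duplicate ids in the array
INSERT INTO myitems (item_id)
SELECT DISTINCT unnest($1::int[]);

-- Give the planner row estimates for the freshly filled table
ANALYZE myitems;

-- Run the SELECT ... INNER JOIN myitems USING (item_id) query here,
-- before the COMMIT drops the table

COMMIT;
```

Whether this beats item_id = ANY ($1) depends on the data and on the plan postgres picks, so it is worth comparing both with EXPLAIN ANALYZE.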
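But even better than that...
"500k different item_id" ... "int array can contain up to 15,000 elements"
You're selecting 3% of the database individually. I have to wonder whether you'd be better off creating groups/tags etc. in the schema itself. I have personally never had to send 15,000 different ids into a query.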
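As a sketch of how the group idea might be queried, assuming the group_ids integer[] column and the GIN index idx_dtr_group_ids from the question's update: the default GIN operator class for arrays supports the operators @>, <@, && and =, but not the form $1 = ANY (group_ids), which is probably why the second plan shows a Seq Scan. Rewriting the filter as a containment test gives the planner the option of using the index. The column list below is trimmed to columns that appear in the question.

```sql
SELECT item_id, start_date, end_date
FROM (
    SELECT
        -- Partitioned row number as we only want N rows per id
        ROW_NUMBER() OVER (PARTITION BY item_id ORDER BY start_date) AS r,
        item_id, start_date, end_date
    FROM mytable
    WHERE
        -- Containment form of "$1 = ANY (group_ids)"; $1 is an integer group id
        group_ids @> ARRAY[$1::int]
        AND end_date > $2
) x
WHERE x.r <= 12;
```

Even with an indexable operator, postgres may still prefer a sequential scan when a group matches a large share of the table. In the plan from the question the filter kept 391,749 rows and removed 1,482,567, roughly a fifth of the table, so the gain for a group of that size is uncertain and should be checked with EXPLAIN ANALYZE.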