Tim*_*san 5 postgresql performance postgresql-9.5 query-performance
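OK, I asked a question about a large data set before, but it never got answered, so I decided to cut it down, ask about a smaller subset of the earlier setup, and simplify what I am trying to accomplish in this new question. Hopefully this makes it a bit clearer.
I have a large table (report_drugs) that takes 1775 MB on disk and holds slightly over 33 million rows. Table layout: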
Column | Type | Modifiers
---------------+-----------------------------+-----------
rid | integer | not null
drug | integer | not null
created | timestamp without time zone |
reason | text |
duration | integer |
drugseq | integer |
effectiveness | integer |
Indexes:
"report_drugs_drug_idx" btree (drug) CLUSTER
"report_drugs_drug_rid_idx" btree (drug, rid)
"report_drugs_reason_idx" btree (reason)
"report_drugs_reason_rid_idx" btree (reason, rid)
"report_drugs_rid_idx" btree (rid)
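As you can see, I have several indexes (not all of them relevant to this question) and have CLUSTERed the table on the drug column index, as this is mainly used for scoping. The table is also VACUUM ANALYZEd, both automatically and manually by me, before I take any metrics.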
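For reference, the maintenance described above corresponds to statements along these lines. This is a minimal sketch using the table and index names from the layout; the exact commands that were run are not shown in the post.

```sql
-- Physically reorder the table by the drug index
-- (holds an ACCESS EXCLUSIVE lock on the table while it runs).
CLUSTER report_drugs USING report_drugs_drug_idx;

-- Reclaim dead rows and refresh planner statistics before measuring.
VACUUM ANALYZE report_drugs;
```

Note that CLUSTER is a one-time operation: rows inserted or updated afterwards are not kept in index order, so the physical ordering degrades until the table is clustered again.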
Yet a simple query like this:
SELECT drug, reason FROM report_drugs WHERE drug = ANY(VALUES (9557), (17848),
(17880), (18223), (18550), (19020), (19084), (19234), (21295), (21742),
(23085), (26017), (27016), (29317), (33566), (35818), (37394), (39971),
(41505), (42162), (44000), (45168), (47386), (48848), (51472), (51570),
(51802), (52489), (52848), (53663), (54591), (55506), (55922), (57209),
(57671), (59311), (62022), (62532), (63485), (64134), (66236), (67394),
(67586), (68134), (68934), (70035), (70589), (70896), (73466), (75931),
(78686), (78985), (79217), (83294), (83619), (84964), (85831), (88330),
(89998), (90440), (91171), (91698), (91886), (91887), (93219), (93766),
(94009), (96341), (101475), (104623), (104973), (105216), (105496),
(106428), (110412), (119567), (121154));
takes over 7 seconds to complete, with the following query plan:
Nested Loop (cost=1.72..83532.00 rows=24164 width=26) (actual time=0.947..7385.490 rows=264610 loops=1)
  -> HashAggregate (cost=1.16..1.93 rows=77 width=4) (actual time=0.017..0.036 rows=77 loops=1)
        Group Key: "*VALUES*".column1
        -> Values Scan on "*VALUES*" (cost=0.00..0.96 rows=77 width=4) (actual time=0.001..0.007 rows=77 loops=1)
  -> Index Scan using report_drugs_drug_idx on report_drugs (cost=0.56..1081.67 rows=314 width=26) (actual time=0.239..95.568 rows=3436 loops=77)
        Index Cond: (drug = "*VALUES*".column1)
Planning time: 7.009 ms
Execution time: 7393.408 ms
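The more values I add to my ANY(VALUES(..)) clause, the slower it gets. This query can sometimes contain 200+ values, and then it takes over 30 seconds to complete. With only a few values (4, for example), however, the query runs within 200 ms. So clearly it is this part of the WHERE clause that causes the slowdown.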
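The first plan is consistent with that. In EXPLAIN ANALYZE output, the actual time and rows of a node are per-loop averages, so the totals for the inner index scan come from multiplying by loops. A small worked check, using only numbers from that plan:

```sql
-- Inner index scan: 95.568 ms per loop, 3436 rows per loop, 77 loops.
SELECT 77 * 95.568  AS index_scan_total_ms,   -- 7358.736, nearly all of the 7393 ms runtime
       77 * 3436    AS approx_rows_returned,  -- 264572, close to the reported 264610 (per-loop rows are rounded)
       3436 / 314.0 AS rows_vs_estimate;      -- actual rows per drug value relative to the planner's estimate of 314
```

In other words, the cost grows roughly linearly with the number of drug values, and each value returns about ten times more rows than the planner expects.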
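What can I do to make this query perform better? What obvious points am I missing here?
My hardware and database settings:
I am running the cluster from an SSD drive. The system has 24 GB of total memory, runs Debian, and uses a 4 GHz 8-core i7-4790 processor. That should be sufficient hardware for this kind of data set.
Some important postgresql.conf readings: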
A side question:
Previously I used WHERE drug = ANY(ARRAY[..]), but I found that WHERE drug = ANY(VALUES(..)) gives a significant speed increase. Why does that make a difference?
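For comparison, the two forms look like this with a shortened list (the first three IDs from the query above). This is only a sketch for reproducing the difference; running EXPLAIN (ANALYZE, BUFFERS) on each shows which plan the planner picks on a given system.

```sql
-- Array form: the list is a single array constant tested against each row.
EXPLAIN (ANALYZE, BUFFERS)
SELECT drug, reason FROM report_drugs
WHERE drug = ANY(ARRAY[9557, 17848, 17880]);

-- VALUES form: the list is a row source that the planner can join against.
EXPLAIN (ANALYZE, BUFFERS)
SELECT drug, reason FROM report_drugs
WHERE drug = ANY(VALUES (9557), (17848), (17880));
```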
Edit 1 - JOIN on VALUES instead of the WHERE clause
As a_horse_with_no_name pointed out in the comments, I tried removing the WHERE clause and running the query with a JOIN on the drug values:
Query:
SELECT drug, reason FROM report_drugs d JOIN (VALUES (9557), (17848),
(17880), (18223), (18550), (19020), (19084), (19234), (21295), (21742),
(23085), (26017), (27016), (29317), (33566), (35818), (37394), (39971),
(41505), (42162), (44000), (45168), (47386), (48848), (51472), (51570),
(51802), (52489), (52848), (53663), (54591), (55506), (55922), (57209),
(57671), (59311), (62022), (62532), (63485), (64134), (66236), (67394),
(67586), (68134), (68934), (70035), (70589), (70896), (73466), (75931),
(78686), (78985), (79217), (83294), (83619), (84964), (85831), (88330),
(89998), (90440), (91171), (91698), (91886), (91887), (93219), (93766),
(94009), (96341), (101475), (104623), (104973), (105216), (105496),
(106428), (110412), (119567), (121154)) as x(d) on x.d = d.drug;
Plan (with analyze and buffers, as requested by jjanes):
Nested Loop (cost=0.56..83531.04 rows=24164 width=26) (actual time=1.003..6927.080 rows=264610 loops=1)
  Buffers: shared hit=12514 read=111251
  -> Values Scan on "*VALUES*" (cost=0.00..0.96 rows=77 width=4) (actual time=0.000..0.059 rows=77 loops=1)
  -> Index Scan using report_drugs_drug_idx on report_drugs d (cost=0.56..1081.67 rows=314 width=26) (actual time=0.217..89.551 rows=3436 loops=77)
        Index Cond: (drug = "*VALUES*".column1)
        Buffers: shared hit=12514 read=111251
Planning time: 7.616 ms
Execution time: 6936.466 ms
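This seems to have had no effect, however. Although the query plan changed slightly, the execution time is roughly the same and the query is still slow.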
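The Buffers line in that plan hints at where the time goes: 111251 blocks were not found in shared_buffers and had to be read in. A rough size calculation, assuming the default 8 kB block size (an assumption; SHOW block_size confirms it on a given server):

```sql
-- Blocks reported as "read" in the plan, converted to bytes at 8192 bytes per block.
SELECT pg_size_pretty(111251 * 8192::bigint) AS read_into_shared_buffers;

-- Settings worth checking alongside that number.
SHOW block_size;
SHOW shared_buffers;
```

Under that assumption the query pulls in roughly 870 MB for a 1775 MB table. Whether those reads were served from the operating system cache or from the SSD is not visible in this plan.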
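Edit 2 - JOIN on a temporary table instead of on VALUES
As Lennart suggested, I tried creating a temporary table in a single transaction, filling it with the drug values, and joining against it. Although I gained roughly 2 seconds, the query is still very slow at just over 5 seconds.
The query plan has changed from a nested loop to a hash join, and it now performs a sequential scan on the report_drugs table. Could this be a missing index (the drug column in the report_drugs table does have an index...)?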
Hash Join (cost=67.38..693627.71 rows=800224 width=26) (actual time=0.711..4999.222 rows=264610 loops=1)
  Hash Cond: (d.drug = t.drug)
  -> Seq Scan on report_drugs d (cost=0.00..560537.16 rows=33338916 width=26) (actual time=0.410..3144.117 rows=33338915 loops=1)
  -> Hash (cost=35.50..35.50 rows=2550 width=4) (actual time=0.012..0.012 rows=77 loops=1)
        Buckets: 4096 Batches: 1 Memory Usage: 35kB
        -> Seq Scan on t (cost=0.00..35.50 rows=2550 width=4) (actual time=0.002..0.005 rows=77 loops=1)
Planning time: 7.030 ms
Execution time: 5005.621 ms
Have you tried rewriting it with a join? Something like:
SELECT d.drug, d.reason
FROM report_drugs d
JOIN (VALUES (9557), (17848), (17880), (18223), (18550), (19020)
, (19084), (19234), (21295), (21742), (23085), (26017)
, ... ) as T(drug)
ON d.drug = T.drug;
As a side note, some of your indexes appear to be redundant.
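In the index list from the question, report_drugs_drug_idx is a leading-column prefix of report_drugs_drug_rid_idx, and report_drugs_reason_idx is a prefix of report_drugs_reason_rid_idx. One way to see which index in each pair is actually used, sketched under the assumption that usage statistics have been collected over a representative workload:

```sql
-- Per-index scan counts and on-disk size for the table.
SELECT indexrelname,
       idx_scan,
       pg_size_pretty(pg_relation_size(indexrelid)) AS index_size
FROM pg_stat_user_indexes
WHERE relname = 'report_drugs'
ORDER BY indexrelname;
```

Since report_drugs_drug_idx is the index the table is clustered on, dropping that particular one would need more care than the others.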
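Edit: using a temporary table
You may also want to try a temporary table instead of a virtual table. Do the following inside a transaction: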
CREATE TEMPORARY TABLE T (drug int not null primary key) ON COMMIT DROP;
INSERT INTO T(drug)
VALUES (9557), (17848), (17880), (18223), (18550), ...;
SELECT d.drug, d.reason
FROM report_drugs d
JOIN T
ON d.drug = T.drug;
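Because autovacuum does not process temporary tables, the planner has no statistics for T unless it is analyzed by hand; the hash join plan in the question estimates 2550 rows for t against 77 actual, which fits that. Below is a variant of the steps above with an explicit ANALYZE, using the question's table name report_drugs and a shortened list of its IDs. It is a sketch to try, not a measured fix.

```sql
BEGIN;

CREATE TEMPORARY TABLE t (drug int NOT NULL PRIMARY KEY) ON COMMIT DROP;

INSERT INTO t (drug)
VALUES (9557), (17848), (17880), (18223), (18550);

-- Give the planner real row counts for t before it plans the join.
ANALYZE t;

EXPLAIN (ANALYZE, BUFFERS)
SELECT d.drug, d.reason
FROM report_drugs d
JOIN t ON d.drug = t.drug;

COMMIT;
```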