She*_*ter 5 postgresql performance query-performance postgresql-performance
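I have a Rails application with a legacy query that I would like to refurbish. The current implementation performs two SQL queries: one to fetch a large number of IDs, and a second that uses those IDs and applies some additional joins and filters to get the desired result.
I am trying to replace this with a single query that avoids the round trip, but doing so has caused a large performance degradation in my local test environment (which is a copy of the full production dataset). It appears that an index is not being used in the new query, leading to a full table scan. I had hoped the single query would keep the same performance as the original code, ideally improving on it because the IDs no longer need to be sent.
This is a minimized version of my actual problem. A slightly larger version is discussed in "Why does a list of 10,000 IDs have better performance in a complex query with multiple CTEs compared to the equivalent SQL to select them?".
There is a query that takes about 6.5 seconds to compute a list of 10000+ IDs. You can see it as the `visible_projects` CTE in the "proposed query" section below. Those IDs are then fed into this query: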
EXPLAIN (ANALYZE, BUFFERS)
WITH visible_projects AS NOT MATERIALIZED (
SELECT
id
FROM
"projects"
WHERE
"projects"."id" IN (
-- 10000+ IDs removed
)),
visible_tasks AS MATERIALIZED (
SELECT
tasks.id
FROM
tasks
WHERE
tasks.project_id IN (
SELECT
id
FROM
visible_projects))
SELECT
COUNT(1)
FROM
visible_tasks;
Query plan (depesz):
Aggregate (cost=1309912.31..1309912.32 rows=1 width=8) (actual time=148.661..153.739 rows=1 loops=1)
Buffers: shared hit=73107 read=22301
CTE visible_tasks
-> Gather (cost=43024.54..1308639.80 rows=56556 width=4) (actual time=46.337..137.260 rows=48557 loops=1)
Workers Planned: 2
Workers Launched: 2
Buffers: shared hit=73107 read=22301
-> Nested Loop (cost=42024.54..1301984.20 rows=23565 width=4) (actual time=28.871..120.682 rows=16186 loops=3)
Buffers: shared hit=73107 read=22301
-> Parallel Bitmap Heap Scan on projects (cost=42023.97..138877.16 rows=4378 width=4) (actual time=28.621..52.627 rows=3502 loops=3)
Recheck Cond: (id = ANY ('{ REMOVED_IDS }'::integer[]))
Heap Blocks: exact=3536
Buffers: shared hit=30410 read=9833
-> Bitmap Index Scan on projects_pkey (cost=0.00..42021.35 rows=10507 width=0) (actual time=35.642..35.642 rows=10507 loops=1)
Index Cond: (id = ANY ('{ REMOVED_IDS }'::integer[]))
Buffers: shared hit=30410 read=1111
-> Index Scan using test_tasks_on_project on tasks (cost=0.57..263.85 rows=182 width=8) (actual time=0.012..0.018 rows=5 loops=10507)
Index Cond: (project_id = projects.id)
Buffers: shared hit=42697 read=12468
-> CTE Scan on visible_tasks (cost=0.00..1131.12 rows=56556 width=0) (actual time=46.339..144.641 rows=48557 loops=1)
Buffers: shared hit=73107 read=22301
Planning:
Buffers: shared hit=10 read=10
Planning Time: 8.857 ms
Execution Time: 156.102 ms
This is the same query structure, but instead of inserting the 10000+ IDs directly into the `visible_projects` CTE, I have embedded the SQL that finds those IDs.
EXPLAIN (ANALYZE, BUFFERS)
WITH visible_projects AS NOT MATERIALIZED (
SELECT
id
FROM
"projects"
WHERE
"projects"."company_id" = 11171
AND "projects"."state" < 6
AND "projects"."is_template" = FALSE),
visible_tasks AS MATERIALIZED (
SELECT
tasks.id
FROM
tasks
WHERE
tasks.project_id IN (
SELECT
id
FROM
visible_projects))
SELECT
COUNT(1)
FROM
visible_tasks;
Query plan (depesz):
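Even accounting for the sum of the two previous queries, this is about 6x the time of the current implementation.
I see that this plan has chosen a `Parallel Seq Scan on tasks`, which is the main contributor to the time. What I don't understand is why it was chosen, and what I should do to get back to using the index.
From my research, I understand that Postgres does not offer query hints to force the use of an index, so I assume a good solution will involve proving to the query planner that using the index is beneficial.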
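Although there are no index hints, planner settings such as `enable_seqscan` can be switched off for a single transaction as a diagnostic. The following is a minimal sketch, assuming the simplified filter from this question; it only shows which plan the planner falls back to when sequential scans are heavily penalized, and it is not a fix for production use.

```sql
-- Diagnostic sketch only: penalize sequential scans for this transaction
-- to see which alternative plan the planner picks and how it performs.
BEGIN;
SET LOCAL enable_seqscan = off;

EXPLAIN (ANALYZE, BUFFERS)
SELECT count(*)
FROM   tasks
WHERE  project_id IN (
   SELECT id
   FROM   projects
   WHERE  company_id = 11171
   AND    state < 6
   AND    is_template = FALSE
);

ROLLBACK;  -- SET LOCAL is discarded when the transaction ends
```

If the index plan turns out faster here, the issue is most likely one of row estimates or cost settings rather than a missing index.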
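In this question I use `COUNT(1)` together with the `AS MATERIALIZED` / `AS NOT MATERIALIZED` controls to produce a smaller example.
The larger query in the application does not use these, but it does perform some filtering on the `tasks` table before producing a number of other CTEs and some aggregate metrics as the final result.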
Table "public.projects"
Column | Type | Collation | Nullable | Default
----------------------------+-------------------------------+-----------+----------+--------------------------------------
id | integer | | not null | nextval('projects_id_seq'::regclass)
name | character varying(255) | | |
description | text | | |
due | timestamp without time zone | | |
created_at | timestamp without time zone | | not null |
updated_at | timestamp without time zone | | not null |
client_id | integer | | |
company_id | integer | | |
repeat | boolean | | not null | true
end_date | timestamp without time zone | | |
prev_id | integer | | |
next_id | integer | | |
completed_tasks_count | integer | | not null | 0
tasks_count | integer | | not null | 0
done_at | timestamp without time zone | | |
state | integer | | |
schedule | text | | |
start_date | timestamp without time zone | | |
manager_id | integer | | |
partner_id | integer | | |
exschedule | text | | |
extdue | timestamp without time zone | | |
is_template | boolean | | not null | false
predicted_duration | integer | | | 0
budget | integer | | | 0
cached_effective_due_date | timestamp without time zone | | |
cached_manager_fullname | character varying(255) | | | ''::character varying
cached_partner_fullname | character varying(255) | | | ''::character varying
cached_staffs_fullnames | text | | | ''::text
cached_staffs_ids | text | | | ''::text
cached_label_ids | character varying(255) | | | ''::character varying
date_in | timestamp without time zone | | |
cached_label_sum | integer | | | 0
date_out | timestamp without time zone | | |
turn_around_time | integer | | | 0
dues_calculated_at | timestamp without time zone | | |
dues | timestamp without time zone[] | | |
dues_rewind | integer[] | | |
quickbooks_item_id | integer | | |
perform_final_review | boolean | | not null | false
quickbooks_desktop_item_id | integer | | |
billing_model_type | character varying | | not null | 'staff'::character varying
series_id | integer | | |
shared | boolean | | | false
Indexes:
"projects_pkey" PRIMARY KEY, btree (id)
"index_projects_on_cached_effective_due_date" btree (cached_effective_due_date)
"index_projects_on_client_id" btree (client_id)
"index_projects_on_company_id" btree (company_id)
"index_projects_on_manager_id" btree (manager_id)
"index_projects_on_next_id" btree (next_id)
"index_projects_on_partner_id" btree (partner_id)
"index_projects_on_series_id" btree (series_id)
"index_projects_on_shared_and_is_template" btree (shared, is_template) WHERE shared = true AND is_template = true
Foreign-key constraints:
"fk_rails_243d23cb48" FOREIGN KEY (quickbooks_desktop_item_id) REFERENCES quickbooks_desktop_items(id)
"fk_rails_33ba8711de" FOREIGN KEY (quickbooks_item_id) REFERENCES quickbooks_items(id)
"fk_rails_fcf0ca7614" FOREIGN KEY (series_id) REFERENCES series(id) NOT VALID
Referenced by:
TABLE "tasks" CONSTRAINT "tasks_project_id_fkey" FOREIGN KEY (project_id) REFERENCES projects(id)
The `projects` table has 14,273,833 rows.
is_template = true
Table "public.tasks"
Column | Type | Collation | Nullable | Default
-------------------------+-----------------------------+-----------+----------+-----------------------------------
id | integer | | not null | nextval('tasks_id_seq'::regclass)
name | character varying(255) | | |
description | text | | |
duedate | timestamp without time zone | | |
created_at | timestamp without time zone | | not null |
updated_at | timestamp without time zone | | not null |
project_id | integer | | not null |
done | boolean | | not null | false
position | integer | | |
done_at | timestamp without time zone | | |
dueafter | integer | | |
done_by_user_id | integer | | |
predicted_duration | integer | | |
auto_predicted_duration | integer | | | 0
assignable_id | integer | | |
assignable_type | character varying | | |
will_assign_to_client | boolean | | not null | false
Indexes:
"tasks_pkey" PRIMARY KEY, btree (id)
"index_tasks_on_assignable_type_and_assignable_id" btree (assignable_type, assignable_id)
"index_tasks_on_done_by_user_id" btree (done_by_user_id)
"index_tasks_on_duedate" btree (duedate)
"test_tasks_on_project" btree (project_id)
Foreign-key constraints:
"tasks_project_id_fkey" FOREIGN KEY (project_id) REFERENCES projects(id)
The `tasks` table has 76,716,433 rows.
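The main reason for the different query plan is probably the increased number of rows that Postgres estimates will be returned from `projects`: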
(cost=0.00..42021.35 rows=10507 width=0) (actual time=35.642..35.642 rows=10507 loops=1)
vs.
(cost=0.43..277961.56 rows=31322 width=4) (actual time=0.591..6970.696 rows=10507 loops=3)
An overestimate by a factor of 3 is not dramatic, but it is evidently enough to favor a different (inferior) query plan.
Assuming `projects.is_template` is mostly `false`, I suggest these multicolumn indexes:
CREATE INDEX ON projects(company_id, state);
Equality first, range later.
You can also try increasing the statistics target for `company_id` and `state`, then run `ANALYZE` on the table to get better estimates.
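Raising the statistics target could look like the sketch below. The value 1000 is an illustrative assumption, not a recommendation from the answer; the default is 100 (`default_statistics_target`).

```sql
-- Assumed target of 1000 for illustration; tune to taste.
ALTER TABLE projects ALTER COLUMN company_id SET STATISTICS 1000;
ALTER TABLE projects ALTER COLUMN state      SET STATISTICS 1000;

-- New targets only take effect once statistics are gathered again.
ANALYZE projects;
```

After `ANALYZE`, rerunning `EXPLAIN` on the query shows whether the row estimate for `projects` has moved closer to the actual 10507.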
And:
CREATE INDEX ON tasks (project_id, id);
Also increase the statistics target for `tasks.project_id` and run `ANALYZE`.
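A sketch of the same step for `tasks`, again with an assumed target of 1000, followed by a look at what the planner now believes about the column through the standard `pg_stats` view:

```sql
-- Assumed target of 1000 for illustration.
ALTER TABLE tasks ALTER COLUMN project_id SET STATISTICS 1000;
ANALYZE tasks;

-- Inspect the gathered statistics for the column.
-- A negative n_distinct is a fraction of the row count, a positive one an absolute count.
SELECT attname, n_distinct, null_frac
FROM   pg_stats
WHERE  schemaname = 'public'
AND    tablename  = 'tasks'
AND    attname    = 'project_id';
```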
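In both cases, the multicolumn index can replace the existing index on just `projects.company_id` / `tasks.project_id`. Since all columns are `integer`, the index size will be the same, except for the effect of index deduplication (added with Postgres 13), which shows strongly in your test for the highly duplicative `tasks.project_id`.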
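If the multicolumn indexes are adopted, the single-column ones could then be retired. A sketch, using the index names from the table definitions in the question and assuming the new indexes already exist and have been verified with `EXPLAIN`:

```sql
-- Check the size of the old indexes before dropping them (optional).
SELECT pg_size_pretty(pg_relation_size('index_projects_on_company_id')) AS projects_idx,
       pg_size_pretty(pg_relation_size('test_tasks_on_project'))        AS tasks_idx;

-- Drop the single-column indexes that the multicolumn ones supersede.
-- Plain DROP INDEX takes an exclusive lock on the table; plan accordingly.
DROP INDEX index_projects_on_company_id;
DROP INDEX test_tasks_on_project;
```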
This query:
SELECT t.id
FROM projects p
JOIN tasks t ON t.project_id = p.id
WHERE p.company_id = 11171
AND p.state < 6
AND p.is_template = FALSE;
using a direct join should be faster.
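To compare the join against the CTE version on equal terms, it can be wrapped in the same count and run under `EXPLAIN`. A sketch reusing the filter values from the question; `count(*)` is used here in place of `COUNT(1)`, which returns the same result:

```sql
EXPLAIN (ANALYZE, BUFFERS)
SELECT count(*)
FROM   projects p
JOIN   tasks t ON t.project_id = p.id
WHERE  p.company_id = 11171
AND    p.state < 6
AND    p.is_template = FALSE;
```

The interesting parts of the output are the estimated versus actual rows on `projects` and whether `tasks` is reached through an index scan or a `Parallel Seq Scan`.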