Why does a list of 10,000 IDs perform better than using the equivalent SQL to select them?

She*_*ter 5 postgresql performance query-performance postgresql-performance

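I have a Rails application with a legacy query that I would like to refurbish. The current implementation runs two SQL queries: one fetches a large number of IDs, and the second takes those IDs and applies some additional joins and filters to get the desired result.

I tried to replace this with a single query that avoids the round trip, but doing so caused a large performance degradation in my local test environment (which is a copy of the full production dataset). It appears that an index is not being used in the new query, leading to a full table scan. I had hoped the single query would keep the same performance as the original code, ideally improving on it because the IDs no longer need to be sent over.

This is a minimized version of my actual problem. A slightly larger version is discussed in "Why does a list of 10,000 IDs have better performance than the equivalent SQL to select them in a complex query with multiple CTEs?".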

Current query

A first query takes about 6.5 seconds to compute a list of 10,000+ IDs. You can see it as the visible_projects CTE in the "Proposed query" section below. Those IDs are then fed into this query:

EXPLAIN (ANALYZE, BUFFERS)
WITH visible_projects AS NOT MATERIALIZED (
    SELECT
        id
    FROM
        "projects"
    WHERE
        "projects"."id" IN (
            -- 10000+ IDs removed
)),
visible_tasks AS MATERIALIZED (
    SELECT
        tasks.id
    FROM
        tasks
    WHERE
        tasks.project_id IN (
            SELECT
                id
            FROM
                visible_projects))
SELECT
    COUNT(1)
FROM
    visible_tasks;

Query plan (depesz):

Aggregate  (cost=1309912.31..1309912.32 rows=1 width=8) (actual time=148.661..153.739 rows=1 loops=1)
   Buffers: shared hit=73107 read=22301
   CTE visible_tasks
     ->  Gather  (cost=43024.54..1308639.80 rows=56556 width=4) (actual time=46.337..137.260 rows=48557 loops=1)
           Workers Planned: 2
           Workers Launched: 2
           Buffers: shared hit=73107 read=22301
           ->  Nested Loop  (cost=42024.54..1301984.20 rows=23565 width=4) (actual time=28.871..120.682 rows=16186 loops=3)
                 Buffers: shared hit=73107 read=22301
                 ->  Parallel Bitmap Heap Scan on projects  (cost=42023.97..138877.16 rows=4378 width=4) (actual time=28.621..52.627 rows=3502 loops=3)
                       Recheck Cond: (id = ANY ('{ REMOVED_IDS }'::integer[]))
                       Heap Blocks: exact=3536
                       Buffers: shared hit=30410 read=9833
                       ->  Bitmap Index Scan on projects_pkey  (cost=0.00..42021.35 rows=10507 width=0) (actual time=35.642..35.642 rows=10507 loops=1)
                             Index Cond: (id = ANY ('{ REMOVED_IDS }'::integer[]))
                             Buffers: shared hit=30410 read=1111
                 ->  Index Scan using test_tasks_on_project on tasks  (cost=0.57..263.85 rows=182 width=8) (actual time=0.012..0.018 rows=5 loops=10507)
                       Index Cond: (project_id = projects.id)
                       Buffers: shared hit=42697 read=12468
   ->  CTE Scan on visible_tasks  (cost=0.00..1131.12 rows=56556 width=0) (actual time=46.339..144.641 rows=48557 loops=1)
         Buffers: shared hit=73107 read=22301
 Planning:
   Buffers: shared hit=10 read=10
 Planning Time: 8.857 ms
 Execution Time: 156.102 ms

Proposed query

This is the same query structure, but instead of inserting the 10,000+ IDs directly into the visible_projects CTE, I have embedded the SQL that finds those IDs.

EXPLAIN (ANALYZE, BUFFERS)
WITH visible_projects AS NOT MATERIALIZED (
    SELECT
        id
    FROM
        "projects"
    WHERE
        "projects"."company_id" = 11171
        AND "projects"."state" < 6
        AND "projects"."is_template" = FALSE),
visible_tasks AS MATERIALIZED (
    SELECT
        tasks.id
    FROM
        tasks
    WHERE
        tasks.project_id IN (
            SELECT
                id
            FROM
                visible_projects))
SELECT
    COUNT(1)
FROM
    visible_tasks;

Query plan (depesz):

[The plan output is not reproduced here.]

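Even accounting for the sum of the first two queries, this takes 6 times as long as the current implementation.

I can see that the planner has chosen a Parallel Seq Scan on tasks, which is the main contributor to the run time. What I don't understand is why this was chosen, and what I should do to get back to using the index.

From my research, I understand that Postgres does not offer query hints to force index usage, so I assume a good solution will involve demonstrating to the query planner that using the index is beneficial.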
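One way to check whether the planner can produce the index plan at all, and how it costs it, is to discourage sequential scans for a single transaction. This is a diagnostic sketch only, not a fix: enable_seqscan is a standard planner setting, and the query is a flattened form of the proposed query above.

```sql
BEGIN;

-- Diagnostic only: makes sequential scans look very expensive to the
-- planner for this transaction; it does not forbid them outright.
SET LOCAL enable_seqscan = off;

EXPLAIN (ANALYZE, BUFFERS)
SELECT count(*)
FROM   tasks
WHERE  project_id IN (
         SELECT id
         FROM   projects
         WHERE  company_id = 11171
         AND    state < 6
         AND    is_template = FALSE);

ROLLBACK;  -- discards the SET LOCAL along with the transaction
```

Comparing the estimated cost of this plan with that of the default plan shows how far apart the planner thinks the two are.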

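In this question I use COUNT(1) together with the AS MATERIALIZED / AS NOT MATERIALIZED controls to produce a smaller example.

The larger query in the application does not use these, but it also applies some filtering to the tasks table before producing a number of other CTEs and some aggregate metrics as the final result.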

Schema

                                                 Table "public.projects"
           Column           |             Type              | Collation | Nullable |               Default
----------------------------+-------------------------------+-----------+----------+--------------------------------------
 id                         | integer                       |           | not null | nextval('projects_id_seq'::regclass)
 name                       | character varying(255)        |           |          |
 description                | text                          |           |          |
 due                        | timestamp without time zone   |           |          |
 created_at                 | timestamp without time zone   |           | not null |
 updated_at                 | timestamp without time zone   |           | not null |
 client_id                  | integer                       |           |          |
 company_id                 | integer                       |           |          |
 repeat                     | boolean                       |           | not null | true
 end_date                   | timestamp without time zone   |           |          |
 prev_id                    | integer                       |           |          |
 next_id                    | integer                       |           |          |
 completed_tasks_count      | integer                       |           | not null | 0
 tasks_count                | integer                       |           | not null | 0
 done_at                    | timestamp without time zone   |           |          |
 state                      | integer                       |           |          |
 schedule                   | text                          |           |          |
 start_date                 | timestamp without time zone   |           |          |
 manager_id                 | integer                       |           |          |
 partner_id                 | integer                       |           |          |
 exschedule                 | text                          |           |          |
 extdue                     | timestamp without time zone   |           |          |
 is_template                | boolean                       |           | not null | false
 predicted_duration         | integer                       |           |          | 0
 budget                     | integer                       |           |          | 0
 cached_effective_due_date  | timestamp without time zone   |           |          |
 cached_manager_fullname    | character varying(255)        |           |          | ''::character varying
 cached_partner_fullname    | character varying(255)        |           |          | ''::character varying
 cached_staffs_fullnames    | text                          |           |          | ''::text
 cached_staffs_ids          | text                          |           |          | ''::text
 cached_label_ids           | character varying(255)        |           |          | ''::character varying
 date_in                    | timestamp without time zone   |           |          |
 cached_label_sum           | integer                       |           |          | 0
 date_out                   | timestamp without time zone   |           |          |
 turn_around_time           | integer                       |           |          | 0
 dues_calculated_at         | timestamp without time zone   |           |          |
 dues                       | timestamp without time zone[] |           |          |
 dues_rewind                | integer[]                     |           |          |
 quickbooks_item_id         | integer                       |           |          |
 perform_final_review       | boolean                       |           | not null | false
 quickbooks_desktop_item_id | integer                       |           |          |
 billing_model_type         | character varying             |           | not null | 'staff'::character varying
 series_id                  | integer                       |           |          |
 shared                     | boolean                       |           |          | false
Indexes:
    "projects_pkey" PRIMARY KEY, btree (id)
    "index_projects_on_cached_effective_due_date" btree (cached_effective_due_date)
    "index_projects_on_client_id" btree (client_id)
    "index_projects_on_company_id" btree (company_id)
    "index_projects_on_manager_id" btree (manager_id)
    "index_projects_on_next_id" btree (next_id)
    "index_projects_on_partner_id" btree (partner_id)
    "index_projects_on_series_id" btree (series_id)
    "index_projects_on_shared_and_is_template" btree (shared, is_template) WHERE shared = true AND is_template = true
Foreign-key constraints:
    "fk_rails_243d23cb48" FOREIGN KEY (quickbooks_desktop_item_id) REFERENCES quickbooks_desktop_items(id)
    "fk_rails_33ba8711de" FOREIGN KEY (quickbooks_item_id) REFERENCES quickbooks_items(id)
    "fk_rails_fcf0ca7614" FOREIGN KEY (series_id) REFERENCES series(id) NOT VALID
Referenced by:
    TABLE "tasks" CONSTRAINT "tasks_project_id_fkey" FOREIGN KEY (project_id) REFERENCES projects(id)

The projects table has 14,273,833 rows.

  • 124,005 of them have is_template = true

                                               Table "public.tasks"
         Column          |            Type             | Collation | Nullable |              Default
-------------------------+-----------------------------+-----------+----------+-----------------------------------
 id                      | integer                     |           | not null | nextval('tasks_id_seq'::regclass)
 name                    | character varying(255)      |           |          |
 description             | text                        |           |          |
 duedate                 | timestamp without time zone |           |          |
 created_at              | timestamp without time zone |           | not null |
 updated_at              | timestamp without time zone |           | not null |
 project_id              | integer                     |           | not null |
 done                    | boolean                     |           | not null | false
 position                | integer                     |           |          |
 done_at                 | timestamp without time zone |           |          |
 dueafter                | integer                     |           |          |
 done_by_user_id         | integer                     |           |          |
 predicted_duration      | integer                     |           |          |
 auto_predicted_duration | integer                     |           |          | 0
 assignable_id           | integer                     |           |          |
 assignable_type         | character varying           |           |          |
 will_assign_to_client   | boolean                     |           | not null | false
Indexes:
    "tasks_pkey" PRIMARY KEY, btree (id)
    "index_tasks_on_assignable_type_and_assignable_id" btree (assignable_type, assignable_id)
    "index_tasks_on_done_by_user_id" btree (done_by_user_id)
    "index_tasks_on_duedate" btree (duedate)
    "test_tasks_on_project" btree (project_id)
Foreign-key constraints:
    "tasks_project_id_fkey" FOREIGN KEY (project_id) REFERENCES projects(id)

The tasks table has 76,716,433 rows.

System specs

  • PostgreSQL 13.1
  • 2.9 GHz 6-core Intel Core i9
  • 32 GB RAM
  • macOS 10.15.7

Erw*_*ter 3

The main reason for the different query plan is probably the higher number of rows that Postgres estimates will be returned from projects:

(cost=0.00..42021.35 rows=10507 width=0) (actual time=35.642..35.642 rows=10507 loops=1)

(cost=0.43..277961.56 rows=31322 width=4) (actual time=0.591..6970.696 rows=10507 loops=3)

That is an overestimate by a factor of 3, which is not dramatic, but apparently enough to favor a different (inferior) query plan.
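To look at that estimate in isolation, the filter on projects can be run on its own. This is a small sketch using the predicates from the question; plain EXPLAIN shows the planner's estimate without executing the query, and the count gives the real number to compare against.

```sql
-- Planner's estimate: read rows= on the top plan node.
EXPLAIN
SELECT id
FROM   projects
WHERE  company_id = 11171
AND    state < 6
AND    is_template = FALSE;

-- Actual number of matching rows, for comparison.
SELECT count(*)
FROM   projects
WHERE  company_id = 11171
AND    state < 6
AND    is_template = FALSE;
```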

Assuming projects.is_template is mostly false, I suggest these multicolumn indexes:

CREATE INDEX ON projects(company_id, state);

Equality first, range later.

You can also try increasing the statistics target for company_id and state, then ANALYZE the table to get better estimates.
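A minimal sketch of raising the statistics target, assuming a value of 1000 (an illustrative choice, not a recommendation from the answer):

```sql
-- 1000 is an illustrative value; the default target is 100.
ALTER TABLE projects ALTER COLUMN company_id SET STATISTICS 1000;
ALTER TABLE projects ALTER COLUMN state      SET STATISTICS 1000;

-- The new target only takes effect once statistics are gathered again.
ANALYZE projects;
```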

And:

CREATE INDEX ON tasks (project_id, id);

Also increase the statistics target for tasks.project_id and run ANALYZE.
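The same idea for tasks might look like the sketch below. The target of 1000 is again an assumed, illustrative value; the pg_stats query is only there to inspect what the planner now knows about the column.

```sql
-- 1000 is an illustrative value; the default target is 100.
ALTER TABLE tasks ALTER COLUMN project_id SET STATISTICS 1000;
ANALYZE tasks;

-- Inspect the gathered statistics for the column.
SELECT attname, n_distinct, null_frac, correlation
FROM   pg_stats
WHERE  schemaname = 'public'
AND    tablename  = 'tasks'
AND    attname    = 'project_id';
```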

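In both cases, the multicolumn index can replace the existing index on just projects.company_id or tasks.project_id, respectively. Since all columns involved are integer, the indexes will be the same size, except for the effect of index deduplication (added in Postgres 13), which shows strongly in your test for the highly duplicative tasks.project_id.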
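Replacing the single-column indexes could be sketched as follows. The new index names are hypothetical; the old names come from the schema in the question. CONCURRENTLY avoids blocking writes but cannot be run inside a transaction block.

```sql
-- Hypothetical names for the new multicolumn indexes.
CREATE INDEX CONCURRENTLY projects_company_id_state_idx
    ON projects (company_id, state);
CREATE INDEX CONCURRENTLY tasks_project_id_id_idx
    ON tasks (project_id, id);

-- Compare sizes of old and new indexes before dropping anything.
SELECT pg_size_pretty(pg_relation_size('index_projects_on_company_id'))  AS projects_old,
       pg_size_pretty(pg_relation_size('projects_company_id_state_idx')) AS projects_new,
       pg_size_pretty(pg_relation_size('test_tasks_on_project'))         AS tasks_old,
       pg_size_pretty(pg_relation_size('tasks_project_id_id_idx'))       AS tasks_new;

-- Once the new indexes are confirmed to be used, drop the old ones.
DROP INDEX CONCURRENTLY index_projects_on_company_id;
DROP INDEX CONCURRENTLY test_tasks_on_project;
```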

As for the query itself:

SELECT t.id
FROM   projects p
JOIN   tasks t ON t.project_id = p.id
WHERE  p.company_id = 11171
AND    p.state < 6
AND    p.is_template = FALSE;

A plain join like the one above should be faster.
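To compare against the timings in the question, the join can be wrapped in the same count and EXPLAIN options. This is a sketch rather than a measured result. Because projects.id is the primary key, the join returns each task at most once, so the count matches the IN (...) form.

```sql
EXPLAIN (ANALYZE, BUFFERS)
SELECT count(*)
FROM   projects p
JOIN   tasks t ON t.project_id = p.id
WHERE  p.company_id = 11171
AND    p.state < 6
AND    p.is_template = FALSE;
```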