How to speed up a query with anti-joins

Question

Gar*_*ett 9 postgresql performance index join query-performance

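I have a query with 2 anti-joins (UserEmails = 1M+ rows and Subscriptions = <100k rows), 2 conditions, and a sort. I created an index for the 2 conditions plus the sort, which sped the query up by 50%. Both anti-joins have indexes. However, the query is still too slow (4 seconds in production).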

Here is the query:

SELECT
    "Users"."firstName",
    "Users"."lastName",
    "Users"."email",
    "Users"."id"
FROM
    "Users"
WHERE
    NOT EXISTS (
        SELECT
            1
        FROM
            "UserEmails"
        WHERE
            "UserEmails"."userId" = "Users"."id"
    )
AND NOT EXISTS (
    SELECT
        1
    FROM
        "Subscriptions"
    WHERE
        "Subscriptions"."userId" = "Users"."id"
)
AND "isEmailVerified" = TRUE
AND "emailUnsubscribeDate" IS NULL
ORDER BY
    "Users"."createdAt" DESC
LIMIT 100

Here is the EXPLAIN ANALYZE output:

Limit  (cost=1.28..177.77 rows=100 width=49) (actual time=6171.121..6171.850 rows=100 loops=1)
  ->  Nested Loop Anti Join  (cost=1.28..4665810.76 rows=2643614 width=49) (actual time=6171.119..6171.807 rows=100 loops=1)
        ->  Nested Loop Anti Join  (cost=0.86..3470216.17 rows=2707688 width=49) (actual time=0.809..6062.152 rows=28607 loops=1)
              ->  Index Scan using users_email_subscribers_idx on "Users"  (cost=0.43..1844686.50 rows=3312999 width=49) (actual time=0.055..2342.793 rows=1186607 loops=1)
              ->  Index Only Scan using "UserEmails_userId_emailId_key" on "UserEmails"  (cost=0.43..0.49 rows=1 width=4) (actual time=0.002..0.002 rows=1 loops=1186607)
                    Index Cond: ("userId" = "Users".id)
                    Heap Fetches: 1153034
        ->  Index Only Scan using "Subscriptions_userId_type_key" on "Subscriptions"  (cost=0.42..0.44 rows=1 width=4) (actual time=0.003..0.003 rows=1 loops=28607)
              Index Cond: ("userId" = "Users".id)
              Heap Fetches: 28507
Planning time: 2.346 ms
Execution time: 6171.963 ms

The index that sped the query up by 50% is the one built for the 2 conditions and the sort; it appears in the plan above as users_email_subscribers_idx, used for the scan on "Users".

Edit: I should also mention that users_email_subscribers_idx shows an index scan rather than an index-only scan because the index is being updated regularly.

Answer 1

jja*_*nes 5

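Your best bet is probably to tackle this at the application level. This looks like a query that is run as part of a data clean-up exercise. If so, why do you care that it takes 6 seconds to run, and why do you limit it to 100 rows rather than reading all of them in one go? Maybe you could use a materialized view or some other caching mechanism. If you reject that option, read on for some “second best” options.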
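
The materialized-view idea could look roughly like the sketch below. The view name users_no_emails_no_subs and both index names are made up for illustration, "id" is assumed to be unique in "Users", and how often to refresh is left to the application. REFRESH ... CONCURRENTLY needs PostgreSQL 9.4 or later and a unique index on the view.

```sql
-- Hypothetical view name; the SELECT is the question's query without ORDER BY / LIMIT.
CREATE MATERIALIZED VIEW users_no_emails_no_subs AS
SELECT u."id", u."firstName", u."lastName", u."email", u."createdAt"
FROM "Users" u
WHERE NOT EXISTS (SELECT 1 FROM "UserEmails" ue WHERE ue."userId" = u."id")
  AND NOT EXISTS (SELECT 1 FROM "Subscriptions" s WHERE s."userId" = u."id")
  AND u."isEmailVerified" = TRUE
  AND u."emailUnsubscribeDate" IS NULL;

-- A unique index is required before the view can be refreshed CONCURRENTLY.
CREATE UNIQUE INDEX users_no_emails_no_subs_id_idx
    ON users_no_emails_no_subs ("id");

-- Lets the "newest 100" read below walk an index instead of sorting.
CREATE INDEX users_no_emails_no_subs_created_idx
    ON users_no_emails_no_subs ("createdAt" DESC);

-- Run on whatever schedule the application can tolerate (cron, job queue, ...).
REFRESH MATERIALIZED VIEW CONCURRENTLY users_no_emails_no_subs;

-- The application then reads the cached result.
SELECT "firstName", "lastName", "email", "id"
FROM users_no_emails_no_subs
ORDER BY "createdAt" DESC
LIMIT 100;
```

The trade-off is staleness: rows reflect the state at the last refresh, so this only fits if the consumer can tolerate slightly old data.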

“I should also mention that users_email_subscribers_idx shows an index scan rather than an index-only scan because the index is being updated regularly.”

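That is not the reason. You need columns from the Users table that are not in the index, such as firstName and id. If you created an index with all of those columns at the end of its column list, you would get an index-only scan. That might make the query 20% faster, but it will not make it 99% faster.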
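
A sketch of such an index follows. The definition of users_email_subscribers_idx is not shown in the question, so the leading key column and the partial-index predicate here are guesses based on the plan (no Sort node and no Filter on the "Users" scan); the index name is hypothetical. Index-only scans on a partial index whose predicate columns are not in the key need PostgreSQL 9.6 or later.

```sql
-- Hypothetical name; key order and WHERE clause are assumptions, not the real index.
-- "createdAt" leads so the ORDER BY ... LIMIT can still walk the index in order;
-- the selected columns are appended at the end so the heap need not be visited.
-- Note: CREATE INDEX CONCURRENTLY cannot run inside a transaction block.
CREATE INDEX CONCURRENTLY users_email_subscribers_covering_idx
    ON "Users" ("createdAt" DESC, "id", "firstName", "lastName", "email")
    WHERE "isEmailVerified" = TRUE
      AND "emailUnsubscribeDate" IS NULL;

-- Index-only scans skip the heap only for pages marked all-visible,
-- so vacuum the table after building the index.
VACUUM (ANALYZE) "Users";
```

On PostgreSQL 11 or later, the extra columns could instead go in an INCLUDE clause, which keeps them out of the index key.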

               Heap Fetches: 1153034

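You need to vacuum UserEmails more aggressively. Again, this will not be a 99% improvement, but it should help somewhat. Autovacuum does a poor job of keeping tables vacuumed well enough to optimize index-only scans. You can run manual vacuums. Or you could try to coerce autovacuum into doing better by lowering the per-table setting of "autovacuum_vacuum_scale_factor" to zero and then using the per-table "autovacuum_vacuum_threshold" to control the vacuuming. If rows are updated randomly throughout the table, I would set "autovacuum_vacuum_threshold" to about 1/20 of the number of blocks in the table.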
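
As a rough sketch of both options, assuming the table is named "UserEmails" as in the question; the value 5000 is only a placeholder and should be replaced with the number computed from relpages:

```sql
-- Option 1: a one-off manual vacuum. This also updates the visibility map,
-- which is what lets index-only scans avoid heap fetches.
VACUUM (VERBOSE, ANALYZE) "UserEmails";

-- relpages is the table's block count as of the last VACUUM / ANALYZE.
SELECT relpages,
       relpages / 20 AS candidate_threshold
FROM pg_class
WHERE oid = '"UserEmails"'::regclass;

-- Option 2: per-table autovacuum settings. With the scale factor at zero,
-- autovacuum triggers purely on the absolute threshold.
-- 5000 is a placeholder, not a recommendation.
ALTER TABLE "UserEmails" SET (
    autovacuum_vacuum_scale_factor = 0,
    autovacuum_vacuum_threshold = 5000
);
```

After the next vacuum, rerunning EXPLAIN ANALYZE should show whether the "Heap Fetches" figure for the "UserEmails" index-only scan has dropped.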

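As an experiment, how does the query perform after set enable_nestloop to off? That might give you hash anti joins, and if your version is new enough you might get parallel versions of them.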
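
One way to run that experiment without affecting other sessions is to scope the setting to a transaction. Note that the PostgreSQL parameter is spelled enable_nestloop. This is a diagnostic sketch only, not something to leave enabled in production:

```sql
BEGIN;

-- SET LOCAL lasts only until the end of this transaction.
SET LOCAL enable_nestloop TO off;

EXPLAIN (ANALYZE, BUFFERS)
SELECT
    "Users"."firstName",
    "Users"."lastName",
    "Users"."email",
    "Users"."id"
FROM "Users"
WHERE NOT EXISTS (
        SELECT 1 FROM "UserEmails"
        WHERE "UserEmails"."userId" = "Users"."id"
    )
  AND NOT EXISTS (
        SELECT 1 FROM "Subscriptions"
        WHERE "Subscriptions"."userId" = "Users"."id"
    )
  AND "isEmailVerified" = TRUE
  AND "emailUnsubscribeDate" IS NULL
ORDER BY "Users"."createdAt" DESC
LIMIT 100;

ROLLBACK;
```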

Answer 2

Ark*_*ena 3

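There is a huge gap between the row counts the planner estimated and the actual row counts. That means the planner chose its plan based on bad information.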

For example, Nested Loop Anti Join (cost=0.86..3470216.17 rows=2707688 width=49) (actual time=0.809..6062.152 rows=28607 loops=1) means it expected to get 2,707,688 rows but actually got 28,607.

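Either your statistics are inaccurate (which I would bet on if you have never tuned the autovacuum settings for those huge tables), or you have a column that depends on another column that is not part of the key (a third normal form violation).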

To refresh the statistics more often, you can tune the autovacuum settings for those big tables. I strongly recommend reading this blog post on autovacuum tuning.
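
A minimal sketch of what that tuning might involve, using the table names from the question; the 0.01 values are illustrative placeholders rather than recommendations:

```sql
-- See when statistics were last gathered, manually or by autovacuum.
SELECT relname, last_analyze, last_autoanalyze, n_live_tup, n_dead_tup
FROM pg_stat_user_tables
WHERE relname IN ('Users', 'UserEmails', 'Subscriptions');

-- Refresh planner statistics right now.
ANALYZE "Users";
ANALYZE "UserEmails";

-- Make autovacuum analyze and vacuum the big table after a smaller
-- fraction of its rows has changed (placeholder values).
ALTER TABLE "Users" SET (
    autovacuum_analyze_scale_factor = 0.01,
    autovacuum_vacuum_scale_factor = 0.01
);
```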

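If your model violates third normal form, you can either correct the model (costlier, but better in the long run) or have the planner collect statistics on the correlated columns (see the create statistics documentation here).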
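
For illustration, extended statistics are declared as below (PostgreSQL 10 or later). Which columns are actually correlated is not stated in the question; the pair used here is only an assumption, and the statistics name is hypothetical:

```sql
-- Hypothetical statistics object; the column pair is a guess at a
-- functional dependency, not something confirmed by the question.
CREATE STATISTICS users_email_flags_dep (dependencies)
    ON "isEmailVerified", "emailUnsubscribeDate"
    FROM "Users";

-- Extended statistics are only populated by the next ANALYZE.
ANALYZE "Users";
```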

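This is not necessarily a statistics or estimation problem. The estimated row count assumes the node runs to completion, whereas the actual row count reflects early termination once LIMIT 100 was reached. (4 upvotes)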
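
One way to check this, sketched under the assumption that a full pass over "Users" is affordable on a test copy: run the inner anti join to completion, without a LIMIT, and compare the estimated and actual row counts on the anti join node. The planner may pick a different join method here, but the row estimate for the anti join remains comparable.

```sql
-- No LIMIT, so every node runs to completion and "actual rows" is the
-- true count rather than the count at early termination.
-- This reads the whole table and can take a long time.
EXPLAIN (ANALYZE)
SELECT count(*)
FROM "Users" u
WHERE NOT EXISTS (
        SELECT 1 FROM "UserEmails" ue
        WHERE ue."userId" = u."id"
    )
  AND u."isEmailVerified" = TRUE
  AND u."emailUnsubscribeDate" IS NULL;
```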

Archived:	6 years, 7 months ago
Views:	8658
Last recorded:	6 years, 7 months ago