Optimizing a query on two large tables

Question


Iva*_*Paz 4 postgresql optimization index-tuning query-performance

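My system has a very important query that takes too long to execute because of the large amount of data in the tables. I am a junior DBA and need to optimize it as well as possible. Each table has about 80 million rows.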

The tables are:

tb_pd:

   Column            |  Type   | Modifiers | Storage | Stats target | Description 
---------------------+---------+-----------+---------+--------------+-------------
 pd_id               | integer | not null  | plain   |              | 
 st_id               | integer |           | plain   |              | 
 status_id           | integer |           | plain   |              | 
 next_execution_date | bigint  |           | plain   |              | 
 priority            | integer |           | plain   |              | 
 is_active           | integer |           | plain   |              | 
Indexes:
    "pk_pd" PRIMARY KEY, btree (pd_id)
    "idx_pd_order" btree (priority, next_execution_date)
    "idx_pd_where" btree (status_id, next_execution_date, is_active)
Foreign-key constraints:
    "fk_st" FOREIGN KEY (st_id) REFERENCES tb_st(st_id)


tb_st:

 Column |          Type          | Modifiers | Storage  | Stats target | Description 
--------+------------------------+-----------+----------+--------------+-------------
 st_id  | integer                | not null  | plain    |              | 
 st     | character varying(500) |           | extended |              | 
Indexes:
    "pk_st" PRIMARY KEY, btree (st_id)
Referenced by:
    TABLE "tb_pd" CONSTRAINT "fk_st" FOREIGN KEY (st_id) REFERENCES tb_st(st_id)


My query is:

select s.st                                               
from tb_pd p inner join
     tb_st s on p.st_id = s.st_id
where p.status_id = 1 and
      p.next_execution_date < 1401402110830 and
      p.is_active = 1
order by priority, next_execution_date
limit 20000;


With the indexes I have, the best plan I get is:

Limit  (cost=1.14..263388.65 rows=20000 width=45)
   ->  Nested Loop  (cost=1.14..456016201.43 rows=34627017 width=45)
         ->  Index Scan using idx_pd_order on tb_pd p  (cost=0.57..161388942.77 rows=34627017 width=16)
               Index Cond: (next_execution_date < 1401402110830::bigint)
               Filter: ((status_id = 1) AND (is_active = 1))
         ->  Index Scan using pk_st on tb_st s  (cost=0.57..8.50 rows=1 width=37)
               Index Cond: (st_id = p.st_id)


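I do not understand the EXPLAIN output very well, but it is not using idx_pd_where to filter on the WHERE clause, even though idx_pd_where contains all the columns used in the WHERE clause.

More information about the data:
status_id is 1 for 95% of the rows
is_active is 1 for 90% of the rows
next_execution_date is in milliseconds and varies a lot. The value it is compared to is the moment of execution (current time in milliseconds).

Should I create a separate index for each filtered column, or use a different type of index? Or maybe change some configuration of the DBMS?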

Answer 1

Erw*_*ter 9

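This is a tricky one. Your main condition is on next_execution_date, but the output is sorted by priority first. The conditions on status_id and is_active play only a minor role.

Better index

Your index idx_pd_order is of little help, because filtering on a non-leading column of a multicolumn index is not very efficient. Postgres uses it anyway, since that is still much better than a sequential scan. Details here:
Is a composite index also good for queries on the first field?

idx_pd_where would probably be the better choice, but it is not a good one either. The leading column status_id is hardly selective and only bloats the index. The same goes for the trailing column is_active. And priority is not in the index and has to be fetched from the table, so an index-only scan is not possible.

I suggest this partial, multicolumn index to start with. (But keep reading!)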

CREATE INDEX idx_pd_covering ON tb_pd (next_execution_date, priority, st_id)
WHERE  status_id = 1 AND is_active = 1


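Since we are only interested in rows with status_id = 1 and is_active = 1, exclude all other rows from the index right away. Size does matter.
The remaining (crucial) condition is on next_execution_date, which must come first in the index.
priority and st_id are only appended to make index-only scans possible (Postgres 9.2+). If that does not work out, remove those columns from the index to make it smaller.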
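
One way to check whether the index-only scan actually happens is sketched below. It assumes idx_pd_covering has been created as shown above; the test query itself is only an illustration and not part of the original answer. Look for "Index Only Scan using idx_pd_covering" and a low "Heap Fetches" count in the output. Some Postgres versions may not pick an index-only scan when the WHERE clause references columns that appear only in the index predicate.

```sql
-- Refresh statistics and the visibility map; index-only scans depend on the latter.
VACUUM ANALYZE tb_pd;

-- Illustrative probe: selects only columns that are stored in idx_pd_covering.
EXPLAIN (ANALYZE, BUFFERS)
SELECT st_id, priority, next_execution_date
FROM   tb_pd
WHERE  status_id = 1
AND    is_active = 1
AND    next_execution_date < 1401402110830
ORDER  BY next_execution_date
LIMIT  20000;   -- keeps the probe cheap
```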

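Special difficulty

We can now use idx_pd_covering to find qualifying rows quickly. Unfortunately, we still have to look at all qualifying rows to collect the ones with the top priority. As the query plan shows, Postgres estimates that it has to process 34627017 rows. Sorting 35M rows is going to be expensive. This is the tricky part I mentioned at the start. To see what I am talking about, run EXPLAIN on your query with and without priority in the ORDER BY: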

SELECT s.st                                               
FROM   tb_pd p
JOIN   tb_st s USING (st_id)
WHERE  p.status_id = 1
AND    p.is_active = 1
AND    p.next_execution_date < 1401402110830
ORDER  BY priority, next_execution_date
LIMIT  20000;


That is your query, with only slightly simplified formatting. You should see a huge difference.
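
Spelled out, the comparison could look like the sketch below. The second statement returns different rows, so it serves only as a diagnostic to show how much of the cost comes from sorting by priority first.

```sql
-- Plan 1: the original sort order, priority first.
EXPLAIN
SELECT s.st
FROM   tb_pd p
JOIN   tb_st s USING (st_id)
WHERE  p.status_id = 1
AND    p.is_active = 1
AND    p.next_execution_date < 1401402110830
ORDER  BY p.priority, p.next_execution_date
LIMIT  20000;

-- Plan 2: same filter, sorted by next_execution_date only (diagnostic only).
EXPLAIN
SELECT s.st
FROM   tb_pd p
JOIN   tb_st s USING (st_id)
WHERE  p.status_id = 1
AND    p.is_active = 1
AND    p.next_execution_date < 1401402110830
ORDER  BY p.next_execution_date
LIMIT  20000;
```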

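Solution

The solution depends on the number of distinct values of priority. For lack of information, and for demonstration purposes, I will assume just three: priorities 1, 2 and 3.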
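
To replace that assumption with real numbers, a query along these lines lists the distinct priorities among qualifying rows and how many rows each one has. It is a one-off check and will be slow on a table of this size.

```sql
-- How many distinct priorities exist, and how many qualifying rows per priority?
SELECT priority, count(*) AS qualifying_rows
FROM   tb_pd
WHERE  status_id = 1
AND    is_active = 1
AND    next_execution_date < 1401402110830
GROUP  BY priority
ORDER  BY priority;
```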

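With only a few distinct priority values, there is a simple solution. Create three partial indexes. All of them together are still smaller than your current index idx_pd_order or idx_pd_where (which you probably do not need any more).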

CREATE INDEX idx_pd_covering_p1 ON tb_pd (next_execution_date, st_id)
WHERE  priority = 1 AND status_id = 1 AND is_active = 1;

CREATE INDEX idx_pd_covering_p2 ON tb_pd (next_execution_date, st_id)
WHERE  priority = 2 AND status_id = 1 AND is_active = 1;

CREATE INDEX idx_pd_covering_p3 ON tb_pd (next_execution_date, st_id)
WHERE  priority = 3 AND status_id = 1 AND is_active = 1;


Use this query:

SELECT s.st
FROM  (
   (
   SELECT st_id
   FROM   tb_pd
   WHERE  status_id = 1
   AND    is_active = 1
   AND    priority  = 1
   AND    next_execution_date < 1401402110830
   ORDER  BY next_execution_date
   )
   UNION ALL
   (
   SELECT st_id
   FROM   tb_pd
   WHERE  status_id = 1
   AND    is_active = 1
   AND    priority  = 2
   AND    next_execution_date < 1401402110830
   ORDER  BY next_execution_date
   )
   UNION ALL
   (
   ...
   AND    priority  = 3
   ...
   )
   LIMIT  20000
   ) p
JOIN   tb_st s USING (st_id);


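This should be dynamite.

Strictly speaking, the final order is not guaranteed without an additional ORDER BY clause in the outer query. In the current implementation, the order of the inner query is preserved as long as the outer query is this simple. To be sure, you could join right away (which might be a bit slower):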

(
SELECT s.st
FROM   tb_pd p
JOIN   tb_st s USING (st_id)
WHERE  p.status_id = 1
AND    p.is_active = 1
AND    p.priority  = 1
AND    p.next_execution_date < 1401402110830
ORDER  BY p.next_execution_date
)
UNION ALL
(
...
)
LIMIT  20000;


.. or go all the way and order by priority and next_execution_date once more in the outer query (to be absolutely sure), which may be slower still.
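
A sketch of that last variant, still assuming exactly three priorities. Two details are additions for illustration and not part of the original answer: priority is emitted as a constant per leg so that the partial indexes on (next_execution_date, st_id) can still cover each subquery, and each leg gets its own LIMIT because the final ORDER BY prevents the query from stopping early.

```sql
SELECT s.st
FROM  (
   (
   SELECT st_id, 1 AS priority, next_execution_date
   FROM   tb_pd
   WHERE  status_id = 1
   AND    is_active = 1
   AND    priority  = 1
   AND    next_execution_date < 1401402110830
   ORDER  BY next_execution_date
   LIMIT  20000   -- assumption: bound each leg, no leg can contribute more rows
   )
   UNION ALL
   (
   SELECT st_id, 2 AS priority, next_execution_date
   FROM   tb_pd
   WHERE  status_id = 1
   AND    is_active = 1
   AND    priority  = 2
   AND    next_execution_date < 1401402110830
   ORDER  BY next_execution_date
   LIMIT  20000
   )
   UNION ALL
   (
   SELECT st_id, 3 AS priority, next_execution_date
   FROM   tb_pd
   WHERE  status_id = 1
   AND    is_active = 1
   AND    priority  = 3
   AND    next_execution_date < 1401402110830
   ORDER  BY next_execution_date
   LIMIT  20000
   )
   ) p
JOIN   tb_st s USING (st_id)
ORDER  BY p.priority, p.next_execution_date   -- explicit final order
LIMIT  20000;
```

The price is a sort of up to 60000 rows in the outer query, which is small compared to sorting all qualifying rows.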

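All parentheses are required! Related answer.
This query simply reads tuples from the top of the partial indexes above; no sort step is needed at all. All rows are pre-sorted, which makes this very efficient.
A UNION ALL query without a final ORDER BY can stop as soon as the number of rows requested by the top-level LIMIT has been fetched. So if there are enough rows with the top priority, the subsequent legs of the UNION ALL query are never executed. That way, only the smaller partial indexes need to be touched.
The JOIN to tb_st happens afterwards, which should be more efficient.
Again, the column st_id is only appended to the index in the hope of getting index-only scans. If that works for you, the whole query does not touch the table tb_pd at all.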

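General solution for any number of distinct `priority` values

We have worked on this problem before. There is a complete recipe for automating the creation of the partial indexes, plus a function, the works:
Can spatial index help a "range - order by - limit" query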
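
The linked recipe is not reproduced here. As a rough idea of what automating the index creation can look like, here is a minimal sketch of my own: a DO block that creates one partial index per distinct priority. The index naming scheme is hypothetical, and CREATE INDEX IF NOT EXISTS needs Postgres 9.5 or later.

```sql
DO
$$
DECLARE
   _prio int;
BEGIN
   -- One pass over the qualifying rows to find the distinct priorities.
   FOR _prio IN
      SELECT DISTINCT priority
      FROM   tb_pd
      WHERE  status_id = 1
      AND    is_active = 1
      AND    priority IS NOT NULL
      ORDER  BY 1
   LOOP
      -- %I quotes the (hypothetical) index name, %s inserts the integer literal.
      EXECUTE format(
         'CREATE INDEX IF NOT EXISTS %I ON tb_pd (next_execution_date, st_id)
          WHERE priority = %s AND status_id = 1 AND is_active = 1'
       , 'idx_pd_covering_p' || _prio, _prio);
   END LOOP;
END
$$;
```

The matching UNION ALL query would have to be generated in the same way, one leg per priority.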

Optimize the table

Since you are trying to optimize performance and your tables are big, I suggest a slightly altered table layout for tb_pd:

   Column            |  Type
---------------------+--------
 pd_id               | integer
 st_id               | integer
 next_execution_date | bigint
 priority            | integer  -- or smallint? -- or "char"?
 status_id           | smallint -- or "char"
 is_active           | boolean


This occupies 52 bytes per row on disk, while your current design needs 60 bytes. Indexes profit as well. Details:
Configuring PostgreSQL for read performance
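
Changing the column order means rewriting the table. A minimal sketch of one way to do it follows; the name tb_pd_new is hypothetical, and the casts assume that every existing status_id fits into smallint and that is_active only holds 0, 1 or NULL.

```sql
-- New layout: 8-byte column aligned after two 4-byte columns, small types last.
CREATE TABLE tb_pd_new (
   pd_id               integer PRIMARY KEY
 , st_id               integer REFERENCES tb_st(st_id)
 , next_execution_date bigint
 , priority            integer
 , status_id           smallint
 , is_active           boolean
);

-- Copy the data; the cast raises an error if a status_id is out of smallint range.
INSERT INTO tb_pd_new (pd_id, st_id, next_execution_date, priority, status_id, is_active)
SELECT pd_id, st_id, next_execution_date, priority, status_id::smallint, is_active = 1
FROM   tb_pd;
```

With a boolean is_active, the index predicates above would read "AND is_active" instead of "AND is_active = 1". Creating the indexes after the bulk load is usually faster than maintaining them during it.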

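Of course, all the basic advice for performance optimization applies as well.

About "char":

The type "char" (note the quotes) differs from char(1) in that it only uses one byte of storage. It is used internally in the system catalogs as a simplistic enumeration type.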

Asked:	12 years ago
Viewed:	5466 times
Last activity:	7 years, 7 months ago
