Fastest query for distinct IDs in a many-to-many relationship

Question

St.*_*rio 6 postgresql performance count distinct postgresql-performance

I have this table in PostgreSQL 9.4:

CREATE TABLE user_operations(
    id SERIAL PRIMARY KEY,
    operation_id integer,
    user_id integer
);

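The table consists of ~1000-2000 different operations, each of which corresponds to some subset (of approximately 80000-120000 elements each) of the set S of all users: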

S = {1, 2, 3, ... , 122655}

Parameters:

work_mem = 128MB
table_size = 880MB

I also have an index on operation_id.

Question: What is the best plan for querying all distinct user_id values for a significant part of the operation_id set (20%-60%), for example:

SELECT DISTINCT user_id FROM user_operation WHERE operation_id < 500

It is possible to create more indexes on the table. Currently, the plan of the query is:

HashAggregate  (cost=196173.56..196347.14 rows=17358 width=4) (actual time=1227.408..1359.947 rows=598336 loops=1)
  ->  Bitmap Heap Scan on user_operation  (cost=46392.24..189978.17 rows=2478155 width=4) (actual time=233.163..611.182 rows=2518122 loops=1)
        Recheck Cond: (operation_id < 500)
        ->  Bitmap Index Scan on idx  (cost=0.00..45772.70 rows=2478155 width=0) (actual time=230.432..230.432 rows=2518122 loops=1)
              Index Cond: (operation_id < 500)

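Is such a query plan really optimal in these circumstances? I mean, I am not sure that using a Bitmap Heap Scan is the right choice here. I would appreciate any references to relevant articles.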

Answer 1

Erw*_*ter 4

"What is the best plan for querying all distinct user_id values for a significant part of the operation_id set (20%-60%)?"

Use a recursive query:

WITH RECURSIVE cte AS (
   (  -- parentheses are required
   SELECT user_id
   FROM   user_operations
   WHERE  operation_id < 500
   ORDER  BY user_id
   LIMIT  1
   )
   UNION ALL
   SELECT u.user_id
   FROM   cte, LATERAL (
      SELECT user_id
      FROM   user_operations
      WHERE  operation_id < 500
      AND    user_id > cte.user_id  -- lateral reference
      ORDER  BY user_id
      LIMIT  1
      ) u
   )
TABLE cte;

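Combine this with an index on the columns (user_id, operation_id), in that order. I expect index scans that filter on the second column. Reasonably accurate table statistics are important, so that Postgres knows it only has to skip a few rows in the index to find the next user_id. Generally, one might want to increase the statistics target for operation_id in particular: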

ALTER TABLE user_operations ALTER operation_id SET STATISTICS 1000;

Since there are only ~1000-2000 different operations, this may not even be necessary, but it is a small price to pay.
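
The index itself is not spelled out above. A minimal sketch, assuming a placeholder index name (user_operations_user_id_operation_id_idx is not from the original post), followed by an ANALYZE so that the raised statistics target is actually reflected in the planner statistics:

```sql
-- Index name is a placeholder; the column order (user_id first) is what matters.
CREATE INDEX user_operations_user_id_operation_id_idx
    ON user_operations (user_id, operation_id);

-- A changed statistics target only takes effect at the next ANALYZE.
ANALYZE user_operations;
```

With user_id as the leading column, each step of the recursive query can look up the next larger user_id in the index and check operation_id < 500 against the second index column.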

Details:

Optimizing queries on a range of timestamps (two columns)

If the predicate operation_id < 500 is stable (always the same), make it a partial index on just (user_id) instead:

CREATE INDEX foo ON user_operations (user_id) WHERE operation_id < 500;

Then the statistics on operation_id are no longer relevant to this query.
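
One way to check that the partial index is picked up is to run EXPLAIN on a single step of the recursive query. This is only a sketch: the planner can use a partial index only when the query's WHERE clause implies the index predicate, so the condition has to stay in the query even though the index already encodes it.

```sql
-- Sketch: inspect the plan for the first step of the recursive query.
-- Note that ANALYZE actually executes the statement.
EXPLAIN (ANALYZE, BUFFERS)
SELECT user_id
FROM   user_operations
WHERE  operation_id < 500   -- must imply the predicate of index foo
ORDER  BY user_id
LIMIT  1;
```

If the plan shows an index scan on foo, the same should hold for the lateral subquery inside the recursive CTE, which uses the same predicate plus user_id > cte.user_id.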

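Even if the predicate is not stable, there may be ways to optimize, depending on the full range of possible conditions and value frequencies.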

Performance should be ... delicious.

I optimized the technique in this related answer on SO (with a detailed explanation):

Optimize GROUP BY query to retrieve latest row per user

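If you have a separate users table, and a large part of all users can be found in your sample, even faster query styles are possible. Details are in the linked answer.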
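
As a rough illustration of that idea (not necessarily the exact query from the linked answer), and assuming a hypothetical table users with a unique user_id column, the distinct set can be computed with one index probe per user instead of aggregating millions of rows:

```sql
-- Assumption: a table users(user_id) exists; it is not part of the original schema.
SELECT u.user_id
FROM   users u
WHERE  EXISTS (
   SELECT 1
   FROM   user_operations o
   WHERE  o.user_id = u.user_id
   AND    o.operation_id < 500
   );
```

Each user_id appears at most once in the result because users is assumed to hold one row per user, so no DISTINCT is needed. An index on (user_id, operation_id), or the partial index on (user_id), lets each EXISTS check stop at the first matching row.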
