Speeding up count queries on millions of rows

Question

dvc*_*crn 4 postgresql performance index count postgresql-performance

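Imagine a database full of products. A product belongs to exactly one collection and is created by a user. Rough size of the database:

Products: 52.000.000
Collections: 9.000.000
Users: about 9.000.000

I am trying to retrieve the number of products and collections each user owns, as well as the number of products in each collection (this information should be generated every x days and indexed in ElasticSearch).

For the users query, I am currently doing something like this: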

      SELECT
        users.*,
        (SELECT
          count(*)
        FROM
          products product
        WHERE
          product.user_id = user.id
        ) AS product_count,
        (SELECT
          count(*)
        FROM
          collections collection
        WHERE
          collection.user_id = user.id
        ) AS collection_count
      FROM
        users user


All *_id fields are indexed. Output of EXPLAIN (ANALYZE, VERBOSE), with sensitive information removed:

 Limit  (cost=0.00..156500.97 rows=100 width=41) (actual time=0.064..28345.363 rows=100 loops=1)
   Output: (...), ((SubPlan 1)), ((SubPlan 2))
   ->  Seq Scan on public.users user  (cost=0.00..14549429167.11 rows=9296702 width=41) (actual time=0.064..28345.241 rows=100 loops=1)
         Output: (...), (SubPlan 1), (SubPlan 2)
         SubPlan 1
           ->  Aggregate  (cost=1415.84..1415.85 rows=1 width=0) (actual time=261.101..261.102 rows=1 loops=100)
                 Output: count(*)
                 ->  Bitmap Heap Scan on public.products product  (cost=7.32..1414.95 rows=355 width=0) (actual time=0.282..260.767 rows=382 loops=100)
                       Output: (...)
                       Recheck Cond: (product.user_id = user.id)
                       Heap Blocks: exact=32882
                       ->  Bitmap Index Scan on products_user_id_index  (cost=0.00..7.23 rows=355 width=0) (actual time=0.165..0.165 rows=382 loops=100)
                             Index Cond: (product.user_id = user.id)
         SubPlan 2
           ->  Aggregate  (cost=149.13..149.14 rows=1 width=0) (actual time=22.333..22.333 rows=1 loops=100)
                 Output: count(*)
                 ->  Index Only Scan using collections_user_id_index on public.collections collection  (cost=0.43..149.02 rows=44 width=0) (actual time=0.610..22.300 rows=28 loops=100)
                       Output: collection.user_id
                       Index Cond: (collection.user_id = user.id)
                       Heap Fetches: 2850
 Planning time: 0.214 ms
 Execution time: 28345.508 ms


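Timings for the read query:

LIMIT 1: 0.695 ms
LIMIT 10: 10434 ms
LIMIT 100: 150471 ms

Since the query becomes very slow when retrieving more than a few rows, I am wondering whether it can be sped up.

If I were to beef up the DB machine, would adding more CPUs help? AFAIK Postgres does not execute a single query on multiple cores, so I am not sure how much that would help.

(Also somewhat related: why does count() for collections use an index-only scan, while the one for products uses a bitmap heap scan?)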

Answer 1

Erw*_*ter 5

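When computing the numbers for all or most users, it is much more efficient to aggregate the total counts per user in plain subqueries first and then join, instead of running correlated subqueries: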

SELECT u.*, p.product_count, c.collection_count
FROM   users u
LEFT   JOIN (
   SELECT user_id AS id, count(*) AS product_count
   FROM   products
   GROUP  BY 1
   ) p USING (id)
LEFT   JOIN (
   SELECT user_id AS id, count(*) AS collection_count
   FROM   collections
   GROUP  BY 1
   ) c USING (id);

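One difference from the correlated version is worth noting: with LEFT JOIN, a user without any products or collections gets NULL instead of 0, because there is no matching row in the aggregated subquery. If zeros are wanted (for example for the ElasticSearch documents), a minimal variant of the same query could wrap the counts in COALESCE. The output column names simply reuse those from the question:

```sql
-- Same shape as the join query above. COALESCE turns the NULL that
-- LEFT JOIN produces for users without matching rows into 0.
SELECT u.*
     , COALESCE(p.product_count, 0)    AS product_count
     , COALESCE(c.collection_count, 0) AS collection_count
FROM   users u
LEFT   JOIN (
   SELECT user_id AS id, count(*) AS product_count
   FROM   products
   GROUP  BY 1
   ) p USING (id)
LEFT   JOIN (
   SELECT user_id AS id, count(*) AS collection_count
   FROM   collections
   GROUP  BY 1
   ) c USING (id);
```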

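The index-only scan and the bitmap index scan we see in the EXPLAIN output only benefit queries on a small subset of rows (LIMIT 100). Your test is misleading in this respect. When computing the numbers for all (or most) users, indexes will not help. Sequential scans will be faster.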
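The same aggregate-then-join pattern should carry over to the per-collection product counts mentioned in the question. The following is only a sketch: it assumes that products references its collection through a column named collection_id and that collections has an id primary key. Neither name appears in the question, so adjust them to the real schema.

```sql
-- Sketch under assumptions: products.collection_id and collections.id
-- are hypothetical column names, not taken from the question.
SELECT c.*
     , COALESCE(p.product_count, 0) AS product_count
FROM   collections c
LEFT   JOIN (
   SELECT collection_id AS id, count(*) AS product_count
   FROM   products
   GROUP  BY 1
   ) p USING (id);
```

As with the users query, this reads products once in full instead of probing an index once per collection.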

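The Bitmap Heap Scan you see is just the necessary second step of the bitmap index scan, which comes as a surprise for only 100 rows. Either your table statistics are out of date, or those 100 users have many related products, or rows in products are heavily clustered (physically, meaning that multiple rows of one user reside on the same or on few data pages). Postgres only switches to a bitmap index scan when it expects to find multiple rows per data page, which is a surprise for 100 users and 52.000.000 products (rows=355 expected and rows=382 found).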
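To check the statistics and clustering hypotheses, something along these lines could be run. It assumes the table lives in the public schema, as the EXPLAIN output suggests. The catalog views and columns used (pg_stat_user_tables, pg_stats) are standard PostgreSQL.

```sql
-- When were statistics for products last gathered?
SELECT relname, last_analyze, last_autoanalyze
FROM   pg_stat_user_tables
WHERE  schemaname = 'public'
AND    relname    = 'products';

-- Refresh planner statistics for the table.
ANALYZE products;

-- correlation close to 1 or -1 means the physical row order follows
-- user_id, i.e. one user's rows tend to sit on the same or adjacent pages.
SELECT attname, n_distinct, correlation
FROM   pg_stats
WHERE  schemaname = 'public'
AND    tablename  = 'products'
AND    attname    = 'user_id';
```

A correlation near 0 together with a recent last_analyze would point away from the clustering and stale-statistics explanations and toward those particular users simply owning many products.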

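@dvcrn: A large OFFSET is typically almost as expensive as increasing LIMIT by the same amount. It depends on the details of your query, tables, and indexes. Related answers [here](http://dba.stackexchange.com/a/112680/3684) or [here](http://dba.stackexchange.com/a/125959/3684). But this is hardly related to your original question any more. If you still need it, please ask a ***new question***. (2 upvotes)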
