Why does Postgres do a table scan instead of using my index?

Question

Yeh*_*sef 5 postgresql indexing sql-execution-plan postgresql-performance

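I'm working with the HackerNews dataset in Postgres. There are about 17 million rows, of which roughly 14.5 million are comments and roughly 2.5 million are stories. There is a very active user named "rbanffy" who has about 25,000 submissions, split roughly evenly between stories and comments. Both "by" and "type" have separate indexes.

I have a query: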

SELECT *
FROM "hn_items"
WHERE by = 'rbanffy'
and type = 'story'
ORDER BY id DESC
LIMIT 20 OFFSET 0

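It runs quickly (it uses the "by" index). If I change the type to 'comment', it becomes very slow. Judging from the EXPLAIN output, it doesn't use any index and does a scan.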

Limit  (cost=0.56..56948.32 rows=20 width=1937)
  ->  Index Scan using hn_items_pkey on hn_items  (cost=0.56..45823012.32 rows=16093 width=1937)
        Filter: (((by)::text = 'rbanffy'::text) AND ((type)::text = 'comment'::text))

If I change the query to use type||''='comment', it uses the "by" index and executes quickly.
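A sketch of that workaround, reconstructed from the original query (the rewritten statement was not posted in full, so its exact form is an assumption). Concatenating an empty string leaves the result unchanged but turns the condition into an expression, so the planner has no column statistics for it and most likely falls back to a default selectivity estimate. With far fewer rows expected to match, walking the primary key index until 20 matches turn up no longer looks cheap.

```sql
-- Same query as above, with the type condition wrapped in an expression.
-- type || '' evaluates to the same value as type, so the result set is identical;
-- only the planner's row estimate changes.
SELECT *
FROM "hn_items"
WHERE by = 'rbanffy'
  AND type || '' = 'comment'
ORDER BY id DESC
LIMIT 20 OFFSET 0;
```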

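Why is this happening? I understand from /sf/answers/21687011/ that having to resort to a hack like this means something is wrong. But I don't know what.

Edit:
Here is the EXPLAIN for type='story'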

Limit  (cost=72553.07..72553.12 rows=20 width=1255)
  ->  Sort  (cost=72553.07..72561.25 rows=3271 width=1255)
        Sort Key: id DESC
        ->  Bitmap Heap Scan on hn_items  (cost=814.59..72466.03 rows=3271 width=1255)
              Recheck Cond: ((by)::text = 'rbanffy'::text)
              Filter: ((type)::text = 'story'::text)
              ->  Bitmap Index Scan on hn_items_by_index  (cost=0.00..813.77 rows=19361 width=0)
                    Index Cond: ((by)::text = 'rbanffy'::text)

Edit: EXPLAIN (ANALYZE, BUFFERS)

Limit  (cost=0.56..59510.10 rows=20 width=1255) (actual time=20.856..545.282 rows=20 loops=1)
  Buffers: shared hit=21597 read=2658 dirtied=32
  ->  Index Scan using hn_items_pkey on hn_items  (cost=0.56..47780210.70 rows=16058 width=1255) (actual time=20.855..545.271 rows=20 loops=1)
        Filter: (((by)::text = 'rbanffy'::text) AND ((type)::text = 'comment'::text))
        Rows Removed by Filter: 46798
        Buffers: shared hit=21597 read=2658 dirtied=32
Planning time: 0.173 ms
Execution time: 545.318 ms

Edit: EXPLAIN (ANALYZE, BUFFERS) for type='story'

Limit  (cost=72553.07..72553.12 rows=20 width=1255) (actual time=44.121..44.127 rows=20 loops=1)
  Buffers: shared hit=20137
  ->  Sort  (cost=72553.07..72561.25 rows=3271 width=1255) (actual time=44.120..44.123 rows=20 loops=1)
        Sort Key: id DESC
        Sort Method: top-N heapsort  Memory: 42kB
        Buffers: shared hit=20137
        ->  Bitmap Heap Scan on hn_items  (cost=814.59..72466.03 rows=3271 width=1255) (actual time=6.778..37.774 rows=11630 loops=1)
              Recheck Cond: ((by)::text = 'rbanffy'::text)
              Filter: ((type)::text = 'story'::text)
              Rows Removed by Filter: 12587
              Heap Blocks: exact=19985
              Buffers: shared hit=20137
              ->  Bitmap Index Scan on hn_items_by_index  (cost=0.00..813.77 rows=19361 width=0) (actual time=3.812..3.812 rows=24387 loops=1)
                    Index Cond: ((by)::text = 'rbanffy'::text)
                    Buffers: shared hit=152
Planning time: 0.156 ms
Execution time: 44.422 ms

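Edit: Latest findings. I was experimenting with the type='comment' query and noticed that if I raise the limit to a higher number (for example 100), it uses the by index. I played with the values until I found that the critical number is 47. With a limit of 47 the by index is used; with a limit of 46 it does the full scan. I assume this number isn't magic and just happens to be the threshold for my data set or for some other variable I don't know about. I don't know whether this helps.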

Answer 1

Lau*_*lbe 3

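Since there are many comments by rbanffy, PostgreSQL assumes that it will be fast enough if it searches the table in the order implied by the ORDER BY clause (which can use the primary key index) until it has found 20 rows that match the search condition.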

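Unfortunately, it seems that the guy has grown lazy lately; at any rate, PostgreSQL has to scan the 46798 highest ids until it has found its 20 hits. (You really shouldn't have removed the Backwards; that confused me.)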
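The planner's optimism shows up in the cost figures of the EXPLAIN (ANALYZE, BUFFERS) output in the question. PostgreSQL costs a Limit node by interpolating linearly between the startup and total cost of its input, on the assumption that matching rows are spread evenly along the scan. The figures below are copied from that plan; the column alias is only illustrative.

```sql
-- Limit cost = startup + (total - startup) * limit_rows / estimated_matching_rows
-- startup = 0.56, total = 47780210.70, estimated matching rows = 16058, LIMIT = 20
SELECT round(0.56 + (47780210.70 - 0.56) * 20 / 16058, 2) AS estimated_limit_cost;
```

This comes out at about 59510, the upper cost shown on the Limit node, which is lower than the 72553 estimated for the bitmap scan plus sort in the type='story' plan. Since the estimate grows linearly with the LIMIT, a larger limit eventually makes the primary key scan look more expensive than the alternative, which would be consistent with the plan switching at a limit of 47.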

The best way to work around this is to confuse PostgreSQL so that it does not choose the primary key index, perhaps like this:

SELECT *
FROM (SELECT * FROM hn_items
      WHERE by = 'rbanffy'
        AND type = 'comment'
      OFFSET 0) q
ORDER BY id DESC
LIMIT 20;
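To check whether the rewrite has the intended effect, the same EXPLAIN (ANALYZE, BUFFERS) used in the question can be run on it. The OFFSET 0 keeps PostgreSQL from flattening the subquery into the outer query, so the inner scan is planned without regard to the outer ORDER BY and LIMIT. The resulting plan is not shown in the source, so none is assumed here.

```sql
-- Inspect the plan of the rewritten query.
-- One would expect hn_items_by_index to be used in place of hn_items_pkey,
-- but this has to be confirmed against the actual output.
EXPLAIN (ANALYZE, BUFFERS)
SELECT *
FROM (SELECT * FROM hn_items
      WHERE by = 'rbanffy'
        AND type = 'comment'
      OFFSET 0) q
ORDER BY id DESC
LIMIT 20;
```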

Archived:	8 years, 1 month ago
Viewed:	1167 times
Last activity:	8 years, 1 month ago