PostgreSQL not using index on a filtered multiple sort query

Mik*_*rst 6 sql sorting postgresql indexing postgresql-performance

I have a pretty simple table

CREATE TABLE approved_posts (
  project_id INTEGER,
  feed_id INTEGER,
  post_id INTEGER,
  approved_time TIMESTAMP NOT NULL,
  post_time TIMESTAMP NOT NULL,
  PRIMARY KEY (project_id, feed_id, post_id)
)
Run Code Online (Sandbox Code Playgroud)

And I'm trying to optimize this query:

SELECT *
FROM approved_posts
WHERE feed_id IN (?, ?, ?)
AND project_id = ?
ORDER BY approved_time DESC, post_time DESC
LIMIT 1;
Run Code Online (Sandbox Code Playgroud)

The query optimizer is fetching every single approved_post that matches the predicate, sorting all 100k results, and returning the top one it finds.

I do have an index on project_id, feed_id, approved_time, post_time, which it will use if I either:
A. remove the sort by post_time, or
B. replace the IN (?, ?, ?) with a single = ?.
Then it simply does a reverse index scan to get the first result and its blazingly fast.

Option A:

 Limit  (cost=0.43..6.57 rows=1 width=24) (actual time=0.101..0.101 rows=1 loops=1)
   ->  Index Scan Backward using approved_posts_approved_time_idx on approved_posts p  (cost=0.43..840483.02 rows=136940 width=24) (actual time=0.100..0.100 rows=1 loops=1)
     Filter: (feed_id = ANY ('{73321,73771,73772,73773,73774}'::integer[]))
     Rows Removed by Filter: 37
 Total runtime: 0.129 ms
Run Code Online (Sandbox Code Playgroud)

Option B:

Limit  (cost=0.43..3.31 rows=1 width=24) (actual time=0.065..0.065 rows=1 loops=1)
   ->  Index Scan Backward using approved_posts_full_pagination_index on approved_posts p  (cost=0.43..126884.70 rows=44049 width=24) (actual time=0.063..0.063 rows=1 loops=1)
     Index Cond: ((project_id = 148772) AND (feed_id = 73321))
 Total runtime: 0.092 ms
Run Code Online (Sandbox Code Playgroud)

But without these tweaks it is not so performant ...

Limit  (cost=169792.16..169792.17 rows=1 width=24) (actual time=510.225..510.225 rows=1 loops=1)
   ->  Sort  (cost=169792.16..170118.06 rows=130357 width=24) (actual time=510.224..510.224 rows=1 loops=1)
     Sort Key: approved_time, post_time
     Sort Method: top-N heapsort  Memory: 25kB
     ->  Bitmap Heap Scan on approved_posts p  (cost=12324.41..169140.38 rows=130357 width=24) (actual time=362.210..469.387 rows=126260 loops=1)
           Recheck Cond: (feed_id = ANY ('{73321,73771,73772,73773,73774}'::integer[]))
           ->  Bitmap Index Scan on approved_posts_feed_id_idx  (cost=0.00..12291.82 rows=130357 width=0) (actual time=354.496..354.496 rows=126260 loops=1)
                 Index Cond: (feed_id = ANY ('{73321,73771,73772,73773,73774}'::integer[]))
Total runtime: 510.265 ms
Run Code Online (Sandbox Code Playgroud)

I can even add a conditional index on these 5 feed ids and it will once again do the right thing.

My current best solution is to put every feed_id in its own query and do a massive UNION between them all. But this doesn't scale very well as I might want to select the top 500 from 30 feeds, pulling in 15k rows and sorting them for no good reason. Also managing offsets with this strategy is somewhat complex.

Does anybody know how I can do this IN clause with two sorts on my well-indexed data and get Postgres to do the right thing?

I'm using Postgres 9.3.3. Here are my indexes:

 "approved_posts_project_id_feed_id_post_id_key" UNIQUE CONSTRAINT, btree (project_id, feed_id, post_id)
 "approved_posts_approved_time_idx" btree (approved_time)
 "approved_posts_feed_id_idx" btree (feed_id)
 "approved_posts_full_pagination_index" btree (project_id, feed_id, approved_time, post_time)
 "approved_posts_post_id_idx" btree (post_id)
 "approved_posts_post_time_idx" btree (post_time)
 "approved_posts_project_id_idx" btree (project_id)
Run Code Online (Sandbox Code Playgroud)

None of the columns are nullable.

This table has 2m rows, split among 200 feed IDs and 19 project IDs.

These are the most common feed IDs:

 feed_id | count  
---------+--------
   73607 | 558860
   73837 | 354018
   73832 | 220285
   73836 | 172664
   73321 | 118695
   73819 |  95999
   73821 |  75871
   73056 |  65779
   73070 |  54655
   73827 |  43710
   73079 |  36700
   73574 |  36111
   73055 |  25682
   73072 |  22596
   73589 |  19856
   73953 |  15286
   73159 |  13059
   73839 |   8925
Run Code Online (Sandbox Code Playgroud)

In terms of min/max/avg cardinality per feedid/projectid pairing, we have:

 min |  max   |          avg          
-----+--------+-----------------------
   1 | 559021 | 9427.9140271493212670
Run Code Online (Sandbox Code Playgroud)

Erw*_*ter 6

有了 的可能值列表feed_id,Postgres 很难找到最佳查询计划。每个feed_id都可以与 1 - 559021 行(根据您的数字)相关联。Postgres 目前还不够聪明,无法单独看到特殊情况下的潜在优化LIMIT 1。一个UNION ALL(不仅仅是UNION)多个查询,一个feed_idLIMIT 1一个,加上另一个外部LIMIT 1(就像您似乎已经尝试过)展示了潜力,但需要复杂的查询串联以获取可变数量的输入值。

还有另一种方法可以说服查询规划器它可以使用索引扫描从索引中为每个选择第一行feed_id:用LATERAL连接重写您的查询:

SELECT a.*
FROM   (VALUES (?), (?), (?)) AS t(feed_id)
     , LATERAL (
   SELECT *
   FROM   approved_posts
   WHERE  project_id = ?
   AND    feed_id = t.feed_id
   ORDER  BY approved_time DESC, post_time DESC
   LIMIT  1
   ) a
ORDER  BY approved_time DESC, post_time DESC
LIMIT  1;
Run Code Online (Sandbox Code Playgroud)

或者,对于可变数量的值更方便feed_id

SELECT a.*
FROM   unnest(?) AS t(feed_id)  -- provide int[] var
     , LATERAL ( ...
Run Code Online (Sandbox Code Playgroud)

为变量传递一个整数数组,如'{123, 234, 345}'::int[]. 这也可以通过使用VARIADIC参数的函数优雅地实现。然后你可以传递一个integer值列表:

您的索引(project_id, feed_id, approved_time, post_time)适用于此,因为 Postgres 可以几乎与向前扫描索引一样快地向后扫描索引,但(project_id, feed_id, approved_time DESC, post_time DESC)会更好。看:

如果您不需要返回表的所有列,则甚至可以选择仅索引扫描。

你列approved_timepost_time被定义NOT NULL。否则,你必须做更多:

详细说明LATERAL连接技术的相关答案:

为什么你的选项 A 有效?

仔细观察会发现两件事

-> 使用approved_posts_approved_time_idx向后扫描索引
    在批准的帖子 p (成本=0.43..840483.02 行=136940 宽度=24)
                        (实际时间=0.100..0.100 行=1 循环=1)
     过滤器: (feed_id = ANY ('{73321,73771,73772,73773,73774}'::integer[]))

大胆强调我的。

  1. 使用了一个不同的、较小的 just 索引(approved_time)
  2. 没有指数条件feed_id(这将不会在这种情况下可能的),而是一个过滤器

Postgres 选择了一种完全不同的策略:它从该索引自下而上 ( Index Scan Backward)读取行,直到找到与 . 的给定值之一匹配的行feed_id。由于您只有很少的项目和提要 ( 200 feed IDs and 19 project IDs),因此很可能不必在第一次匹配之前丢弃太多行 - 这就是结果。对于 的值越多,这实际上会变得更快,因为更早找到“最新”行 - 与我的第一种方法不同,该方法对于较少的值更快。feed_id

一个有前途的替代策略!根据数据分布和查询中的提要,它可能比我的第一个解决方案更快 -使用以下索引启用它

"approved_posts_foo_idx" btree (project_id, approved_time DESC, post_time DESC)
Run Code Online (Sandbox Code Playgroud)

有选择地增加列的统计目标可能是值得的project_idfeed_id因此可以更准确地估计两种策略之间的临界点。


由于您的项目只有旧行(根据评论),您可以使用有关最大值的提示来改进此查询approved_time(并且post_time,但这可能不会增加太多) -如果知道approved_time每个项目的最大值(和/或 per feed_id) ,或者至少是一个上限。

SELECT ...
WHERE  ...
AND   approved_time <= $upper_bound
Run Code Online (Sandbox Code Playgroud)