Optimize a query with a small LIMIT, a predicate on one column and ORDER BY on another

god*_*yan 5 postgresql performance index optimization postgresql-9.3 postgresql-performance

I'm using Postgres 9.3.4 and I have 4 queries with very similar inputs but vastly different response times:

Query #1

EXPLAIN ANALYZE SELECT posts.* FROM posts
WHERE posts.source_id IN (19082, 19075, 20705, 18328, 19110, 24965, 18329, 27600, 17804, 20717, 27598, 27599)
AND posts.deleted_at IS NULL
ORDER BY external_created_at desc
LIMIT 100 OFFSET 0;
                                                                                 QUERY PLAN
----------------------------------------------------------------------------------------------------------------------------------------------------------------------------
 Limit  (cost=0.43..585.44 rows=100 width=1041) (actual time=326092.852..507360.199 rows=100 loops=1)
   ->  Index Scan using index_posts_on_external_created_at on posts  (cost=0.43..14871916.35 rows=2542166 width=1041) (actual time=326092.301..507359.524 rows=100 loops=1)
         Filter: (source_id = ANY ('{19082,19075,20705,18328,19110,24965,18329,27600,17804,20717,27598,27599}'::integer[]))
         Rows Removed by Filter: 6913925
 Total runtime: 507361.944 ms

Query #2

EXPLAIN ANALYZE SELECT posts.* FROM posts
WHERE posts.source_id IN (5202, 5203, 661, 659, 662, 627)
AND posts.deleted_at IS NULL
ORDER BY external_created_at desc
LIMIT 100 OFFSET 0;                                            

    QUERY PLAN
--------------------------------------------------------------------------------------------------------------------------------------------------------
 Limit  (cost=31239.64..31239.89 rows=100 width=1041) (actual time=2.004..2.038 rows=100 loops=1)
   ->  Sort  (cost=31239.64..31261.26 rows=8648 width=1041) (actual time=2.003..2.017 rows=100 loops=1)
         Sort Key: external_created_at
         Sort Method: top-N heapsort  Memory: 80kB
         ->  Index Scan using index_posts_on_source_id on posts  (cost=0.44..30909.12 rows=8648 width=1041) (actual time=0.024..1.063 rows=944 loops=1)
               Index Cond: (source_id = ANY ('{5202,5203,661,659,662,627}'::integer[]))
               Filter: (deleted_at IS NULL)
               Rows Removed by Filter: 109
 Total runtime: 2.125 ms

Query #3

EXPLAIN ANALYZE SELECT posts.* FROM posts
WHERE posts.source_id IN (14790, 14787, 32928, 14796, 14791, 15503, 14789, 14772, 15506, 14794, 15543, 31615, 15507, 15508, 14800)
AND posts.deleted_at IS NULL
ORDER BY external_created_at desc
LIMIT 100 OFFSET 0;
                                                                             QUERY PLAN
--------------------------------------------------------------------------------------------------------------------------------------------------------------------
 Limit  (cost=0.43..821.25 rows=100 width=1041) (actual time=19.224..55.599 rows=100 loops=1)
   ->  Index Scan using index_posts_on_external_created_at on posts  (cost=0.43..14930351.58 rows=1818959 width=1041) (actual time=19.213..55.529 rows=100 loops=1)
         Filter: (source_id = ANY ('{14790,14787,32928,14796,14791,15503,14789,14772,15506,14794,15543,31615,15507,15508,14800}'::integer[]))
         Rows Removed by Filter: 414
 Total runtime: 55.683 ms

Query #4

EXPLAIN ANALYZE SELECT posts.* FROM posts
WHERE posts.source_id IN (18766, 18130, 18128, 18129, 19705, 28252, 18264, 18126, 18767, 27603, 28657, 28654, 28655, 19706, 18330)
AND posts.deleted_at IS NULL
ORDER BY external_created_at desc
LIMIT 100 OFFSET 0;
                                                                            QUERY PLAN
-------------------------------------------------------------------------------------------------------------------------------------------------------------------
 Limit  (cost=0.43..69055.29 rows=100 width=1041) (actual time=26.094..320.626 rows=100 loops=1)
   ->  Index Scan using index_posts_on_external_created_at on posts  (cost=0.43..14930351.58 rows=21621 width=1041) (actual time=26.093..320.538 rows=100 loops=1)
         Filter: (source_id = ANY ('{18766,18130,18128,18129,19705,28252,18264,18126,18767,27603,28657,28654,28655,19706,18330}'::integer[]))
         Rows Removed by Filter: 6156
 Total runtime: 320.778 ms

All 4 are identical except that they look at posts with different source_ids.

Three of the four end up using the following index:

CREATE INDEX index_posts_on_external_created_at ON posts USING btree (external_created_at DESC)
WHERE (deleted_at IS NULL);

#2 uses this index:

CREATE INDEX index_posts_on_source_id ON posts USING btree (source_id);

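What's interesting to me is that of the 3 that use the index_posts_on_external_created_at index, two are quite fast while the other (#1) is extremely slow.

Query #2 matches far fewer posts than the other 3, which might explain why it uses the index_posts_on_source_id index instead. However, if I drop the index_posts_on_external_created_at index, the other 3 queries are extremely slow when forced to use index_posts_on_source_id.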

Here is my definition of the posts table:

CREATE TABLE posts (
    id integer NOT NULL,
    source_id integer,
    message text,
    image text,
    external_id text,
    created_at timestamp without time zone,
    updated_at timestamp without time zone,
    external text,
    like_count integer DEFAULT 0 NOT NULL,
    comment_count integer DEFAULT 0 NOT NULL,
    external_created_at timestamp without time zone,
    deleted_at timestamp without time zone,
    poster_name character varying(255),
    poster_image text,
    poster_url character varying(255),
    poster_id text,
    position integer,
    location character varying(255),
    description text,
    video text,
    rejected_at timestamp without time zone,
    deleted_by character varying(255),
    height integer,
    width integer
);

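I have tried running CLUSTER posts USING index_posts_on_external_created_at

This essentially rewrites the table in external_created_at order, and it seems to be the only effective approach I have found. However, I can't use it in production because it holds a lock on the table for several hours while it runs. I'm on heroku, so I can't install pg_repack or anything like that.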

Why is query #1 so slow while the others are really fast? What can I do to mitigate this?

EDIT: Here are my queries without the LIMIT (the ORDER BY is still in place)

Query #1

EXPLAIN ANALYZE SELECT posts.* FROM posts
WHERE posts.source_id IN (19082, 19075, 20705, 18328, 19110, 24965, 18329, 27600, 17804, 20717, 27598, 27599)
AND posts.deleted_at IS NULL
ORDER BY external_created_at desc;
                                                                        QUERY PLAN
-----------------------------------------------------------------------------------------------------------------------------------------------------------
 Sort  (cost=7455044.81..7461163.56 rows=2447503 width=1089) (actual time=94903.143..95110.898 rows=238975 loops=1)
   Sort Key: external_created_at
   Sort Method: external merge  Disk: 81440kB
   ->  Bitmap Heap Scan on posts  (cost=60531.78..1339479.50 rows=2447503 width=1089) (actual time=880.150..90988.460 rows=238975 loops=1)
         Recheck Cond: (source_id = ANY ('{19082,19075,20705,18328,19110,24965,18329,27600,17804,20717,27598,27599}'::integer[]))
         Rows Removed by Index Recheck: 5484857
         Filter: (deleted_at IS NULL)
         Rows Removed by Filter: 3108465
         ->  Bitmap Index Scan on index_posts_on_source_id  (cost=0.00..59919.90 rows=3267549 width=0) (actual time=877.904..877.904 rows=3347440 loops=1)
               Index Cond: (source_id = ANY ('{19082,19075,20705,18328,19110,24965,18329,27600,17804,20717,27598,27599}'::integer[]))
 Total runtime: 95534.724 ms

Query #2

EXPLAIN ANALYZE SELECT posts.* FROM posts
WHERE posts.source_id IN (5202, 5203, 661, 659, 662, 627)
AND posts.deleted_at IS NULL
ORDER BY external_created_at desc;
                                                                     QUERY PLAN
----------------------------------------------------------------------------------------------------------------------------------------------------
 Sort  (cost=36913.72..36935.85 rows=8852 width=1089) (actual time=212.450..212.549 rows=944 loops=1)
   Sort Key: external_created_at
   Sort Method: quicksort  Memory: 557kB
   ->  Index Scan using index_posts_on_source_id on posts  (cost=0.44..32094.90 rows=8852 width=1089) (actual time=1.732..209.590 rows=944 loops=1)
         Index Cond: (source_id = ANY ('{5202,5203,661,659,662,627}'::integer[]))
         Filter: (deleted_at IS NULL)
         Rows Removed by Filter: 109
 Total runtime: 214.507 ms

Query #3

EXPLAIN ANALYZE SELECT posts.* FROM posts
WHERE posts.source_id IN (14790, 14787, 32928, 14796, 14791, 15503, 14789, 14772, 15506, 14794, 15543, 31615, 15507, 15508, 14800)
AND posts.deleted_at IS NULL
ORDER BY external_created_at desc;
                                                                        QUERY PLAN
-----------------------------------------------------------------------------------------------------------------------------------------------------------
 Sort  (cost=5245032.87..5249894.14 rows=1944508 width=1089) (actual time=131032.952..134342.372 rows=1674072 loops=1)
   Sort Key: external_created_at
   Sort Method: external merge  Disk: 854864kB
   ->  Bitmap Heap Scan on posts  (cost=48110.86..1320005.55 rows=1944508 width=1089) (actual time=605.648..91351.334 rows=1674072 loops=1)
         Recheck Cond: (source_id = ANY ('{14790,14787,32928,14796,14791,15503,14789,14772,15506,14794,15543,31615,15507,15508,14800}'::integer[]))
         Rows Removed by Index Recheck: 5304550
         Filter: (deleted_at IS NULL)
         Rows Removed by Filter: 879414
         ->  Bitmap Index Scan on index_posts_on_source_id  (cost=0.00..47624.73 rows=2596024 width=0) (actual time=602.744..602.744 rows=2553486 loops=1)
               Index Cond: (source_id = ANY ('{14790,14787,32928,14796,14791,15503,14789,14772,15506,14794,15543,31615,15507,15508,14800}'::integer[]))
 Total runtime: 136176.868 ms

Query #4

EXPLAIN ANALYZE SELECT posts.* FROM posts
WHERE posts.source_id IN (18766, 18130, 18128, 18129, 19705, 28252, 18264, 18126, 18767, 27603, 28657, 28654, 28655, 19706, 18330)
AND posts.deleted_at IS NULL
ORDER BY external_created_at desc;
                                                                       QUERY PLAN
---------------------------------------------------------------------------------------------------------------------------------------------------------
 Sort  (cost=102648.92..102704.24 rows=22129 width=1089) (actual time=15225.250..15256.931 rows=51408 loops=1)
   Sort Key: external_created_at
   Sort Method: external merge  Disk: 35456kB
   ->  Index Scan using index_posts_on_source_id on posts  (cost=0.45..79869.91 rows=22129 width=1089) (actual time=3.975..14803.320 rows=51408 loops=1)
         Index Cond: (source_id = ANY ('{18766,18130,18128,18129,19705,28252,18264,18126,18767,27603,28657,28654,28655,19706,18330}'::integer[]))
         Filter: (deleted_at IS NULL)
         Rows Removed by Filter: 54
 Total runtime: 15397.453 ms

Postgres memory settings:

name, setting, unit
'default_statistics_target','100',''
'effective_cache_size','16384','8kB'
'maintenance_work_mem','16384','kB'
'max_connections','100',''
'random_page_cost','4',NULL
'seq_page_cost','1',NULL
'shared_buffers','16384','8kB'
'work_mem','1024','kB'

Database statistics:

Total Posts: 20,997,027
Posts where deleted_at is null: 15,665,487
Distinct source_id's: 22,245
Max number of rows per single source_id: 1,543,950
Min number of rows per single source_id: 1
Most source_ids in a single query: 21
Distinct external_created_at: 11,146,151

Erw*_*ter 5

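General advice

All the usual advice on performance optimization applies. Your settings are very conservative, and some of the resource settings are too low for tables with millions of rows (work_mem in particular). You need to configure your RDBMS to make sensible use of the available RAM. The Postgres Wiki is a good starting point. This is beyond the scope of a single question here.

However, the query I suggest below needs only very modest resource settings.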
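As a rough illustration of the work_mem point: the plans in the question report "Sort Method: external merge  Disk: ...", which means the sort spilled to disk under the 1 MB setting. A minimal sketch of raising it for one session only follows. The value 64MB is purely an assumption for illustration; the right figure depends on available RAM and the number of concurrent connections.

```sql
-- Check the current per-operation sort/hash memory budget.
SHOW work_mem;

-- Assumption: 64MB is an illustrative value, not a recommendation.
-- SET only affects the current session.
SET work_mem = '64MB';

-- ... run the query with the large sort here and compare the
--     "Sort Method" line in EXPLAIN ANALYZE ...

-- Return to the server default afterwards.
RESET work_mem;
```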

Also increase the statistics target for source_id so that more detailed statistics are collected for the key column:

ALTER TABLE posts ALTER COLUMN source_id SET STATISTICS 2000;  -- or similar

Then: ANALYZE posts;

More:

You can also optimize storage further (for a small gain):
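To see what the planner actually knows about source_id after running ANALYZE, the pg_stats view can be queried. This is only a sketch; it assumes the posts table lives in the public schema, which the question does not state.

```sql
-- Inspect the collected statistics for posts.source_id.
SELECT attname,
       n_distinct,          -- estimated number of distinct values
       most_common_vals,    -- list grows with the statistics target
       most_common_freqs
FROM   pg_stats
WHERE  schemaname = 'public'   -- assumption: default schema
AND    tablename  = 'posts'
AND    attname    = 'source_id';
```

With a higher statistics target, more of the frequent source_id values should appear in most_common_vals, which gives the planner better row estimates for IN lists such as the ones in the question.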

Query

The query itself is hard to optimize. For advanced performance optimization, see this related question by @ypercube:

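There is a simple approach, if ...

  • the number of distinct source_id values per query is fairly small
  • and the LIMIT is also fairly small.

... which, according to the details you added, holds for your case.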

The only index needed for the following query:

CREATE INDEX posts_special_idx ON posts (source_id, external_created_at DESC)
WHERE deleted_at IS NULL;
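Since the question mentions that long locks are not acceptable in production, one possible variant is to build the same index with CONCURRENTLY, which avoids blocking writes to posts during the build. This is a sketch rather than part of the original recommendation: the build takes longer, cannot run inside a transaction block, and leaves an invalid index behind if it fails.

```sql
-- Same index definition, built without blocking INSERT/UPDATE/DELETE.
-- Must be executed outside a transaction block.
CREATE INDEX CONCURRENTLY posts_special_idx
ON posts (source_id, external_created_at DESC)
WHERE deleted_at IS NULL;

-- If the concurrent build fails, an INVALID index remains.
-- Drop it before retrying:
-- DROP INDEX posts_special_idx;
```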

Example based on your query #1:

SELECT p.*
FROM   unnest('{19082, 19075, 20705, 18328, 19110, 24965, 18329, 27600
              , 17804, 20717, 27598, 27599}'::int[]) s(source_id)
     , LATERAL (
   SELECT *
   FROM   posts
   WHERE  source_id = s.source_id
   AND    deleted_at IS NULL
   ORDER  BY external_created_at DESC
   LIMIT  100
   ) p
ORDER  BY p.external_created_at DESC
LIMIT  100;

This emulates a loose index scan, similar to what is discussed in detail here:

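If n is the number of source_ids (fortunately never > 21), we let Postgres fetch the top 100 rows (according to external_created_at DESC) per source_id from the index. That is very fast in itself, but up to (n-1) * 100 of the fetched rows are surplus. Given your value frequencies:

22,245 distinct source_id values, with between 1 and 1,543,950 rows each, 20,997,027 rows in total

(You did not say whether all of these numbers include "deleted" rows, but only about 25% are "deleted".)

... I expect that some source_ids have fewer than 100 rows to begin with. So in the worst case (21 source_ids with 100 rows each) we only need to sort 2100 rows to keep the top 100, and typically fewer. That should not perform badly, once you have configured Postgres with decent resource settings.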
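The same per-source top-N pattern works for any list of ids and any limit. A sketch as a prepared statement follows, so that the id array and the limit become parameters. The statement name top_posts is hypothetical, and the EXECUTE call reuses only three of the ids from query #1 for brevity.

```sql
-- $1 = array of source_ids, $2 = number of rows wanted.
PREPARE top_posts(int[], int) AS
SELECT p.*
FROM   unnest($1) s(source_id)
     , LATERAL (
   SELECT *
   FROM   posts
   WHERE  source_id = s.source_id
   AND    deleted_at IS NULL              -- matches the partial index predicate
   ORDER  BY external_created_at DESC
   LIMIT  $2                              -- top N per source_id
   ) p
ORDER  BY p.external_created_at DESC
LIMIT  $2;                                -- overall top N

-- Fetches at most 3 * 100 rows from the index, then keeps the newest 100.
EXECUTE top_posts('{19082,19075,20705}', 100);
```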

If you have a source table holding all distinct source_id values, it may make sense to use it and eliminate non-existent source_ids early:

SELECT p.*
FROM   source s, LATERAL ( ... ) p
WHERE  s.source_id IN (19082, 19075, 20705, ...)
ORDER  BY ...
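For reference, a filled-in sketch of this variant, reusing the LATERAL subquery from the earlier example. It assumes a table named source with a source_id column, as the abbreviated snippet implies. The IN list contains only the three ids quoted in that snippet.

```sql
SELECT p.*
FROM   source s                        -- assumption: one row per existing source_id
     , LATERAL (
   SELECT *
   FROM   posts
   WHERE  source_id = s.source_id
   AND    deleted_at IS NULL
   ORDER  BY external_created_at DESC
   LIMIT  100
   ) p
WHERE  s.source_id IN (19082, 19075, 20705)   -- only the ids shown; extend as needed
ORDER  BY p.external_created_at DESC
LIMIT  100;
```

Ids in the IN list that do not exist in source produce no row in s, so the LATERAL subquery is never executed for them.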

IN in this form is fine for up to 21 values, but consider this related question:

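If you know a minimum or maximum for external_created_at among the rows in the result, or the number of rows per single source_id, you can optimize further ...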