How to correctly implement compound greatest-n filtering

Fak*_*ame 5 postgresql performance greatest-n-per-group postgresql-performance

Yes, another greatest-n-per-group question.

Given a table releases with the following columns:

 id         | primary key                 | 
 volume     | double precision            |
 chapter    | double precision            |
 series     | integer-foreign-key         |
 include    | boolean                     | not null

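I want to select the compound maximum of volume, then chapter, for each series in a set of series.

Right now, if I query per distinct series, I can easily do this as follows: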

SELECT 
       releases.chapter AS releases_chapter,
       releases.include AS releases_include,
       releases.series AS releases_series
FROM releases
WHERE releases.series = 741
  AND releases.include = TRUE
ORDER BY releases.volume DESC NULLS LAST, releases.chapter DESC NULLS LAST LIMIT 1;

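However, if I have a large number of series (and I do), this quickly runs into efficiency problems: I end up issuing 100+ queries to generate a single page.

I'd like to roll the whole thing into a single query where I can simply say WHERE releases.series IN (1,2,3....), but I haven't figured out how to convince Postgres to let me do that.

The naive approach would be: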

SELECT releases.volume AS releases_volume,
       releases.chapter AS releases_chapter,
       releases.series AS releases_series
FROM 
    releases
WHERE 
    releases.series IN (12, 17, 44, 79, 88, 110, 129, 133, 142, 160, 193, 231, 235, 295, 340, 484, 499, 
                        556, 581, 664, 666, 701, 741, 780, 790, 796, 874, 930, 1066, 1091, 1135, 1137, 
                        1172, 1331, 1374, 1418, 1435, 1447, 1471, 1505, 1521, 1540, 1616, 1702, 1768, 
                        1825, 1828, 1847, 1881, 2007, 2020, 2051, 2085, 2158, 2183, 2190, 2235, 2255, 
                        2264, 2275, 2325, 2333, 2334, 2337, 2341, 2343, 2348, 2370, 2372, 2376, 2606, 
                        2634, 2636, 2695, 2696 )
  AND releases.include = TRUE
GROUP BY 
    releases_series
ORDER BY releases.volume DESC NULLS LAST, releases.chapter DESC NULLS LAST;

Which obviously doesn't work:

ERROR:  column "releases.volume" must appear in the 
        GROUP BY clause or be used in an aggregate function

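Without the GROUP BY, it does fetch everything, and with some simple procedural filtering it would even work, but there must be a "proper" way to do this in SQL.

Following the error's advice and adding aggregates: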

SELECT max(releases.volume) AS releases_volume,
       max(releases.chapter) AS releases_chapter,
       releases.series AS releases_series
FROM 
    releases
WHERE 
    releases.series IN (12, 17, 44, 79, 88, 110, 129, 133, 142, 160, 193, 231, 235, 295, 340, 484, 499, 
                        556, 581, 664, 666, 701, 741, 780, 790, 796, 874, 930, 1066, 1091, 1135, 1137, 
                        1172, 1331, 1374, 1418, 1435, 1447, 1471, 1505, 1521, 1540, 1616, 1702, 1768, 
                        1825, 1828, 1847, 1881, 2007, 2020, 2051, 2085, 2158, 2183, 2190, 2235, 2255, 
                        2264, 2275, 2325, 2333, 2334, 2337, 2341, 2343, 2348, 2370, 2372, 2376, 2606, 
                        2634, 2636, 2695, 2696 )
  AND releases.include = TRUE
GROUP BY 
    releases_series;

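This mostly works, but the two maxima are computed independently, so they can come from different rows. If I have two rows where volume:chapter is 1:5 and 4:1, I need 4:1 returned, but the independent maxima return 4:5.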
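
To make the mismatch concrete, here is a minimal sketch using only the two rows from the example above. The sample CTE and its output column names are hypothetical stand-ins for a single series of the releases table.

```sql
-- Hypothetical two-row sample: volume:chapter of 1:5 and 4:1.
WITH sample(volume, chapter) AS (
    VALUES (1::double precision, 5::double precision)
         , (4, 1)
)
SELECT (SELECT max(volume)  FROM sample) AS max_volume   -- 4, taken from the 4:1 row
     , (SELECT max(chapter) FROM sample) AS max_chapter  -- 5, taken from the 1:5 row
     , s.volume  AS top_volume                           -- 4
     , s.chapter AS top_chapter                          -- 1, same row as top_volume
FROM   sample s
ORDER  BY s.volume DESC NULLS LAST, s.chapter DESC NULLS LAST
LIMIT  1;
```

The two aggregate columns combine values from different rows (4:5), while the ORDER BY ... LIMIT 1 columns come from one row (4:1), which is the behavior I need per series.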

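Frankly, this would be so simple to implement in my application code that I must be missing something obvious here. How can I write a query that actually satisfies my requirements?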

Erw*_*ter 3

The simple solution in Postgres is DISTINCT ON:

SELECT DISTINCT ON (r.series)
       r.volume  AS releases_volume
     , r.chapter AS releases_chapter
     , r.series  AS releases_series
FROM   releases r
WHERE  r.series IN (
    12, 17, 44, 79, 88, 110, 129, 133, 142, 160, 193, 231, 235, 295, 340, 484, 499
  , 556, 581, 664, 666, 701, 741, 780, 790, 796, 874, 930, 1066, 1091, 1135, 1137
  , 1172, 1331, 1374, 1418, 1435, 1447, 1471, 1505, 1521, 1540, 1616, 1702, 1768
  , 1825, 1828, 1847, 1881, 2007, 2020, 2051, 2085, 2158, 2183, 2190, 2235, 2255
  , 2264, 2275, 2325, 2333, 2334, 2337, 2341, 2343, 2348, 2370, 2372, 2376, 2606
  , 2634, 2636, 2695, 2696)
AND    r.include
ORDER  BY r.series, r.volume DESC NULLS LAST, r.chapter DESC NULLS LAST;
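
For comparison, since DISTINCT ON is a Postgres extension, the same result can also be sketched with the standard row_number() window function. This variant is not part of the original answer; the alias sub, the column rn, and the shortened IN list are assumptions made for illustration.

```sql
SELECT releases_volume, releases_chapter, releases_series
FROM  (
   SELECT r.volume  AS releases_volume
        , r.chapter AS releases_chapter
        , r.series  AS releases_series
          -- Number the rows of each series, best (volume, chapter) first.
        , row_number() OVER (PARTITION BY r.series
                             ORDER BY r.volume DESC NULLS LAST
                                    , r.chapter DESC NULLS LAST) AS rn
   FROM   releases r
   WHERE  r.series IN (12, 17, 44)  -- shortened list, for illustration only
   AND    r.include
   ) sub
WHERE  rn = 1;  -- keep only the top row per series
```

Both forms take volume and chapter from the same row per series, so a 4:1 row beats a 1:5 row.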

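Depending on data distribution, there may be faster techniques.

Also, for long lists there are faster alternatives to IN ().

Combining an unnested array with a LATERAL join: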

SELECT r.*
FROM   unnest('{12, 17, 44, 79, 88, 110, 129}'::int[]) t(i)  -- or many more items
     , LATERAL (
   SELECT volume  AS releases_volume
        , chapter AS releases_chapter
        , series  AS releases_series
   FROM   releases
   WHERE  series = t.i 
   AND    include
   ORDER  BY series, volume DESC NULLS LAST, chapter DESC NULLS LAST
   LIMIT  1
   ) r;
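
If the list of series comes from application code, one way to avoid building a long literal is to pass it as a single array parameter. A minimal sketch, assuming a prepared statement with the hypothetical name latest_releases:

```sql
-- latest_releases is a hypothetical name; $1 is the array of series ids.
PREPARE latest_releases(int[]) AS
SELECT r.*
FROM   unnest($1) t(i)
     , LATERAL (
   SELECT volume  AS releases_volume
        , chapter AS releases_chapter
        , series  AS releases_series
   FROM   releases
   WHERE  series = t.i
   AND    include
   ORDER  BY volume DESC NULLS LAST, chapter DESC NULLS LAST
   LIMIT  1
   ) r;

-- One round trip for the whole page instead of one query per series.
EXECUTE latest_releases('{12, 17, 44, 79, 88, 110, 129}');
```

Note that a series with no qualifying row simply produces no output row, because the comma before LATERAL acts as a cross join.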

The unnest / LATERAL form is typically faster. For best performance you need a matching multicolumn index, such as:

CREATE INDEX releases_series_volume_chapter_idx
ON releases(series, volume DESC NULLS LAST, chapter DESC NULLS LAST);

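If more than a few rows have include set to something other than true, and you are only interested in rows with include = true, consider a partial multicolumn index: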

CREATE INDEX releases_series_volume_chapter_idx
ON releases(series, volume DESC NULLS LAST, chapter DESC NULLS LAST)
WHERE include;
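
Two practical points, sketched under assumptions: index names must be unique within a schema, so if the full index already exists the partial one needs a different name (the name below is hypothetical), and the planner can only consider a partial index when the query's WHERE clause implies the index predicate. EXPLAIN shows which index, if any, is actually chosen on your data.

```sql
-- Hypothetical name, chosen only to avoid clashing with an existing full index.
CREATE INDEX releases_series_volume_chapter_include_idx
ON releases(series, volume DESC NULLS LAST, chapter DESC NULLS LAST)
WHERE include;

-- The "AND include" condition matches the index predicate, and the ORDER BY
-- matches the index sort order, so the partial index is a candidate here.
EXPLAIN
SELECT volume, chapter, series
FROM   releases
WHERE  series = 741
AND    include
ORDER  BY volume DESC NULLS LAST, chapter DESC NULLS LAST
LIMIT  1;
```

Whether the plan uses the index depends on table size and statistics, so treat the EXPLAIN output as something to check rather than a guaranteed outcome.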