Speeding up a GROUP BY, HAVING COUNT query

SGr*_*SGr 3 postgresql performance index count postgresql-performance

I am trying to speed up this query in Postgres 9.4:

SELECT "groupingsFrameHash", COUNT(*) AS nb
FROM "public"."zrac_c1e350bb-a7fc-4f6b-9f49-92dfd1873876"
GROUP BY "groupingsFrameHash"
HAVING COUNT(*) > 1
ORDER BY nb DESC LIMIT 10

I have an index on "groupingsFrameHash". I don't need exact results; a fuzzy approximation is good enough.

This is the query plan:

Limit  (cost=17207.03..17207.05 rows=10 width=25) (actual time=740.056..740.058 rows=10 loops=1)
  ->  Sort  (cost=17207.03..17318.19 rows=44463 width=25) (actual time=740.054..740.055 rows=10 loops=1)
        Sort Key: (count(*))
        Sort Method: top-N heapsort  Memory: 25kB
        ->  GroupAggregate  (cost=14725.95..16246.20 rows=44463 width=25) (actual time=615.109..734.740 rows=25977 loops=1)
              Group Key: "groupingsFrameHash"
              Filter: (count(*) > 1)
              Rows Removed by Filter: 24259
              ->  Sort  (cost=14725.95..14967.07 rows=96446 width=25) (actual time=615.093..705.507 rows=96026 loops=1)
                    Sort Key: "groupingsFrameHash"
                    Sort Method: external merge  Disk: 3280kB
                    ->  Seq Scan on "zrac_c1e350bb-a7fc-4f6b-9f49-92dfd1873876"  (cost=0.00..4431.46 rows=96446 width=25) (actual time=0.007..33.813 rows=96026 loops=1)
Planning time: 0.080 ms
Execution time: 740.877 ms

I don't understand why it needs to do a Seq Scan.

Erw*_*ter 7

You want the 10 most common values in "groupingsFrameHash" with their respective counts (excluding unique values) - a common task. This specification caught my attention, though:

a fuzzy approximation is good enough

That allows for a fundamentally faster solution. Postgres happens to store exactly those approximations in the system catalogs: the total count in pg_class and the most common values in pg_statistic. The manual on the nature of these numbers:

Entries are created by ANALYZE and subsequently used by the query planner. Note that all statistical data is inherently approximate, even assuming that it is up to date.

You have been warned.

Also consider the chapter Statistics Used by the Planner in the manual.
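For a quick look at the raw numbers this answer builds on, you can query pg_class and pg_stats directly. A minimal sketch, using the table and column names from the question:

SELECT reltuples::bigint AS approx_total_rows   -- planner's approximate row count
FROM   pg_class
WHERE  oid = 'public.zrac_c1e350bb-a7fc-4f6b-9f49-92dfd1873876'::regclass;

SELECT most_common_vals, most_common_freqs      -- collected by ANALYZE
FROM   pg_stats
WHERE  schemaname = 'public'
AND    tablename  = 'zrac_c1e350bb-a7fc-4f6b-9f49-92dfd1873876'
AND    attname    = 'groupingsFrameHash';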

If you have autovacuum set up properly and the contents of your table do not change too much, these estimates should be good. If you run this query right after making substantial changes to the table (so autovacuum has not had a chance to kick in), run ANALYZE first (or better, VACUUM ANALYZE if you can spare the time). You can also fine-tune precision, but that is beyond the scope of this question ...
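A minimal sketch of that step, for the table from the question:

ANALYZE "public"."zrac_c1e350bb-a7fc-4f6b-9f49-92dfd1873876";
-- or, if you can spare the time:
VACUUM ANALYZE "public"."zrac_c1e350bb-a7fc-4f6b-9f49-92dfd1873876";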

A word on security. Quoting the manual again:

pg_statistic should not be readable by the public, since even statistical information about a table's contents might be considered sensitive. (Example: minimum and maximum values of a salary column might be quite interesting.) pg_stats is a publicly readable view on pg_statistic that only exposes information about those tables that are readable by the current user.

With all of this in mind, you can get a fast estimate:

SELECT v."groupingsFrameHash", (c.reltuples * freq)::int AS estimate_ct
FROM   pg_stats s
CROSS  JOIN LATERAL
   unnest(s.most_common_vals::text::text[]  -- use your actual data type
        , s.most_common_freqs) WITH ORDINALITY v ("groupingsFrameHash", freq, ord)
CROSS  JOIN (
   SELECT reltuples FROM pg_class
   WHERE oid = regclass 'public.zrac_c1e350bb-a7fc-4f6b-9f49-92dfd1873876'
   ) c
WHERE  schemaname = 'public'
AND    tablename  = 'zrac_c1e350bb-a7fc-4f6b-9f49-92dfd1873876'
AND    attname    = 'groupingsFrameHash'  -- case sensitive
ORDER  BY v.ord
LIMIT  10;

There are a couple of noteworthy features in this query:

  • Provide all identifier strings unescaped and case-sensitive.

  • unnest() for multiple arrays requires Postgres 9.4 or later.

  • pg_stats.most_common_vals is a special column with the data pseudo-type anyarray (not available in user tables). It can store arrays of any type. To decompose, cast to text and then to the array type of your column type. Assuming text[] in the example:

    s.most_common_vals::text::text[]
    

    Replace with your actual data type.

  • I added WITH ORDINALITY to unnest() (Postgres 9.4 or later) to preserve the original order of elements. Since the numbers in the arrays are ordered by descending frequency, we can work with that sort order right away.

This takes around 1 ms or less - no matter how many rows there are in your table.

Experimental optimizations

If you still need to squeeze out more performance and you have superuser access, you could use pg_statistic directly:

SELECT v."groupingsFrameHash", (c.reltuples * freq)::int AS estimate_ct
FROM   pg_attribute a 
JOIN   pg_class     c ON  c.oid = a.attrelid
JOIN   pg_statistic s ON  s.starelid = a.attrelid
                      AND s.staattnum = a.attnum
     , unnest(s.stavalues1::text::text[]
            , s.stanumbers1) WITH ORDINALITY v ("groupingsFrameHash", freq, ord)
WHERE  a.attrelid = regclass 'public.zrac_c1e350bb-a7fc-4f6b-9f49-92dfd1873876'
AND    a.attname  = 'groupingsFrameHash'
ORDER  BY v.ord
LIMIT  10;

As we are getting closer to the core of Postgres, you need to know what you are doing. We are relying on implementation details that may change across major Postgres versions (though that is unlikely). Read the details about pg_statistic in the manual and the comments in the source code.

To squeeze out the last drop, you could even hard-code the attribute number of your column (which changes if you change the position of the column in your table!) and rely on the order of rows returned by unnest(), which normally works:

SELECT v."groupingsFrameHash", (c.reltuples * freq)::int AS estimate_ct
FROM   pg_class     c
JOIN   pg_statistic s ON s.starelid = c.oid
     , unnest(s.stavalues1::text::text[], s.stanumbers1) v("groupingsFrameHash", freq)    
WHERE  c.oid = regclass 'public.zrac_c1e350bb-a7fc-4f6b-9f49-92dfd1873876'
AND    s.staattnum = int2 '6'  -- hard-coded pg_attribute.attnum
LIMIT  10;
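To find the attribute number you would hard-code, look it up in pg_attribute first. A small helper query, assuming the column name from the question:

SELECT attnum
FROM   pg_attribute
WHERE  attrelid = 'public.zrac_c1e350bb-a7fc-4f6b-9f49-92dfd1873876'::regclass
AND    attname  = 'groupingsFrameHash';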

Get your own estimates

With the new TABLESAMPLE feature in Postgres 9.5 you can base your aggregates on a (more or less) random sample of the table:

SELECT birthday, 10 * count(*) AS estimate
FROM   big
TABLESAMPLE SYSTEM (10)
GROUP  BY 1
ORDER  BY estimate DESC
LIMIT  10;
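Adapted to the table from the question, the same idea might look like the following sketch. SYSTEM (10) reads roughly 10 % of the table's blocks, so counts are scaled by 10; the result is only approximate:

SELECT "groupingsFrameHash", 10 * count(*) AS estimate
FROM   "public"."zrac_c1e350bb-a7fc-4f6b-9f49-92dfd1873876"
TABLESAMPLE SYSTEM (10)
GROUP  BY 1
ORDER  BY estimate DESC
LIMIT  10;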


Exact counts

If you need exact counts, the best query depends on data distribution and value frequencies. Emulating a loose index scan (as @Mihai commented) can very well improve performance - only in a limited fashion, though (as @ypercube commented), since you need to consider all distinct values for your sort order. For relatively few distinct values the technique still pays, but for your example with ~ 25k distinct values in a table of ~ 100k rows the chances are slim. A sketch of the emulation follows below.
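A minimal sketch of such an emulation, assuming a btree index on "groupingsFrameHash": a recursive CTE walks the distinct values in index order, and a LATERAL subquery counts each one. Whether this beats the plain aggregate depends entirely on the number of distinct values:

WITH RECURSIVE d AS (
   (  -- smallest value to start the walk
   SELECT "groupingsFrameHash" AS h
   FROM   "public"."zrac_c1e350bb-a7fc-4f6b-9f49-92dfd1873876"
   ORDER  BY 1
   LIMIT  1
   )
   UNION ALL
   SELECT (  -- next distinct value, NULL when exhausted
      SELECT t."groupingsFrameHash"
      FROM   "public"."zrac_c1e350bb-a7fc-4f6b-9f49-92dfd1873876" t
      WHERE  t."groupingsFrameHash" > d.h
      ORDER  BY 1
      LIMIT  1)
   FROM   d
   WHERE  d.h IS NOT NULL
   )
SELECT d.h AS "groupingsFrameHash", c.nb
FROM   d
CROSS  JOIN LATERAL (  -- count rows per distinct value
   SELECT count(*) AS nb
   FROM   "public"."zrac_c1e350bb-a7fc-4f6b-9f49-92dfd1873876" t
   WHERE  t."groupingsFrameHash" = d.h
   ) c
WHERE  d.h IS NOT NULL
AND    c.nb > 1
ORDER  BY c.nb DESC
LIMIT  10;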

But first you probably need to tune your cost settings. Using SET LOCAL enable_seqscan = off; is primarily meant for debugging problems. Using it in your transaction is a measure of last resort. It may seem to fix your problem at hand, but can bite you later.
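If you do use it for debugging, keep it confined to a single transaction so the setting cannot leak into other queries. A sketch, reusing the query from the question:

BEGIN;
SET LOCAL enable_seqscan = off;  -- debugging only, see the caveat above
EXPLAIN ANALYZE
SELECT "groupingsFrameHash", count(*) AS nb
FROM   "public"."zrac_c1e350bb-a7fc-4f6b-9f49-92dfd1873876"
GROUP  BY "groupingsFrameHash"
HAVING count(*) > 1
ORDER  BY nb DESC
LIMIT  10;
ROLLBACK;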

Rather fix the underlying problem. My educated guess is that your setting for random_page_cost is unrealistically high. If most of your database (or at least most of the relevant parts) fit into available cache, the default setting of 4.0 is typically much too high. Depending on the complete picture it can be as low as 1.1 or even 1.0.
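To experiment, lower the setting in your session first and, only if it proves out, persist it. A sketch; the value 1.1 and the database name mydb are placeholders for your setup:

SET random_page_cost = 1.1;                      -- test in the current session only
ALTER DATABASE mydb SET random_page_cost = 1.1;  -- persist for one database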

The fact that Postgres incorrectly estimates a sequential scan to be faster, while using the index is actually ten times faster, is a typical indicator of such a misconfiguration.