大表的慢窗函数查询

Kyl*_*yle 5 postgresql performance datatypes postgresql-9.4 postgresql-performance

我正在对 PostgreSQL 9.4rc1 上的新数据库设计进行一些性能测试,我看到一些使用窗口函数的查询速度很慢。这是我的表设置:

CREATE TABLE player_stat (
  player_id    VARCHAR(200) NOT NULL,
  stat_id      BIGINT NOT NULL,
  value        BIGINT NOT NULL DEFAULT 0,
  last_updated TIMESTAMP WITH TIME ZONE NOT NULL,
  last_active  TIMESTAMP WITH TIME ZONE DEFAULT NULL,

  CONSTRAINT player_stat_pk PRIMARY KEY (player_id, stat_id),
  CONSTRAINT player_stat_fk1 FOREIGN KEY(stat_id) REFERENCES stat (id)
);
CREATE INDEX player_stat_stat_value_player_desc
  ON player_stat (stat_id, value DESC, player_id ASC);
Run Code Online (Sandbox Code Playgroud)

我在这个表中插入了 3000 万行,分为 3 个统计数据:

INSERT INTO player_stat (player_id, stat_id, value, last_updated) SELECT x.id, 1, x.v, now() FROM (SELECT generate_series(1,10000000) as id, trunc(random() * (1900-1200) + 1200) as v) AS x;
INSERT INTO player_stat (player_id, stat_id, value, last_updated) SELECT x.id, 2, x.v, now() FROM (SELECT generate_series(1,10000000) as id, trunc(random() * (1900-1200) + 1200) as v) AS x;
INSERT INTO player_stat (player_id, stat_id, value, last_updated) SELECT x.id, 3, x.v, now() FROM (SELECT generate_series(1,10000000) as id, trunc(random() * (1900-1200) + 1200) as v) AS x;
Run Code Online (Sandbox Code Playgroud)

然后我尝试根据给定的统计数据(EDIT)对球员进行排名:

SELECT * FROM 
( SELECT player_id
       , rank() OVER (ORDER BY value DESC, player_id ASC)  as rank 
  FROM player_stat 
  WHERE stat_id = 1
) as t 
WHERE rank <= 20 
ORDER BY rank ASC;
Run Code Online (Sandbox Code Playgroud)

此查询大约需要 5.5 秒才能返回。在其上运行解释会返回以下内容:

"Sort  (cost=1167612.28..1176082.26 rows=3387993 width=15) (actual time=9726.132..9726.135 rows=20 loops=1)"
"  Sort Key: t.rank"
"  Sort Method: quicksort  Memory: 25kB"
"  ->  Subquery Scan on t  (cost=0.56..684349.57 rows=3387993 width=15) (actual time=0.080..9726.116 rows=20 loops=1)"
"        Filter: (t.rank <= 20)"
"        Rows Removed by Filter: 9999980"
"        ->  WindowAgg  (cost=0.56..557299.83 rows=10163979 width=15) (actual time=0.077..8351.124 rows=10000000 loops=1)"
"              ->  Index Only Scan using player_stat_stat_value_player_desc on player_stat  (cost=0.56..379430.20 rows=10163979 width=15) (actual time=0.054..2319.007 rows=10000000 loops=1)"
"                    Index Cond: (stat_id = 1)"
"                    Heap Fetches: 0"
"Planning time: 0.187 ms"
"Execution time: 9726.172 ms"
Run Code Online (Sandbox Code Playgroud)

有什么办法可以加快速度吗?花费的时间似乎随着表中玩家的数量呈线性增长。

Erw*_*ter 6

有什么办法可以加快速度吗?

是的。不要varcharinteger数字使用列。使用integer或者bigint如果您刻录那么多 ID - 表和索引中要小得多,处理速度也更快。由于您在测试中对1000 万行进行排名,这将产生重大影响。

player_id VARCHAR(200) NOT NULL,
player_id int NOT NULL,

或者uuid如果你必须(我怀疑):

您的查询对 1000 万行进行排名。这将需要一些时间,即使直接从索引读取并且没有排序步骤也是如此。

附注:如果您将大量插入行第一,并添加索引和PK约束(和FK约束)后,这将是快,再加上你不膨胀完美指标,但不运行REINDEXVACUUM FULLANALYZE不过,在测试性能之前,请确保已在桌面上运行。

你没有问的

..但是,在这里出去玩,可能正在寻找什么。

EXPLAIN输出表明您筛选前20行:(t.rank <= 20)您提出的查询没有显示。实际匹配您的EXPLAIN输出的查询将是:

SELECT * FROM (
   SELECT player_id
        , rank() OVER (ORDER BY value DESC, player_id ASC) AS rank
   FROM   player_stat
   WHERE  stat_id = 1
   ) t
WHERE t.rank <= 20;
Run Code Online (Sandbox Code Playgroud)

可以显着改进:

SELECT row_number() OVER (ORDER BY value DESC, player_id ASC) AS rank
     , player_id
FROM   player_stat
WHERE  stat_id = 1
ORDER  BY value DESC, player_id
LIMIT  20;
Run Code Online (Sandbox Code Playgroud)

解释

  • 性能的重要部分是LIMITORDER BY匹配索引相结合的子句:现在查询从顶部到索引正好读取 20 行,在原始版本中它必须读取 10000000。我们只使用player_idand value,所以我们仍然可以进行仅索引扫描。剩下的就是花生了。

  • 这完全是由于查询中的事件序列SELECT:窗口函数 LIMIT. 仅当排序顺序一致时,我们才不必考虑其余适用的 10000000 行。

  • 我们可以使用,LIMIT 20因为前 20 个排名保证跨越不超过 20 行。PK on(player_id, stat_id)保证player_id每个唯一,stat_id并且因为它包含在 中ORDER BY,所以每个等级只分配一次 - 这也意味着我们可以使用稍微便宜的row_number()代替。