In PostgreSQL 9.2, I have a table of items rated by users:
   id   | userid | itemid |    rating     |      timestamp      |       update_time
--------+--------+--------+---------------+---------------------+-------------------------
 522241 |   3991 |   6887 |  0.1111111111 | 2005-06-20 03:13:56 | 2013-10-11 17:50:24.545
 522242 |   3991 |   6934 |  0.1111111111 | 2005-04-05 02:25:21 | 2013-10-11 17:50:24.545
 522243 |   3991 |   6936 | -0.1111111111 | 2005-03-31 03:17:25 | 2013-10-11 17:50:24.545
 522244 |   3991 |   6942 | -0.3333333333 | 2005-03-24 04:38:02 | 2013-10-11 17:50:24.545
 522245 |   3991 |   6951 | -0.5555555556 | 2005-06-20 03:15:35 | 2013-10-11 17:50:24.545
    ... |    ... |    ... |           ... |                 ... | ...
I would like to perform a very simple query: for each user, count how many ratings they have in the database. I'm using the following straightforward approach:
SELECT userid, COUNT(*) AS rcount
FROM ratings
GROUP BY userid
The table contains 10M records. The query takes... well, about 2 or 3 minutes. Honestly, I'm not happy with that, and I don't believe 10M rows is so much that a query should take that long. (Or is it...??)
So I asked PostgreSQL to show me the execution plan:
EXPLAIN SELECT userid, COUNT(*) AS rcount
FROM ratings
GROUP BY userid
This results in:
GroupAggregate  (cost=1756177.54..1831423.30 rows=24535 width=5)
  ->  Sort  (cost=1756177.54..1781177.68 rows=10000054 width=5)
        Sort Key: userid
        ->  Seq Scan on ratings  (cost=0.00..183334.54 rows=10000054 width=5)
I read this as follows: first, the whole table is read from disk (seq scan). Second, it is sorted by userid in n*log(n) time (sort). Finally, the sorted table is read row by row and aggregated in linear time. Well, not quite the optimal algorithm in my view; if I were implementing it myself, I would use a hash table and build the result in a single pass. Never mind.
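For reference, PostgreSQL does have exactly that kind of strategy (a HashAggregate node), but the planner only considers it when it estimates that the whole hash table, one entry per distinct userid, fits into work_mem. A minimal way to inspect the current limit (this check is an aside of mine, not part of the original post):
-- HashAggregate is only considered when the estimated hash table fits into work_mem;
-- EDIT 2 below suggests the effective setting here is a small 1MB.
SHOW work_mem;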
It seems that the sorting by userid is what takes so long, so I added an index:
CREATE INDEX ratings_userid_index ON ratings (userid)
Unfortunately, this did not help and the performance stayed the same. I certainly don't consider myself a power user, and I suspect I'm doing something fundamentally wrong. But this is where I'm stuck. I would appreciate any ideas on how to get this query to run in a reasonable amount of time. One more note: the PostgreSQL worker process uses 100% of a CPU core while the query runs, which suggests that disk access is not the main bottleneck.
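One thing that might be worth trying on 9.2 at this point (a hedged suggestion, not something attempted in the original post): the new index realistically only helps this query if it can be used for an index-only scan, and the planner only favours that when most heap pages are marked all-visible, which VACUUM maintains.
-- Refresh the visibility map and planner statistics; table name as in the post.
VACUUM ANALYZE ratings;
-- Then re-check whether the plan changes:
EXPLAIN SELECT userid, COUNT(*) AS rcount
FROM ratings
GROUP BY userid;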
EDIT
As requested by @a_horse_with_no_name. Wow, this is quite advanced for me:
EXPLAIN (analyze on, buffers on, verbose on)
SELECT userid,COUNT(userid) AS rcount
FROM movielens_10m.ratings
GROUP BY userId
The output:
GroupAggregate  (cost=1756177.54..1831423.30 rows=24535 width=5) (actual time=110666.899..127168.304 rows=69878 loops=1)
  Output: userid, count(userid)
  Buffers: shared hit=906 read=82433, temp read=19358 written=19358
  ->  Sort  (cost=1756177.54..1781177.68 rows=10000054 width=5) (actual time=110666.838..125180.683 rows=10000054 loops=1)
        Output: userid
        Sort Key: ratings.userid
        Sort Method: external merge  Disk: 154840kB
        Buffers: shared hit=906 read=82433, temp read=19358 written=19358
        ->  Seq Scan on movielens_10m.ratings  (cost=0.00..183334.54 rows=10000054 width=5) (actual time=0.019..2889.583 rows=10000054 loops=1)
              Output: userid
              Buffers: shared hit=901 read=82433
Total runtime: 127193.524 ms
编辑2
@a_horse_with_no_name's comment solved the problem: the "Sort Method: external merge Disk: 154840kB" line above shows the sort spilling to disk because it does not fit into work_mem. I'm happy to share my findings:
SET work_mem = '1MB';
EXPLAIN SELECT userid,COUNT(userid) AS rcount
FROM movielens_10m.ratings
GROUP BY userId
produces the same plan as above:
GroupAggregate  (cost=1756177.54..1831423.30 rows=24535 width=5)
  ->  Sort  (cost=1756177.54..1781177.68 rows=10000054 width=5)
        Sort Key: userid
        ->  Seq Scan on ratings  (cost=0.00..183334.54 rows=10000054 width=5)
However,
SET work_mem = '10MB';
EXPLAIN SELECT userid,COUNT(userid) AS rcount
FROM movielens_10m.ratings
GROUP BY userId
gives:
HashAggregate  (cost=233334.81..233580.16 rows=24535 width=5)
  ->  Seq Scan on ratings  (cost=0.00..183334.54 rows=10000054 width=5)
The query now completes in a mere 3.5 seconds.
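If the larger work_mem should survive the current session, it can also be persisted per database or per role (both supported on 9.2); a small sketch, where the database name mydb is a placeholder rather than anything from the post:
-- Placeholder database name; applies to new sessions connecting to that database.
ALTER DATABASE mydb SET work_mem = '10MB';
-- Alternatively, per role (placeholder role name):
-- ALTER ROLE myrole SET work_mem = '10MB';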
Answer from 小智:
Please try the query below, because COUNT(*) and COUNT(userid) can make a big difference:
SELECT userid, COUNT(userid) AS rcount
FROM ratings
GROUP BY userid
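Worth noting alongside this suggestion: the two forms can also return different numbers, because COUNT(userid) skips rows where userid IS NULL while COUNT(*) counts every row. A tiny illustration with made-up values, nothing from the ratings table:
-- COUNT(*) counts all three rows; COUNT(v) ignores the NULL and returns 2.
SELECT COUNT(*) AS all_rows, COUNT(v) AS non_null_values
FROM (VALUES (1), (NULL), (2)) AS t(v);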