tom*_*mka 28 postgresql performance index count group-by
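I'm running PostgreSQL 9.2 and have a relation with 12 columns and about 6,700,000 rows. It contains nodes in 3D space, each one referencing a user (the one who created it). To query which user has created how many nodes, I do the following (EXPLAIN ANALYZE added for more information):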
EXPLAIN ANALYZE SELECT user_id, count(user_id) FROM treenode WHERE project_id=1 GROUP BY user_id;
QUERY PLAN
---------------------------------------------------------------------------------------------------------------------------
HashAggregate (cost=253668.70..253669.07 rows=37 width=8) (actual time=1747.620..1747.623 rows=38 loops=1)
-> Seq Scan on treenode (cost=0.00..220278.79 rows=6677983 width=8) (actual time=0.019..886.803 rows=6677983 loops=1)
Filter: (project_id = 1)
Total runtime: 1747.653 ms
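As you can see, this takes about 1.7 seconds. That isn't too bad considering the amount of data, but I wonder whether it can be improved. I tried adding a BTree index on the user column, but this didn't help in any way.
Do you have alternative suggestions?
For the sake of completeness, this is the complete table definition with all its indexes (without foreign key constraints, references and triggers):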
Column | Type | Modifiers
---------------+--------------------------+------------------------------------------------------
id | bigint | not null default nextval('concept_id_seq'::regclass)
user_id | bigint | not null
creation_time | timestamp with time zone | not null default now()
edition_time | timestamp with time zone | not null default now()
project_id | bigint | not null
location | double3d | not null
reviewer_id | integer | not null default (-1)
review_time | timestamp with time zone |
editor_id | integer |
parent_id | bigint |
radius | double precision | not null default 0
confidence | integer | not null default 5
skeleton_id | bigint |
Indexes:
"treenode_pkey" PRIMARY KEY, btree (id)
"treenode_id_key" UNIQUE CONSTRAINT, btree (id)
"skeleton_id_treenode_index" btree (skeleton_id)
"treenode_editor_index" btree (editor_id)
"treenode_location_x_index" btree (((location).x))
"treenode_location_y_index" btree (((location).y))
"treenode_location_z_index" btree (((location).z))
"treenode_parent_id" btree (parent_id)
"treenode_user_index" btree (user_id)
Edit: This is the result when I use the query (and index) proposed by @ypercube (the query takes about 5.3 seconds without EXPLAIN ANALYZE):
EXPLAIN ANALYZE SELECT u.id, ( SELECT COUNT(*) FROM treenode AS t WHERE t.project_id=1 AND t.user_id = u.id ) AS number_of_nodes FROM auth_user As u;
QUERY PLAN
----------------------------------------------------------------------------------------------------------------------------------------------------------
Seq Scan on auth_user u (cost=0.00..6987937.85 rows=46 width=4) (actual time=29.934..5556.147 rows=46 loops=1)
SubPlan 1
-> Aggregate (cost=151911.65..151911.66 rows=1 width=0) (actual time=120.780..120.780 rows=1 loops=46)
-> Bitmap Heap Scan on treenode t (cost=4634.41..151460.44 rows=180486 width=0) (actual time=13.785..114.021 rows=145174 loops=46)
Recheck Cond: ((project_id = 1) AND (user_id = u.id))
Rows Removed by Index Recheck: 461076
-> Bitmap Index Scan on treenode_user_index (cost=0.00..4589.29 rows=180486 width=0) (actual time=13.082..13.082 rows=145174 loops=46)
Index Cond: ((project_id = 1) AND (user_id = u.id))
Total runtime: 5556.190 ms
(9 rows)
Time: 5556.804 ms
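Edit 2: This is the result when I use an index on (project_id, user_id), as suggested by @erwin-brandstetter, but without any schema optimization yet (the query runs at the same speed as my original query, about 1.5 seconds):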
EXPLAIN ANALYZE SELECT user_id, count(user_id) as ct FROM treenode WHERE project_id=1 GROUP BY user_id;
QUERY PLAN
---------------------------------------------------------------------------------------------------------------------------
HashAggregate (cost=253670.88..253671.24 rows=37 width=8) (actual time=1807.334..1807.339 rows=38 loops=1)
-> Seq Scan on treenode (cost=0.00..220280.62 rows=6678050 width=8) (actual time=0.183..893.491 rows=6678050 loops=1)
Filter: (project_id = 1)
Total runtime: 1807.368 ms
(4 rows)
Erw*_*ter 31
The main problem is the missing index. But there is more.
SELECT user_id, count(*) AS ct
FROM treenode
WHERE project_id = 1
GROUP BY user_id;
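You have many bigint columns. That is probably overkill. Typically, integer is more than enough for columns like project_id and user_id. This would also help with the next item.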
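A minimal sketch of such a type change, assuming every existing value in both columns fits into a 4-byte integer and that a maintenance window is available. Any foreign keys involving these columns (not shown in the question) would presumably need matching types on the other side as well.

```sql
-- Assumption: all existing values fit into integer; otherwise this
-- fails with "integer out of range" and nothing is changed.
-- Changing the type rewrites the table and its indexes while holding
-- an ACCESS EXCLUSIVE lock, so readers and writers are blocked meanwhile.
ALTER TABLE treenode
  ALTER COLUMN project_id TYPE integer,
  ALTER COLUMN user_id    TYPE integer;
```

Combining both columns in one ALTER TABLE statement means the table is rewritten only once.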
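While optimizing the table definition, consider this related answer (linked in the original post), with an emphasis on data alignment and padding. Most of the rest of it applies, too.
The elephant in the room: there is no index on project_id. Create one. This is more important than the rest of this answer.
While you are at it, make it a multicolumn index: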
CREATE INDEX treenode_project_id_user_id_index ON treenode (project_id, user_id);
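If you followed my advice, integer would be perfect here: two 4-byte integer columns take 8 bytes of key data per index entry, half of what two bigint columns need.
user_id is defined NOT NULL, so count(user_id) is equivalent to count(*), but the latter is a bit shorter and faster. (In this specific query, the two would return the same number for every non-NULL user_id group even if user_id were not defined NOT NULL; only the NULL group itself would differ.)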
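To illustrate the difference between the two count forms, here is a small self-contained sketch. The inline VALUES list is made-up data standing in for treenode.user_id and deliberately contains one NULL, which the real column cannot hold.

```sql
-- Hypothetical sample data: two rows for user 1, one for user 2, one NULL.
SELECT user_id,
       count(*)       AS ct_star,  -- counts every row in the group
       count(user_id) AS ct_col    -- counts only rows with a non-NULL user_id
FROM  (VALUES (1), (1), (2), (NULL::int)) AS t(user_id)
GROUP  BY user_id
ORDER  BY user_id;
```

Both counts agree for users 1 and 2. They differ only in the NULL group, where count(*) returns 1 and count(user_id) returns 0. With a NOT NULL column that group cannot exist, so count(*) is safe to use.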
id is already the primary key, so the additional UNIQUE constraint is useless ballast. Drop it:
"treenode_pkey" PRIMARY KEY, btree (id)
"treenode_id_key" UNIQUE CONSTRAINT, btree (id)
Aside: I would not use id as a column name. Use something descriptive like treenode_id.
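Dropping the redundant UNIQUE constraint mentioned above could look like the following sketch. The constraint name is taken from the table definition in the question. The statement would presumably be rejected if a foreign key in another table happens to depend on that particular constraint, so check for that first.

```sql
-- Removes the duplicate unique index on id; the primary key
-- (treenode_pkey) keeps enforcing uniqueness.
ALTER TABLE treenode DROP CONSTRAINT treenode_id_key;
```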
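Q: How many different project_id and user_id?
A: Not more than five different project_id.
That means Postgres has to read about 20% of the whole table to satisfy your query. Unless it can use an index-only scan, a sequential scan on the table will be faster than involving any indexes. There is no more performance to gain here, except by optimizing your table and server settings.
As for the index-only scan: to see how effective it can be, run VACUUM ANALYZE when you can afford it (plain VACUUM does not block normal reads and writes; only VACUUM FULL locks the table exclusively). Then try your query again. It should now be moderately faster, using only the index. Read this related answer (linked in the original post) first, as well as the manual page added with Postgres 9.6 and the Postgres Wiki on index-only scans.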
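A sketch of that check, assuming the multicolumn index treenode_project_id_user_id_index from above has already been created:

```sql
-- Refresh the visibility map and the planner statistics for the table.
VACUUM ANALYZE treenode;

-- Re-run the query and inspect the plan that is actually chosen.
EXPLAIN ANALYZE
SELECT user_id, count(*) AS ct
FROM   treenode
WHERE  project_id = 1
GROUP  BY user_id;
```

If the planner picks an index-only scan, the plan should contain a node like "Index Only Scan using treenode_project_id_user_id_index". A "Heap Fetches" figure far below the number of rows scanned suggests the visibility map is doing its job. If the plan still shows a Seq Scan, the planner has estimated that reading the table directly is cheaper.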
I would first add an index on (project_id, user_id) and then, in version 9.3, try this query:
SELECT u.user_id, c.number_of_nodes
FROM users AS u
, LATERAL
( SELECT COUNT(*) AS number_of_nodes
FROM treenode AS t
WHERE t.project_id = 1
AND t.user_id = u.user_id
) c
-- WHERE c.number_of_nodes > 0 ; -- you probably want this as well
-- to show only relevant users
In 9.2, try this one:
SELECT u.user_id,
( SELECT COUNT(*)
FROM treenode AS t
WHERE t.project_id = 1
AND t.user_id = u.user_id
) AS number_of_nodes
FROM users AS u ;
I assume you have a users table. If not, replace users with:
(SELECT DISTINCT user_id FROM treenode)
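Put together, the 9.2 variant with that substitution might look like the sketch below. The alias u is reused from the queries above; nothing else is new.

```sql
SELECT u.user_id,
       ( SELECT count(*)
         FROM   treenode AS t
         WHERE  t.project_id = 1
         AND    t.user_id = u.user_id
       ) AS number_of_nodes
FROM  (SELECT DISTINCT user_id FROM treenode) AS u;
```

Keep in mind that the DISTINCT subquery itself has to read every user_id in treenode, which may cancel out much of what the correlated subquery gains. A small, separate users table is likely the better driver for this query.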