Ale*_*xey 6 mysql postgresql count
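I'm benchmarking databases to find the best one for my project, and I found that count(*)
is very slow in PostgreSQL. I can't tell whether this is normal PostgreSQL behavior or whether I'm doing something wrong.
I have a table with ~200M records. MySQL table definition: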
CREATE TABLE t1 (
id int(11) NOT NULL AUTO_INCREMENT,
t2_id int(11) NOT NULL,
....
PRIMARY KEY (id),
KEY index_t2 (t2_id)
) ENGINE=InnoDB DEFAULT CHARSET=utf8;
Query (returns ~30M):
SELECT COUNT(*) FROM t1 WHERE t2_id = 7;
Run time:
MySQL (v5.7.11): 25,797 ms
PostgreSQL (v9.5): 1,222,168 ms
Explain:
MySQL:
*************************** 1. row ***************************
id: 1
select_type: SIMPLE
table: t1
partitions: NULL
type: ref
possible_keys: index_t2
key: index_t2
key_len: 4
ref: const
rows: 59438630
filtered: 100.00
Extra: Using index
1 row in set, 1 warning (0.00 sec)
PostgreSQL:
Aggregate (cost=4469365.02..4469365.03 rows=1 width=0)
-> Bitmap Heap Scan on t1 (cost=715817.34..4382635.74 rows=34691712 width=0)
Recheck Cond: (t2_id = 7)
-> Bitmap Index Scan on index_t2 (cost=0.00..707144.41 rows=34691712 width=0)
Index Cond: (t2_id = 7)
Server: AWS RDS (db.r3.xlarge), vCPU: 4, RAM: 30 GB
Update (2016-09-20):
> explain (analyze, buffers) SELECT COUNT(*) FROM t1 WHERE t2_id = 7;
QUERY PLAN
------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
Aggregate (cost=4469365.02..4469365.03 rows=1 width=4) (actual time=1213456.539..1213456.539 rows=1 loops=1)
Buffers: shared read=2734808
-> Bitmap Heap Scan on t1 (cost=715817.34..4382635.74 rows=34691712 width=4) (actual time=64015.828..1205542.421 rows=31383566 loops=1)
Recheck Cond: (t2_id = 7)
Rows Removed by Index Recheck: 108582028
Heap Blocks: exact=19929 lossy=2606242
Buffers: shared read=2734808
-> Bitmap Index Scan on index_t2 (cost=0.00..707144.41 rows=34691712 width=0) (actual time=64009.598..64009.598 rows=31383566 loops=1)
Index Cond: (t2_id = 7)
Buffers: shared read=108637
Planning time: 0.080 ms
Execution time: 1213456.891 ms
(12 rows)
Time: 1213484.579 ms
Update (2016-09-21):
> explain (analyze, buffers) SELECT t2_id FROM t1 WHERE t2_id = 7;
QUERY PLAN
------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
Bitmap Heap Scan on t1 (cost=715817.34..4382635.74 rows=34691712 width=114) (actual time=59954.834..1234070.436 rows=31383566 loops=1)
Recheck Cond: (t2_id = 7)
Rows Removed by Index Recheck: 108582028
Heap Blocks: exact=19929 lossy=2606242
Buffers: shared hit=4824 read=2729984
-> Bitmap Index Scan on index_t2 (cost=0.00..707144.41 rows=34691712 width=0) (actual time=59948.598..59948.598 rows=31383566 loops=1)
Index Cond: (t2_id = 7)
Buffers: shared hit=4824 read=103813
Planning time: 0.086 ms
Execution time: 1239826.408 ms
(10 rows)
Time: 1239827.053 ms
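The two RDBMSs count rows differently. In InnoDB, the default behavior is:
"To process a SELECT COUNT(*) FROM t statement, InnoDB scans an index of the table, which takes some time if the index is not entirely in the buffer pool."
For Postgres, you may want to check whether an index-only scan (which is closer to the InnoDB behavior) helps in this case. More information here. Given the number of rows and the poor cardinality of this value (almost 15% of the table, according to the statistics), I can't guarantee it will work, but you can try: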
SELECT COUNT(t2_id) FROM t1 WHERE t2_id = 7;
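Two details in the question's plan are worth checking before comparing timings again. `Heap Blocks: exact=19929 lossy=2606242` indicates that the bitmap did not fit in `work_mem` and fell back to page-level entries, which is why 108582028 rows had to be rechecked and discarded. Separately, an index-only scan can skip the heap only for pages marked all-visible in the visibility map, which `VACUUM` maintains. A minimal sketch, assuming you are allowed to run `VACUUM` on the table and that raising `work_mem` for one session is acceptable on this instance (the `256MB` figure is an illustrative guess, not a tuned recommendation):

```sql
-- Refresh the visibility map and planner statistics so that an
-- index-only scan on index_t2 becomes a realistic option.
VACUUM ANALYZE t1;

-- Assumption: 256MB is only an illustrative value. SET affects the
-- current session only and gives the bitmap more room before it turns lossy.
SET work_mem = '256MB';

-- Re-check the plan. Look for "Index Only Scan" with a low "Heap Fetches"
-- count, or at least a much smaller "lossy" figure under "Heap Blocks".
EXPLAIN (ANALYZE, BUFFERS)
SELECT COUNT(*) FROM t1 WHERE t2_id = 7;
```

If the plan still shows a Bitmap Heap Scan with mostly lossy blocks, the query is probably I/O-bound: it has to read a large share of the table, and neither setting changes that.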