MySQL vs. PostgreSQL: benchmarking COUNT(*) execution speed

Ale*_*xey 6 mysql postgresql count

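I am benchmarking databases to find the best fit for my project, and I found that count(*) is very slow in PostgreSQL. I can't tell whether this is normal PostgreSQL behavior or whether I am doing something wrong.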

I have a table with ~200M records. The MySQL table definition:

CREATE TABLE t1 (
  id int(11) NOT NULL AUTO_INCREMENT,
  t2_id int(11) NOT NULL,
....  
  PRIMARY KEY (id),
  KEY index_t2 (t2_id)
) ENGINE=InnoDB DEFAULT CHARSET=utf8;

Query (returns ~30M):

SELECT COUNT(*) FROM t1 WHERE t2_id = 7;

Run times:

25,797ms MySQL (v5.7.11)

1,222,168ms PostgreSQL (v9.5)

EXPLAIN:

MySQL:

*************************** 1. row ***************************
           id: 1
  select_type: SIMPLE
        table: t1
   partitions: NULL
         type: ref
possible_keys: index_t2
          key: index_t2
      key_len: 4
          ref: const
         rows: 59438630
     filtered: 100.00
        Extra: Using index
1 row in set, 1 warning (0.00 sec)

PostgreSQL:

Aggregate  (cost=4469365.02..4469365.03 rows=1 width=0)
 ->  Bitmap Heap Scan on t1  (cost=715817.34..4382635.74 rows=34691712 width=0)
       Recheck Cond: (t2_id = 7)
       ->  Bitmap Index Scan on index_t2  (cost=0.00..707144.41 rows=34691712 width=0)
             Index Cond: (t2_id = 7)

Server: AWS RDS (db.r3.xlarge), vCPU: 4, RAM: 30 GB

Update (2016-09-20):

> explain (analyze, buffers) SELECT COUNT(*) FROM t1 WHERE t2_id = 7;

QUERY PLAN                                                                                     
------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
 Aggregate  (cost=4469365.02..4469365.03 rows=1 width=4) (actual time=1213456.539..1213456.539 rows=1 loops=1)
   Buffers: shared read=2734808
   ->  Bitmap Heap Scan on t1  (cost=715817.34..4382635.74 rows=34691712 width=4) (actual time=64015.828..1205542.421 rows=31383566 loops=1)
         Recheck Cond: (t2_id = 7)
         Rows Removed by Index Recheck: 108582028
         Heap Blocks: exact=19929 lossy=2606242
         Buffers: shared read=2734808
         ->  Bitmap Index Scan on index_t2  (cost=0.00..707144.41 rows=34691712 width=0) (actual time=64009.598..64009.598 rows=31383566 loops=1)
               Index Cond: (t2_id = 7)
               Buffers: shared read=108637
 Planning time: 0.080 ms
 Execution time: 1213456.891 ms
(12 rows)

Time: 1213484.579 ms

Update (2016-09-21):

> explain (analyze, buffers) SELECT t2_id FROM t1 WHERE t2_id = 7;
                                                                                  QUERY PLAN                                                                                  
------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
 Bitmap Heap Scan on t1  (cost=715817.34..4382635.74 rows=34691712 width=114) (actual time=59954.834..1234070.436 rows=31383566 loops=1)
   Recheck Cond: (t2_id = 7)
   Rows Removed by Index Recheck: 108582028
   Heap Blocks: exact=19929 lossy=2606242
   Buffers: shared hit=4824 read=2729984
   ->  Bitmap Index Scan on index_t2  (cost=0.00..707144.41 rows=34691712 width=0) (actual time=59948.598..59948.598 rows=31383566 loops=1)
         Index Cond: (t2_id = 7)
         Buffers: shared hit=4824 read=103813
 Planning time: 0.086 ms
 Execution time: 1239826.408 ms
(10 rows)

Time: 1239827.053 ms

3ma*_*uek 5

The two RDBMSs count rows differently. In InnoDB, the default behavior is as follows:

To process a SELECT COUNT(*) FROM t statement, InnoDB scans an index of the table, which takes some time if the index is not entirely in the buffer pool.

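For Postgres, you may want to see whether an index-only scan (which is closer to the InnoDB behavior) helps with this. More information is available here. Given the number of rows and the poor cardinality of that value (almost 15% of the table, according to the statistics), I can't guarantee it will work, but you can try: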

SELECT COUNT(t2_id) FROM t1 WHERE t2_id = 7; 
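
To check whether Postgres actually chooses an index-only scan for this query, something like the following may help. It is only a sketch: the table and index names (t1, index_t2) are taken from the question, and the VACUUM step rests on the fact that index-only scans depend on the visibility map, which VACUUM keeps up to date. The planner may still prefer a bitmap heap scan.

```sql
-- Sketch only: t1 and index_t2 are the names used in the question.
-- Refresh the visibility map and planner statistics first; without an
-- up-to-date visibility map an index-only scan still has to visit the heap.
VACUUM (ANALYZE) t1;

-- Inspect the chosen plan and the actual I/O for the suggested query.
EXPLAIN (ANALYZE, BUFFERS)
SELECT COUNT(t2_id) FROM t1 WHERE t2_id = 7;
```

If the plan shows "Index Only Scan using index_t2", the "Heap Fetches" line in the output indicates how many rows still required a visit to the table.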