How can I improve query speed on a table with 20+ million rows?

Ste*_*n V 8 mysql performance index index-tuning query-performance

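I have a query that fetches internet traffic statistics for certain IP addresses.

There are individual IP addresses, called hosts, and blocks of IPs, called assignments. Data is stored at 5-minute intervals.

The query result is grouped by the time column, and the total SUM of inbound and outbound bytes for each 5-minute interval is used to draw a graph.

The table is called traffic and contains (at the end of the month) about 21 million records.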

SHOW CREATE table traffic:
CREATE TABLE `traffic` (
  `type` enum('v4_assignment','v4_host','v6_subnet','v6_assignment','v6_host') NOT NULL,
  `type_id` int(11) unsigned NOT NULL,
  `time` int(32) unsigned NOT NULL,
  `bytesin` bigint(20) unsigned NOT NULL default '0',
  `bytesout` bigint(20) unsigned NOT NULL default '0',
  KEY `basic_select` (`type_id`,`time`,`type`)
) ENGINE=InnoDB DEFAULT CHARSET=latin1

The query:

SELECT traffic.time, SUM(traffic.bytesin), SUM(traffic.bytesout) FROM traffic 
WHERE (
    ( traffic.type = 'v4_assignment' AND type_id IN (231, between 20 to 100 ids,265)) OR 
    ( traffic.type = 'v4_host' AND type_id IN (131, ... a lot of ids... ,1506))) 
    AND traffic.time >= 1343772000 AND traffic.time < 1346450399 
GROUP BY traffic.time
ORDER BY traffic.time;

Here is the EXPLAIN output for the query above:

+----+-------------+---------+-------+---------------+--------------+---------+------+--------+----------------------------------------------+
| id | select_type | table   | type  | possible_keys | key          | key_len | ref  | rows   | Extra                                        |
+----+-------------+---------+-------+---------------+--------------+---------+------+--------+----------------------------------------------+
|  1 | SIMPLE      | traffic | range | basic_select  | basic_select | 8       | NULL | 891319 | Using where; Using temporary; Using filesort |
+----+-------------+---------+-------+---------------+--------------+---------+------+--------+----------------------------------------------+

show indexes from traffic;
+---------+------------+--------------+--------------+-------------+-----------+-------------+----------+--------+------+------------+---------+
| Table   | Non_unique | Key_name     | Seq_in_index | Column_name | Collation | Cardinality | Sub_part | Packed | Null | Index_type | Comment |
+---------+------------+--------------+--------------+-------------+-----------+-------------+----------+--------+------+------------+---------+
| traffic |          1 | basic_select |            1 | type_id     | A         |       13835 |     NULL | NULL   |      | BTREE      |         |
| traffic |          1 | basic_select |            2 | time        | A         |    18470357 |     NULL | NULL   |      | BTREE      |         |
| traffic |          1 | basic_select |            3 | type        | A         |    18470357 |     NULL | NULL   |      | BTREE      |         |
+---------+------------+--------------+--------------+-------------+-----------+-------------+----------+--------+------+------------+---------+

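This query takes anywhere from 30 seconds to 30 minutes to complete. I hope I can improve things with a better index or a different query, but I can't figure it out.

Update:

As suggested by helpful commenters, I created a primary key and added an index traffic_pk (time, type, type_id, id). Unfortunately, the cardinality of this new index turned out to be equal to or lower than that of my original index (basic_select), and MySQL still uses my original key.

Update 2: I dropped the original index basic_select, and now EXPLAIN shows a higher rows value but fewer steps in the Extra field. Query execution time also dropped to under a minute! (Still a bit too slow, but a major improvement!)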

mysql> SHOW CREATE TABLE traffic_test \G;
*************************** 1. row ***************************
       Table: traffic_test
Create Table: CREATE TABLE `traffic_test` (
  `traffic_id` int(10) unsigned NOT NULL AUTO_INCREMENT,
  `type` enum('v4_assignment','v4_host','v6_subnet','v6_assignment','v6_host') NOT NULL,
  `type_id` int(11) unsigned NOT NULL,
  `time` int(32) unsigned NOT NULL,
  `bytesin` bigint(20) unsigned NOT NULL DEFAULT '0',
  `bytesout` bigint(20) unsigned NOT NULL DEFAULT '0',
  PRIMARY KEY (`time`,`type`,`type_id`,`traffic_id`),
  KEY `traffic_id_IDX` (`traffic_id`)
) ENGINE=InnoDB AUTO_INCREMENT=24545159 DEFAULT CHARSET=latin1

Indexes on the table:

mysql> SHOW INDEX FROM traffic;
+--------------+------------+----------------+--------------+-------------+-----------+-------------+----------+--------+------+------------+---------+
| Table        | Non_unique | Key_name       | Seq_in_index | Column_name | Collation | Cardinality | Sub_part | Packed | Null | Index_type | Comment |
+--------------+------------+----------------+--------------+-------------+-----------+-------------+----------+--------+------+------------+---------+
| traffic_test |          0 | PRIMARY        |            1 | time        | A         |          18 |     NULL | NULL   |      | BTREE      |         |
| traffic_test |          0 | PRIMARY        |            2 | type        | A         |       38412 |     NULL | NULL   |      | BTREE      |         |
| traffic_test |          0 | PRIMARY        |            3 | type_id     | A         |    24545609 |     NULL | NULL   |      | BTREE      |         |
| traffic_test |          0 | PRIMARY        |            4 | traffic_id  | A         |    24545609 |     NULL | NULL   |      | BTREE      |         |
| traffic_test |          1 | traffic_id_IDX |            1 | traffic_id  | A         |    24545609 |     NULL | NULL   |      | BTREE      |         |
+--------------+------------+----------------+--------------+-------------+-----------+-------------+----------+--------+------+------------+---------+

I also simplified the query by not using the OR:

SELECT SQL_NO_CACHE traffic.time, SUM(traffic.bytesin), SUM(traffic.bytesout) 
FROM    traffic
WHERE traffic.type LIKE 'v4_host' AND type_id IN (131,1974,1976,1514,1516,2767,2730,2731,2732,2733,2734,2769,2994,2709,1,4613,4614,4615,4616,326,1520,2652,1518,1521,1522,1523,1524,1525,2203,1515,1513,1467,1508,1973,1510,1975,1511,1475,1476,1468,1469,1470,1471,1472,1473,1500,1507,1478,1480,1481,1482,1483,1484,1485,1479,1486,1487,1488,1489,1490,1491,1495,1499,1494,2269,1474,1519,2204,2976,1922,1493,1492,1497,1496,1498,1501,1502,1503,1526,1509,1506) 
AND traffic.time >= 1342181721 
AND traffic.time < 1343391321 
GROUP BY traffic.time ASC;

Old execution time for this query:

3980 rows in set (6 min 15.27 sec)

New execution time:

3980 rows in set (24.80 sec)

EXPLAIN output:

+----+-------------+---------+-------+---------------+---------+---------+------+----------+-------------+
| id | select_type | table   | type  | possible_keys | key     | key_len | ref  | rows     | Extra       |
+----+-------------+---------+-------+---------------+---------+---------+------+----------+-------------+
|  1 | SIMPLE      | traffic | range | PRIMARY       | PRIMARY | 4       | NULL | 12272804 | Using where |
+----+-------------+---------+-------+---------------+---------+---------+------+----------+-------------+

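The rows value is still high. I think I can improve this by swapping the order of type and type_id in the index, because there are only 4 possible types and many more type_ids.

Is this a correct assumption?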

MTI*_*hai 6

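1. Table partitioning

Because of the [AND traffic.time >= 1343772000 AND traffic.time < 1346450399] clause, I assume that you never delete data from this table, or that the table currently stores data for several months. The values in the [time] column appear to be Unix timestamps (1346450399 = Fri, 31 Aug 2012 21:59:59 GMT). Partition the table on the time column. This will speed up data retrieval because the database will scan only the matching partitions (much faster than scanning the whole table).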
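As a rough sketch of what monthly RANGE partitioning on the time column could look like: the partition names and the second boundary value (1346450400, one second past the upper bound in your query) are assumptions for illustration, and you would add one partition per month that you keep. It also assumes the original traffic table, which has no primary or unique key. If the table has one, MySQL requires the partitioning column to be part of it.

```sql
-- Sketch only: partition names and boundary values are illustrative assumptions.
-- The boundaries follow the local-time month edges used in the question's query.
ALTER TABLE traffic
PARTITION BY RANGE (`time`) (
    PARTITION p_before_2012_08 VALUES LESS THAN (1343772000),
    PARTITION p_2012_08        VALUES LESS THAN (1346450400),
    PARTITION p_future         VALUES LESS THAN MAXVALUE
);

-- Check what a boundary means in your session time zone
-- (the displayed value depends on the session time_zone setting).
SELECT FROM_UNIXTIME(1346450399);

-- Check which partitions the query touches.
-- On MySQL 5.1-5.6 use EXPLAIN PARTITIONS; newer versions show a
-- partitions column in plain EXPLAIN.
EXPLAIN PARTITIONS
SELECT time, SUM(bytesin), SUM(bytesout)
FROM traffic
WHERE time >= 1343772000 AND time < 1346450399
GROUP BY time;
```

If pruning works, the partitions column of the EXPLAIN output should list only the partition(s) covering the requested month.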

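2. Rewrite the query

Because of the OR in your WHERE clause, the optimizer may decide not to use the defined index. Try splitting the query into 2 SELECTs and combining them with UNION.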

SELECT 
    traffic.time, 
    SUM(traffic.bytesin), 
    SUM(traffic.bytesout) 
FROM 
    traffic 
WHERE traffic.type LIKE 'v4_assignment' 
    AND type_id IN (1,2,3,4)
    AND traffic.time >= 1343772000 AND traffic.time <= 1346450399 
GROUP BY 
    traffic.time
UNION
SELECT 
    traffic.time, 
    SUM(traffic.bytesin), 
    SUM(traffic.bytesout) 
FROM 
    traffic 
WHERE traffic.type LIKE 'v4_host' 
    AND type_id IN (5,6,7,8)
    AND traffic.time >= 1343772000 AND traffic.time <= 1346450399 
GROUP BY 
    traffic.time
ORDER BY 
    time
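One caveat with the split form: the two halves are grouped separately, so a time value that has both v4_assignment and v4_host rows comes back as two rows rather than one combined total. A possible variant, sketched here with the same placeholder ID lists, combines the halves with UNION ALL in a derived table and sums again in an outer query. The alias t and the column aliases are names I made up, and the upper bound uses < as in the original question.

```sql
-- Sketch: the ID lists are placeholders; alias names are illustrative.
SELECT
    t.time,
    SUM(t.bytesin)  AS bytesin,
    SUM(t.bytesout) AS bytesout
FROM (
    SELECT time, SUM(bytesin) AS bytesin, SUM(bytesout) AS bytesout
    FROM traffic
    WHERE type = 'v4_assignment'
      AND type_id IN (1,2,3,4)
      AND time >= 1343772000 AND time < 1346450399
    GROUP BY time

    UNION ALL  -- keep both halves; plain UNION would drop identical rows

    SELECT time, SUM(bytesin), SUM(bytesout)
    FROM traffic
    WHERE type = 'v4_host'
      AND type_id IN (5,6,7,8)
      AND time >= 1343772000 AND time < 1346450399
    GROUP BY time
) AS t
GROUP BY t.time
ORDER BY t.time;
```

This should return one row per 5-minute interval, matching the shape of the original OR query.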

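3. New index based on data cardinality

Judging by your EXPLAIN output, I don't see an index being used. Perhaps the optimizer decided that a full table scan would be easier (cheaper) than following the index. Also, in your current index the first column has lower cardinality than the next two. The first column of any index should be the one with the best (highest) cardinality.

Create a new index as follows:

mysql> CREATE INDEX MTIhai_traffic_idx1 ON traffic(time, type, type_id);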
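To see whether the optimizer actually picks the new index, you could compare EXPLAIN output with and without an index hint. This is only a sketch: the ID list is a placeholder, and whether the forced plan is really faster has to be measured on your data.

```sql
-- Sketch: force the new index and inspect the plan; the ID list is a placeholder.
EXPLAIN
SELECT time, SUM(bytesin), SUM(bytesout)
FROM traffic FORCE INDEX (MTIhai_traffic_idx1)
WHERE type = 'v4_host'
  AND type_id IN (5,6,7,8)
  AND time >= 1343772000 AND time < 1346450399
GROUP BY time;
```

If the key column shows MTIhai_traffic_idx1 and the Extra column no longer reports Using temporary; Using filesort, the GROUP BY is being resolved in index order.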