Ste*_*n V 8 mysql performance index index-tuning query-performance
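I have a query that fetches internet traffic statistics for certain IP addresses.
There are individual IP addresses, called hosts, and blocks of IPs, called assignments. Data is stored at 5-minute intervals.
The query results are grouped by the time column, and a graph is plotted from the total SUM of bytes in and out within each 5-minute interval.
The table is called traffic and contains (at the end of the month) about 21 million records.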
SHOW CREATE table traffic:
CREATE TABLE `traffic` (
`type` enum('v4_assignment','v4_host','v6_subnet','v6_assignment','v6_host') NOT NULL,
`type_id` int(11) unsigned NOT NULL,
`time` int(32) unsigned NOT NULL,
`bytesin` bigint(20) unsigned NOT NULL default '0',
`bytesout` bigint(20) unsigned NOT NULL default '0',
KEY `basic_select` (`type_id`,`time`,`type`)
) ENGINE=InnoDB DEFAULT CHARSET=latin1
SELECT traffic.time, SUM(traffic.bytesin), SUM(traffic.bytesout) FROM traffic
WHERE (
( traffic.type = 'v4_assignment' AND type_id IN (231, between 20 to 100 ids,265)) OR
( traffic.type = 'v4_host' AND type_id IN (131, ... a lot of ids... ,1506)))
AND traffic.time >= 1343772000 AND traffic.time < 1346450399
GROUP BY traffic.time
ORDER BY traffic.time;
Below is the EXPLAIN output for the above query:
+----+-------------+---------+-------+---------------+--------------+---------+------+--------+----------------------------------------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+-------------+---------+-------+---------------+--------------+---------+------+--------+----------------------------------------------+
| 1 | SIMPLE | traffic | range | basic_select | basic_select | 8 | NULL | 891319 | Using where; Using temporary; Using filesort |
+----+-------------+---------+-------+---------------+--------------+---------+------+--------+----------------------------------------------+
show indexes from traffic;
+---------+------------+--------------+--------------+-------------+-----------+-------------+----------+--------+------+------------+---------+
| Table | Non_unique | Key_name | Seq_in_index | Column_name | Collation | Cardinality | Sub_part | Packed | Null | Index_type | Comment |
+---------+------------+--------------+--------------+-------------+-----------+-------------+----------+--------+------+------------+---------+
| traffic | 1 | basic_select | 1 | type_id | A | 13835 | NULL | NULL | | BTREE | |
| traffic | 1 | basic_select | 2 | time | A | 18470357 | NULL | NULL | | BTREE | |
| traffic | 1 | basic_select | 3 | type | A | 18470357 | NULL | NULL | | BTREE | |
+---------+------------+--------------+--------------+-------------+-----------+-------------+----------+--------+------+------------+---------+
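This query takes anywhere from 30 seconds to 30 minutes to complete. I am hoping I can improve things with a better index or a different query, but I can't figure it out.
Update:
As suggested by helpful commenters, I created a primary key and added an index traffic_pk (time, type, type_id, id). Unfortunately, the cardinality of this new index turned out to be equal to or lower than that of my original index (basic_select), and MySQL still uses my original key.
Update 2:
I removed the original index basic_select, and EXPLAIN now shows a higher rows value but fewer steps in the Extra field. Query execution time also dropped to under a minute! (Still a bit too slow, but a major improvement.)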
mysql> SHOW CREATE TABLE traffic_test \G;
*************************** 1. row ***************************
Table: traffic_test
Create Table: CREATE TABLE `traffic_test` (
`traffic_id` int(10) unsigned NOT NULL AUTO_INCREMENT,
`type` enum('v4_assignment','v4_host','v6_subnet','v6_assignment','v6_host') NOT NULL,
`type_id` int(11) unsigned NOT NULL,
`time` int(32) unsigned NOT NULL,
`bytesin` bigint(20) unsigned NOT NULL DEFAULT '0',
`bytesout` bigint(20) unsigned NOT NULL DEFAULT '0',
PRIMARY KEY (`time`,`type`,`type_id`,`traffic_id`),
KEY `traffic_id_IDX` (`traffic_id`)
) ENGINE=InnoDB AUTO_INCREMENT=24545159 DEFAULT CHARSET=latin1
Indexes on the table:
mysql> SHOW INDEX FROM traffic;
+--------------+------------+----------------+--------------+-------------+-----------+-------------+----------+--------+------+------------+---------+
| Table | Non_unique | Key_name | Seq_in_index | Column_name | Collation | Cardinality | Sub_part | Packed | Null | Index_type | Comment |
+--------------+------------+----------------+--------------+-------------+-----------+-------------+----------+--------+------+------------+---------+
| traffic_test | 0 | PRIMARY | 1 | time | A | 18 | NULL | NULL | | BTREE | |
| traffic_test | 0 | PRIMARY | 2 | type | A | 38412 | NULL | NULL | | BTREE | |
| traffic_test | 0 | PRIMARY | 3 | type_id | A | 24545609 | NULL | NULL | | BTREE | |
| traffic_test | 0 | PRIMARY | 4 | traffic_id | A | 24545609 | NULL | NULL | | BTREE | |
| traffic_test | 1 | traffic_id_IDX | 1 | traffic_id | A | 24545609 | NULL | NULL | | BTREE | |
+--------------+------------+----------------+--------------+-------------+-----------+-------------+----------+--------+------+------------+---------+
I also simplified the query by dropping the OR:
SELECT SQL_NO_CACHE traffic.time, SUM(traffic.bytesin), SUM(traffic.bytesout)
FROM traffic
WHERE traffic.type LIKE 'v4_host' AND type_id IN (131,1974,1976,1514,1516,2767,2730,2731,2732,2733,2734,2769,2994,2709,1,4613,4614,4615,4616,326,1520,2652,1518,1521,1522,1523,1524,1525,2203,1515,1513,1467,1508,1973,1510,1975,1511,1475,1476,1468,1469,1470,1471,1472,1473,1500,1507,1478,1480,1481,1482,1483,1484,1485,1479,1486,1487,1488,1489,1490,1491,1495,1499,1494,2269,1474,1519,2204,2976,1922,1493,1492,1497,1496,1498,1501,1502,1503,1526,1509,1506)
AND traffic.time >= 1342181721
AND traffic.time < 1343391321
GROUP BY traffic.time ASC;
Old execution time of this query:
3980 rows in set (6 min 15.27 sec)
New execution time:
3980 rows in set (24.80 sec)
EXPLAIN output:
+----+-------------+---------+-------+---------------+---------+---------+------+----------+-------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+-------------+---------+-------+---------------+---------+---------+------+----------+-------------+
| 1 | SIMPLE | traffic | range | PRIMARY | PRIMARY | 4 | NULL | 12272804 | Using where |
+----+-------------+---------+-------+---------------+---------+---------+------+----------+-------------+
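The rows value is still quite high. I think I can improve this by switching the order of type and type_id in the index, since there are only 4 possible types and many more type_ids.
Is this a correct assumption?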
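Given the [AND traffic.time >= 1343772000 AND traffic.time < 1346450399] clause, I assume that you either never delete data from this table or that the table currently stores more than one month of data. The values in the [time] column appear to be unix timestamps (1346450399 = Fri, 31 Aug 2012 21:59:59 GMT). Partition the table on the time column. This will speed up data retrieval, because the database will scan only the relevant partitions (much faster than scanning the whole table).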
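As a rough illustration of the partitioning suggestion, here is a minimal sketch using MySQL range partitioning on the integer time column. The partition names are made up for this example, and the boundaries are simply the month limits taken from the query in the question (1346450400 is the question's upper bound plus one second). It assumes the original traffic table, which has no primary or unique key; MySQL requires every unique key to include the partitioning column, so a table with a unique key that omits time could not be partitioned this way.

```sql
-- Sketch only: partition names are hypothetical; boundaries come from the
-- time range used in the question's query.
ALTER TABLE traffic
PARTITION BY RANGE (`time`) (
    -- everything before the start of the queried month
    PARTITION p_before_2012_08 VALUES LESS THAN (1343772000),
    -- the queried month itself
    PARTITION p_2012_08        VALUES LESS THAN (1346450400),
    -- catch-all for newer rows, to be split as new months arrive
    PARTITION p_future         VALUES LESS THAN MAXVALUE
);
```

Note that this statement rebuilds the table, which can take a while with around 21 million rows, so it is probably best tried on a copy first.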
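Because of the OR in your WHERE clause, the optimizer may choose not to use the defined index. Try splitting the query into 2 SELECTs and combining them with a UNION.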
SELECT
traffic.time,
SUM(traffic.bytesin),
SUM(traffic.bytesout)
FROM
traffic
WHERE traffic.type LIKE 'v4_assignment'
AND type_id IN (1,2,3,4)
AND traffic.time >= 1343772000 AND traffic.time <= 1346450399
GROUP BY
traffic.time
UNION ALL
SELECT
traffic.time,
SUM(traffic.bytesin),
SUM(traffic.bytesout)
FROM
traffic
WHERE traffic.type LIKE 'v4_host'
AND type_id IN (5,6,7,8)
AND traffic.time >= 1343772000 AND traffic.time <= 1346450399
GROUP BY
traffic.time
ORDER BY
time -- unqualified: a UNION-wide ORDER BY cannot reference tbl_name.col_name
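One caveat with the split query: it returns up to two rows per time value (one from each branch), whereas the original query returned a single total per interval. A possible way to get the same shape back is to wrap the two branches in a derived table and sum again. This is only a sketch: the alias t and the column aliases are my own, the IN lists are the same placeholders as above, and the upper bound uses < as in the question rather than <=.

```sql
SELECT
    t.time,
    SUM(t.bytesin)  AS bytesin,
    SUM(t.bytesout) AS bytesout
FROM (
    -- branch 1: assignments (placeholder ids)
    SELECT time, SUM(bytesin) AS bytesin, SUM(bytesout) AS bytesout
    FROM traffic
    WHERE type = 'v4_assignment'
      AND type_id IN (1,2,3,4)
      AND time >= 1343772000 AND time < 1346450399
    GROUP BY time
    -- UNION ALL keeps both partial sums even if they happen to be identical
    UNION ALL
    -- branch 2: hosts (placeholder ids)
    SELECT time, SUM(bytesin) AS bytesin, SUM(bytesout) AS bytesout
    FROM traffic
    WHERE type = 'v4_host'
      AND type_id IN (5,6,7,8)
      AND time >= 1343772000 AND time < 1346450399
    GROUP BY time
) AS t
GROUP BY t.time
ORDER BY t.time;
```

Each branch carries a single equality on type, so each can be planned on its own; the outer GROUP BY only has to merge at most two rows per interval.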
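Based on your EXPLAIN output, I don't see the index being used. Perhaps that is because the optimizer decided that a full table scan would be easier (cheaper) than following the index. Also, in your current index, the first column has a lower cardinality than the next 2 columns. The first column in any index should be the column with the best (highest) cardinality.
Create a new index as:
MYSQL> CREATE INDEX MTIhai_traffic_idx1 ON traffic(time, type, type_id);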
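Since the question's first update reports that MySQL kept choosing basic_select even after a similar index was added, it may help to compare plans explicitly. The sketch below uses the FORCE INDEX hint to make the optimizer consider the new index, so that its EXPLAIN output and timing can be set side by side with the default plan. The IN list is a placeholder, not the real id list.

```sql
-- Sketch: inspect the plan when the new index is forced.
-- Run the same statement without the hint to see the optimizer's own choice.
EXPLAIN
SELECT time, SUM(bytesin), SUM(bytesout)
FROM traffic FORCE INDEX (MTIhai_traffic_idx1)
WHERE type = 'v4_host'
  AND type_id IN (5,6,7,8)          -- placeholder ids
  AND time >= 1343772000 AND time < 1346450399
GROUP BY time;
```

If the forced plan is no faster in practice, that suggests the column order of the index, rather than the optimizer's choice, is the thing to revisit.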