How can I improve query speed on a table with 20 million+ rows?

Question

Ste*_*n V 8 mysql performance index index-tuning query-performance

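I have a query that fetches Internet traffic statistics for certain IP addresses.

There are individual IP addresses, called hosts, and IP blocks, called assignments. The data is stored every 5 minutes.

The query results are grouped by the time column, and a graph is drawn from the total SUM of bytes in and bytes out for each 5-minute interval.

The table is called traffic and contains (at the end of the month) about 21 million records.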

SHOW CREATE table traffic:
CREATE TABLE `traffic` (
  `type` enum('v4_assignment','v4_host','v6_subnet','v6_assignment','v6_host') NOT NULL,
  `type_id` int(11) unsigned NOT NULL,
  `time` int(32) unsigned NOT NULL,
  `bytesin` bigint(20) unsigned NOT NULL default '0',
  `bytesout` bigint(20) unsigned NOT NULL default '0',
  KEY `basic_select` (`type_id`,`time`,`type`)
) ENGINE=InnoDB DEFAULT CHARSET=latin1

SELECT traffic.time, SUM(traffic.bytesin), SUM(traffic.bytesout) FROM traffic 
WHERE (
    ( traffic.type = 'v4_assignment' AND type_id IN (231, between 20 to 100 ids,265)) OR 
    ( traffic.type = 'v4_host' AND type_id IN (131, ... a lot of ids... ,1506))) 
    AND traffic.time >= 1343772000 AND traffic.time < 1346450399 
GROUP BY traffic.time
ORDER BY traffic.time;

Here is the EXPLAIN output for the query above:

+----+-------------+---------+-------+---------------+--------------+---------+------+--------+----------------------------------------------+
| id | select_type | table   | type  | possible_keys | key          | key_len | ref  | rows   | Extra                                        |
+----+-------------+---------+-------+---------------+--------------+---------+------+--------+----------------------------------------------+
|  1 | SIMPLE      | traffic | range | basic_select  | basic_select | 8       | NULL | 891319 | Using where; Using temporary; Using filesort |
+----+-------------+---------+-------+---------------+--------------+---------+------+--------+----------------------------------------------+

show indexes from traffic;
+---------+------------+--------------+--------------+-------------+-----------+-------------+----------+--------+------+------------+---------+
| Table   | Non_unique | Key_name     | Seq_in_index | Column_name | Collation | Cardinality | Sub_part | Packed | Null | Index_type | Comment |
+---------+------------+--------------+--------------+-------------+-----------+-------------+----------+--------+------+------------+---------+
| traffic |          1 | basic_select |            1 | type_id     | A         |       13835 |     NULL | NULL   |      | BTREE      |         |
| traffic |          1 | basic_select |            2 | time        | A         |    18470357 |     NULL | NULL   |      | BTREE      |         |
| traffic |          1 | basic_select |            3 | type        | A         |    18470357 |     NULL | NULL   |      | BTREE      |         |
+---------+------------+--------------+--------------+-------------+-----------+-------------+----------+--------+------+------------+---------+

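This query takes anywhere from 30 seconds to 30 minutes to complete. I was hoping to improve things with better indexes or a different query, but I can't figure it out.

Update:

As suggested by helpful commenters, I created a primary key and added an index traffic_pk (time, type, type_id, id). Unfortunately, the cardinality of this new index turned out to be equal to or lower than that of my original index (basic_select), and MySQL still uses my original key.

Update 2: I dropped the original index basic_select, and EXPLAIN now shows a higher rows value but fewer steps in the Extra field. Query execution time also dropped to under a minute! (Still a bit too slow, but a major improvement!)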

mysql> SHOW CREATE TABLE traffic_test \G;
*************************** 1. row ***************************
       Table: traffic_test
Create Table: CREATE TABLE `traffic_test` (
  `traffic_id` int(10) unsigned NOT NULL AUTO_INCREMENT,
  `type` enum('v4_assignment','v4_host','v6_subnet','v6_assignment','v6_host') NOT NULL,
  `type_id` int(11) unsigned NOT NULL,
  `time` int(32) unsigned NOT NULL,
  `bytesin` bigint(20) unsigned NOT NULL DEFAULT '0',
  `bytesout` bigint(20) unsigned NOT NULL DEFAULT '0',
  PRIMARY KEY (`time`,`type`,`type_id`,`traffic_id`),
  KEY `traffic_id_IDX` (`traffic_id`)
) ENGINE=InnoDB AUTO_INCREMENT=24545159 DEFAULT CHARSET=latin1

Indexes on the table:

mysql> SHOW INDEX FROM traffic;
+--------------+------------+----------------+--------------+-------------+-----------+-------------+----------+--------+------+------------+---------+
| Table        | Non_unique | Key_name       | Seq_in_index | Column_name | Collation | Cardinality | Sub_part | Packed | Null | Index_type | Comment |
+--------------+------------+----------------+--------------+-------------+-----------+-------------+----------+--------+------+------------+---------+
| traffic_test |          0 | PRIMARY        |            1 | time        | A         |          18 |     NULL | NULL   |      | BTREE      |         |
| traffic_test |          0 | PRIMARY        |            2 | type        | A         |       38412 |     NULL | NULL   |      | BTREE      |         |
| traffic_test |          0 | PRIMARY        |            3 | type_id     | A         |    24545609 |     NULL | NULL   |      | BTREE      |         |
| traffic_test |          0 | PRIMARY        |            4 | traffic_id  | A         |    24545609 |     NULL | NULL   |      | BTREE      |         |
| traffic_test |          1 | traffic_id_IDX |            1 | traffic_id  | A         |    24545609 |     NULL | NULL   |      | BTREE      |         |
+--------------+------------+----------------+--------------+-------------+-----------+-------------+----------+--------+------+------------+---------+

I also simplified the query by not using OR:

SELECT SQL_NO_CACHE traffic.time, SUM(traffic.bytesin), SUM(traffic.bytesout) 
FROM    traffic
WHERE traffic.type LIKE 'v4_host' AND type_id IN (131,1974,1976,1514,1516,2767,2730,2731,2732,2733,2734,2769,2994,2709,1,4613,4614,4615,4616,326,1520,2652,1518,1521,1522,1523,1524,1525,2203,1515,1513,1467,1508,1973,1510,1975,1511,1475,1476,1468,1469,1470,1471,1472,1473,1500,1507,1478,1480,1481,1482,1483,1484,1485,1479,1486,1487,1488,1489,1490,1491,1495,1499,1494,2269,1474,1519,2204,2976,1922,1493,1492,1497,1496,1498,1501,1502,1503,1526,1509,1506) 
AND traffic.time >= 1342181721 
AND traffic.time < 1343391321 
GROUP BY traffic.time ASC;

Old execution time for this query:

3980 rows in set (6 min 15.27 sec)

New execution time:

3980 rows in set (24.80 sec)

EXPLAIN output:

+----+-------------+---------+-------+---------------+---------+---------+------+----------+-------------+
| id | select_type | table   | type  | possible_keys | key     | key_len | ref  | rows     | Extra       |
+----+-------------+---------+-------+---------------+---------+---------+------+----------+-------------+
|  1 | SIMPLE      | traffic | range | PRIMARY       | PRIMARY | 4       | NULL | 12272804 | Using where |
+----+-------------+---------+-------+---------------+---------+---------+------+----------+-------------+

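The rows value is still high. I think I can improve this by switching the order of type and type_id in the index, because there are only 4 possible types and many more type_ids.

Is this a correct assumption?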

Answer 1

MTI*_*hai 6

1. Table partitioning

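Because of the [AND traffic.time >= 1343772000 AND traffic.time < 1346450399] clause, I guess that you never delete data from this table, or that the table currently stores several months of data. The values in the [time] column appear to be Unix timestamps (1346450399 = Fri, 31 Aug 2012 21:59:59 GMT). Partition the table on the time column. This will speed up data retrieval because the database will scan only the relevant partitions, which is much faster than scanning the whole table.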

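A great partitioning tutorial can be found here: http://www.arachna.com/roller/spidaman/entry/scaling_rails_with_mysql_table
You will need to calculate the timestamp ranges for this, but that shouldn't be hard.
For example: (1346450399 - 1343772000) / 60 / 60 / 24 =~ 31 days. September has 30 days, so the maximum value for the partition holding September's data would be: 1346450399 + (30 * 24 * 60 * 60) = 1349042399.
A Unix date calculator can be found here: http://www.onlineconversion.com/unix_time.htm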
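
A minimal sketch of what monthly range partitioning on the time column could look like. It assumes MySQL 5.1 or later and that month boundaries follow the UTC+2 local time implied by the timestamps in the question (1343772000 is 1 Aug 2012 00:00:00 at UTC+2). The partition names and the catch-all partitions are illustrative assumptions, not part of the original post.

```sql
-- Boundary arithmetic (86400 seconds per day):
--   start of August    = 1343772000
--   start of September = 1343772000 + 31 * 86400 = 1346450400
--   start of October   = 1346450400 + 30 * 86400 = 1349042400
-- VALUES LESS THAN is exclusive, so each bound is the first second
-- of the following month.
ALTER TABLE traffic
PARTITION BY RANGE (`time`) (
    PARTITION p_before_2012_08 VALUES LESS THAN (1343772000),
    PARTITION p2012_08         VALUES LESS THAN (1346450400),
    PARTITION p2012_09         VALUES LESS THAN (1349042400),
    PARTITION p_future         VALUES LESS THAN MAXVALUE
);

-- Check which partitions a query would touch (partition pruning).
-- EXPLAIN PARTITIONS is the older syntax; from MySQL 5.7 on, plain
-- EXPLAIN already includes a partitions column.
EXPLAIN PARTITIONS
SELECT `time`, SUM(bytesin), SUM(bytesout)
FROM traffic
WHERE `time` >= 1343772000 AND `time` < 1346450400
GROUP BY `time`;
```

Two things to keep in mind: the ALTER TABLE rewrites the whole table, and MySQL requires every unique key (including a primary key) to contain the partitioning column. The original traffic table has no unique key, so the statement applies to it as written. New months can later be split off p_future with ALTER TABLE ... REORGANIZE PARTITION.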

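2. Rewrite the query

Because of the OR in your WHERE block, the optimizer may choose not to use the defined index. Try splitting the query into 2 SELECTs and combining them with a UNION.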

SELECT 
    traffic.time, 
    SUM(traffic.bytesin), 
    SUM(traffic.bytesout) 
FROM 
    traffic 
WHERE traffic.type LIKE 'v4_assignment' 
    AND type_id IN (1,2,3,4)
    AND traffic.time >= 1343772000 AND traffic.time <= 1346450399 
GROUP BY 
    traffic.time
UNION ALL
SELECT 
    traffic.time, 
    SUM(traffic.bytesin), 
    SUM(traffic.bytesout) 
FROM 
    traffic 
WHERE traffic.type LIKE 'v4_host' 
    AND type_id IN (5,6,7,8)
    AND traffic.time >= 1343772000 AND traffic.time <= 1346450399 
GROUP BY 
    traffic.time
ORDER BY
    `time`;  -- a table-qualified name is not allowed in a UNION-wide ORDER BY
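
The two branches above each return their own row per time value, so a graph that needs a single total per 5-minute interval has to add the two results up again. A minimal sketch of one way to do that, using an outer query over a UNION ALL derived table. The aliases t, total_in and total_out are illustrative names, and the ID lists are the same placeholders as above.

```sql
SELECT
    t.`time`,
    SUM(t.bytesin)  AS total_in,
    SUM(t.bytesout) AS total_out
FROM (
    -- Branch 1: assignments
    SELECT `time`, SUM(bytesin) AS bytesin, SUM(bytesout) AS bytesout
    FROM traffic
    WHERE type = 'v4_assignment'
        AND type_id IN (1,2,3,4)
        AND `time` >= 1343772000 AND `time` <= 1346450399
    GROUP BY `time`
    UNION ALL  -- ALL keeps both rows even when their sums are identical
    -- Branch 2: hosts
    SELECT `time`, SUM(bytesin), SUM(bytesout)
    FROM traffic
    WHERE type = 'v4_host'
        AND type_id IN (5,6,7,8)
        AND `time` >= 1343772000 AND `time` <= 1346450399
    GROUP BY `time`
) AS t
GROUP BY t.`time`
ORDER BY t.`time`;
```

UNION ALL is used rather than UNION so that no row is dropped by duplicate elimination before the outer SUM. Each branch still filters on a single type, which is the point of the rewrite.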

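3. New index based on data cardinality

Based on your EXPLAIN output, I don't see an index being used. Maybe that is because the optimizer decided a full table scan would be easier (cheaper) than following the index. Also, in your current index, the first column has a lower cardinality than the next 2 columns. As a rule of thumb, the first column in any index should be the one with the best (highest) cardinality.

Create a new index as: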

mysql> CREATE INDEX MTIhai_traffic_idx1 ON traffic (time, type, type_id);
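
Whether the optimizer actually picks the new index can be checked with EXPLAIN. A hedged sketch, assuming the index above has been created and reusing the placeholder ID list from section 2. FORCE INDEX is used here only to compare plans, not as a recommendation for production queries.

```sql
-- Plan the optimizer chooses on its own
EXPLAIN
SELECT `time`, SUM(bytesin), SUM(bytesout)
FROM traffic
WHERE type = 'v4_host'
    AND type_id IN (5,6,7,8)
    AND `time` >= 1343772000 AND `time` <= 1346450399
GROUP BY `time`;

-- Same query, forced onto the new index for comparison
EXPLAIN
SELECT `time`, SUM(bytesin), SUM(bytesout)
FROM traffic FORCE INDEX (MTIhai_traffic_idx1)
WHERE type = 'v4_host'
    AND type_id IN (5,6,7,8)
    AND `time` >= 1343772000 AND `time` <= 1346450399
GROUP BY `time`;
```

Compare the key, rows and Extra columns between the two plans. With time as the leading column, GROUP BY time can follow the index order, which matches what the question's Update 2 reports: a higher rows estimate, but no more Using temporary; Using filesort.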

Archived: 13 years, 2 months ago
Views: 16513
Last activity: 13 years, 1 month ago