Par*_*lar 7 mysql performance query-performance
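I am working on a query that needs IN, BETWEEN, GROUP BY, JOIN, and ORDER BY all in the same statement. I am struggling with its performance, so I need help choosing indexes, or changing the table structure if indexes will not help.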
Both tables hold millions of rows.
table_1 stores visitor details such as name, age, gender, etc.
Results can be sorted by age, visits_count, etc.
CREATE TABLE `table_1` (
`visitor_id` varchar(32) CHARACTER SET ascii NOT NULL,
`name` varchar(200) NOT NULL,
`gender` varchar(1) NOT NULL DEFAULT 'M',
`mobile_number` int(10) unsigned DEFAULT NULL,
`age` tinyint(1) unsigned NOT NULL DEFAULT '1',
`visits_count` mediumint(5) unsigned NOT NULL DEFAULT '0',
PRIMARY KEY (`visitor_id`),
KEY `indx_t1_test` (`visitor_id`,`visits_count`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8
CREATE TABLE `table_2` (
`company_id` bigint(20) unsigned NOT NULL,
`visitor_id` varchar(32) CHARACTER SET ascii NOT NULL,
`time_duration` mediumint(5) unsigned NOT NULL DEFAULT '0',
`visited_on` date NOT NULL,
PRIMARY KEY (`company_id`,`visitor_id`,`visited_on`),
KEY `indx_t2_test` (`visited_on`,`company_id`,`visitor_id`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8
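I want to get 20 (pagination) unique visitors (GROUP BY / DISTINCT part) who visited a particular set of companies (IN part) during a selected period (BETWEEN part), ordered by age (ORDER BY part).
The first query I wrote for this is: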
SELECT
t1.visitor_id
FROM table_1 AS t1
INNER JOIN table_2 AS t2 ON t2.visitor_id = t1.visitor_id
WHERE
t2.company_id IN (528,211,1275,521,1299,493,492,852,868,869,1235,486,485,1238,855,1237,651,538,1241,1240,548,543,1247,1253,490,468,582,583,569,477,488,802,1294,518,1274,476,545,1267,556,479,1266,1265,541,1189,1263,1152,1260,478,1257,885,1139,1256,804,708,547,561,1239,1142,1226,1148,1230,529,1223,1192,1191,874,830,822,818,817,794,718,487,709,706,705,669,513,455) AND
t2.visited_on BETWEEN '2015-01-01' AND '2017-01-31'
GROUP BY t1.visitor_id
ORDER BY t1.`visits_count` DESC
LIMIT 20;
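When I run this query for any single company, it returns data quickly enough (performance is fine when the number of matching rows is small).
The problem appears as the number of companies in the IN part grows (I need to support 100 companies there): the query then takes around 36 seconds to return results.
EXPLAIN output of this query: [output not preserved in this copy]
The second query I could think of for the same case is: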
SELECT
(
SELECT
t2.visitor_id
FROM table_2 AS t2
WHERE
t2.company_id IN (528,211,1275,521,1299,493,492,852,868,869,1235,486,485,1238,855,1237,651,538,1241,1240,548,543,1247,1253,490,468,582,583,569,477,488,802,1294,518,1274,476,545,1267,556,479,1266,1265,541,1189,1263,1152,1260,478,1257,885,1139,1256,804,708,547,561,1239,1142,1226,1148,1230,529,1223,1192,1191,874,830,822,818,817,794,718,487,709,706,705,669,513,455)
AND t2.visitor_id = t1.`visitor_id`
AND t2.visited_on BETWEEN '2015-01-01' AND '2017-01-31'
LIMIT 1
) AS visitor_id
FROM `table_1` AS t1
HAVING visitor_id IS NOT NULL
ORDER BY t1.`visits_count` DESC
LIMIT 0, 20
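This query behaves the opposite way from the first one. If I run it for a company with few visitors, performance is very poor: about 38 seconds with a single company in the IN part, where that company has only 3-4 visitors. With more companies in the IN part it returns results faster than with one company (about 13 seconds), but that is still not usable.
EXPLAIN output of this query: [output not preserved in this copy]
To eliminate the IN part of the query, I created a temporary table, inserted the company IDs into it, and used a JOIN: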
SELECT
DISTINCT
t1.visitor_id
FROM `table_1` AS t1
INNER JOIN `table_2` AS t2 ON t1.`visitor_id` = t2.visitor_id
INNER JOIN temp_table AS t3 ON t3.company_id = t2.company_id
ORDER BY t1.`visits_count` DESC
LIMIT 0, 20;
This query also takes up to 22 seconds. I need this listing to run in 2-3 seconds.
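For reference, the temporary-table setup described above might look like the following. This is only a sketch: the question names temp_table and its company_id column but does not show the definition, so the column type (copied from table_2), the primary key, and the three sample IDs are assumptions. The query as posted has no visited_on filter, so its timing is not directly comparable with the first two queries; the sketch adds the filter back. It also selects visits_count alongside visitor_id, since some MySQL SQL modes reject a DISTINCT query that orders by a column missing from the select list.

```sql
-- Assumed definition: the question does not show how temp_table was created.
CREATE TEMPORARY TABLE temp_table (
  company_id BIGINT UNSIGNED NOT NULL,
  PRIMARY KEY (company_id)
) ENGINE=InnoDB;

-- Sample IDs taken from the IN list above; the real list is longer.
INSERT INTO temp_table (company_id) VALUES (528), (211), (1275);

-- visits_count is determined by visitor_id (the primary key of table_1),
-- so adding it to the select list does not change which visitors come back.
SELECT DISTINCT t1.visitor_id, t1.visits_count
FROM table_1 AS t1
INNER JOIN table_2 AS t2 ON t1.visitor_id = t2.visitor_id
INNER JOIN temp_table AS t3 ON t3.company_id = t2.company_id
WHERE t2.visited_on BETWEEN '2015-01-01' AND '2017-01-31'
ORDER BY t1.visits_count DESC
LIMIT 0, 20;
```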
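innodb_buffer_pool_size is 12GB.
The instance is db.r3.xlarge.
The query SELECT COUNT(*) FROM table_2 WHERE company_id IN (...) AND visited_on BETWEEN '2015-01-01' AND '2017-01-31' returns 2660123.
The query is only slow the first time. If I run the same query again, it is much faster (0.2 seconds). But if I change the limit to LIMIT 20, 20, it again takes 24 seconds the first time and is faster the second time. This is probably due to innodb_buffer_pool_size.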
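One way to test the guess that the first-run slowness comes from pages not yet being cached is to compare InnoDB's read counters before and after running the query. A minimal sketch using standard MySQL status variables; the counters are server-wide, so the comparison only means something on an otherwise quiet instance (an assumption about the test setup).

```sql
-- Logical read requests vs. reads that had to go to disk, before the query.
SHOW GLOBAL STATUS LIKE 'Innodb_buffer_pool_read_requests';
SHOW GLOBAL STATUS LIKE 'Innodb_buffer_pool_reads';

-- ... run the slow query here ...

-- Same counters afterwards; subtract the earlier values to get the deltas.
SHOW GLOBAL STATUS LIKE 'Innodb_buffer_pool_read_requests';
SHOW GLOBAL STATUS LIKE 'Innodb_buffer_pool_reads';

-- Confirm the configured pool size (reported in bytes).
SHOW VARIABLES LIKE 'innodb_buffer_pool_size';
```

A large jump in Innodb_buffer_pool_reads on the first run and almost none on the second would be consistent with the cold-cache explanation.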
The output of EXPLAIN FORMAT=JSON SELECT ...; is as follows.
{
"query_block": {
"select_id": 1,
"ordering_operation": {
"using_filesort": true,
"grouping_operation": {
"using_temporary_table": true,
"using_filesort": false,
"nested_loop": [
{
"table": {
"table_name": "t2",
"access_type": "range",
"possible_keys": [
"PRIMARY",
"indx_t2_test"
],
"key": "PRIMARY",
"used_key_parts": [
"company_id"
],
"key_length": "8",
"rows": 17301,
"filtered": 100,
"using_index": true,
"attached_condition": "((`db`.`t2`.`company_id` in (528,211,1275,521,1299,493,492,852,868,869,1235,486,485,1238,855,1237,651,538,1241,1240,548,543,1247,1253,490,468,582,583,569,477,488,802,1294,518,1274,476,545,1267,556,479,1266,1265,541,1189,1263,1152,1260,478,1257,885,1139,1256,804,708,547,561,1239,1142,1226,1148,1230,529,1223,1192,1191,874,830,822,818,817,794,718,487,709,706,705,669,513,455)) and (`db`.`t2`.`visited_on` between '2015-01-01' and '2017-01-31'))"
}
},
{
"table": {
"table_name": "t1",
"access_type": "eq_ref",
"possible_keys": [
"PRIMARY",
"indx_t1_test"
],
"key": "PRIMARY",
"used_key_parts": [
"visitor_id"
],
"key_length": "34",
"ref": [
"db.t2.visitor_id"
],
"rows": 1,
"filtered": 100
}
}
]
}
}
}
}
Output for the query suggested by Rick James:
SELECT
t2.visitor_id
FROM (
SELECT
DISTINCT visitor_id
FROM table_2
WHERE
company_id IN (528,211,1275,521,1299,493,492,852, 868,
869,1235,486,485,1238,855,1237,651,538,1241,1240, 548,
543,1247,1253,490,468,582,583,569,477,488,802,1294, 518,
1274,476,545,1267,556,479,1266,1265,541,1189,1263, 1152,
1260,478,1257,885,1139,1256,804,708,547,561,1239, 1142,
1226,1148,1230,529,1223,1192,1191,874,830,822,818, 817,
794,718,487,709,706,705,669,513,455)
AND visited_on BETWEEN '2015-01-01' AND '2017-01-31'
) AS t2
INNER JOIN table_1 AS t1 ON t2.visitor_id = t1.visitor_id
ORDER BY t1.`visits_count` DESC
LIMIT 20;
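EXPLAIN output of this query: [output not preserved in this copy]
This query takes 58 seconds.
EXPLAIN output of the inner subquery: [output not preserved in this copy]
Query: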
SELECT
COUNT(DISTINCT company_id, visited_on, visitor_id),
COUNT(DISTINCT company_id, LEFT(visited_on, 7), visitor_id),
COUNT(*)
FROM table_2;
Returns:
COUNT(DISTINCT company_id, visited_on, visitor_id) = 7607938
COUNT(DISTINCT company_id, LEFT(visited_on, 7), visitor_id) = 5194480
COUNT(*) = 7607938
Note that this output is from the latest data, so the COUNT(*) row count may have grown.
age int(3) unsigned: this lets you store ages up to 4 billion and takes 4 bytes. Change it to TINYINT UNSIGNED (1 byte).
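The column change could be applied roughly as follows. This is a sketch, not a required step: the CREATE TABLE shown in the question already declares age as tinyint unsigned, so it only applies to a schema that still uses INT. The NOT NULL DEFAULT 1 attributes are copied from the question's definition.

```sql
-- Only needed if age is still declared as INT.
-- On a table with millions of rows this can take a while to run.
ALTER TABLE table_1
  MODIFY COLUMN age TINYINT UNSIGNED NOT NULL DEFAULT 1;
```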
Are names stored in ASCII? US only? Even then, some unusual names could not be stored.
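I am confused by the PRIMARY KEY of t2. Since a PK is unique, it does not allow recording a person visiting a company more than once. If that restriction is OK, add the following (in case the optimizer decides that the date range is the best filter):
INDEX(visited_on, company_id, visitor_id)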
If my hunch is correct, change the PK and add the index:
PRIMARY KEY(`company_id`, `visitor_id`, visited_on),
INDEX(visited_on, company_id, visitor_id)
Then check your various queries.
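One way to do that check is to compare the plan the optimizer picks on its own with the plan when it is forced onto the date-leading index. A sketch under assumptions: the table definition in the question already has an index on (visited_on, company_id, visitor_id) named indx_t2_test, so that name is used here, and the IN list is shortened to three IDs for illustration.

```sql
-- Plan chosen by the optimizer on its own (the question's EXPLAIN shows PRIMARY).
EXPLAIN
SELECT DISTINCT visitor_id
FROM table_2
WHERE company_id IN (528, 211, 1275)
  AND visited_on BETWEEN '2015-01-01' AND '2017-01-31';

-- Same query forced onto the date-leading index, for comparison.
EXPLAIN
SELECT DISTINCT visitor_id
FROM table_2 FORCE INDEX (indx_t2_test)
WHERE company_id IN (528, 211, 1275)
  AND visited_on BETWEEN '2015-01-01' AND '2017-01-31';
```

Comparing the key, rows, and Extra columns of the two plans shows which index the optimizer prefers and how many rows it expects to read with each.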