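As part of a daily cron job, I need to run a query that processes a large amount of data. The data relates to visitors to the website, and the query updates it using data we captured earlier.
The query relies on 2 derived tables (the SELECT queries in the FROM clause) to do its work: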
SELECT
    new_visits.visitor_id, new_visits.visit_id, new_visits.visit_first_action_time,
    new_visits.purchased AS purchased,
    IFNULL(existing_visitors.purchased, 0) AS existing_purchased
FROM
    ( SELECT
          tv.visitor_id, tv.visit_id, tv.visit_first_action_time,
          IF(tc.idgoal = 0, 1, 0) AS purchased
      FROM
          tbl_visit tv LEFT OUTER JOIN tbl_conversion tc
      ON
          tv.visit_id = tc.visit_id AND tc.idgoal = 0
      WHERE
          tv.idsite = 12 AND tv.visit_id >= 477256
      ORDER BY tv.visit_id
      LIMIT 1000 ) new_visits
LEFT JOIN
    ( SELECT
          visitor_id, MAX(visit_seq) AS visit_seq, purchased
      FROM
          tbl_last_input_visit WHERE site_id = 12
      GROUP BY visitor_id ) existing_visitors
ON new_visits.visitor_id = existing_visitors.visitor_id
ORDER BY new_visits.visitor_id, new_visits.visit_id;
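This query works fine for smaller data sets. However, as the data grew, it became slower and slower, to the point where it now takes about 30 seconds to execute (it took about 1.5 seconds at the start).
The query plan is as follows: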
+----+-------------+------------------------+-------+-----------------------------------------------------------------------------------+---------------+---------+-------------------+---------+---------------------------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+-------------+------------------------+-------+-----------------------------------------------------------------------------------+---------------+---------+-------------------+---------+---------------------------------+
| 1 | PRIMARY | <derived2> | ALL | NULL | NULL | NULL | NULL | 1000 | Using temporary; Using filesort |
| 1 | PRIMARY | <derived3> | ALL | NULL | NULL | NULL | NULL | 705325 | |
| 3 | DERIVED | tbl_input_visit | ref | visitorid_seq,visitorid_idx | idvisitor_seq | 4 | | 490047 | Using where |
| 2 | DERIVED | tv | range | PRIMARY,index_idsite_config_datetime,index_idsite_datetime,index_idsite_idvisitor | PRIMARY | 4 | NULL | 4781309 | Using where |
| 2 | DERIVED | tc | ref | PRIMARY | PRIMARY | 8 | tv.idvisit | 1 | Using index |
+----+-------------+------------------------+-------+-----------------------------------------------------------------------------------+---------------+---------+-------------------+---------+---------------------------------+
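One option I have explored at this point is creating temporary tables. However, the overhead of doing so is very high. I also realize that because this query relies on derived tables, MySQL will not be able to reuse any of the underlying indexes.
Here are the CREATE statements for the tables involved: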
CREATE TABLE `tbl_last_input_visit` (
    `site_id` int(10) unsigned NOT NULL,
    `visitor_id` binary(8) NOT NULL,
    `visit_seq` int(10) unsigned NOT NULL,
    `purchase_cycle_seq` int(10) unsigned NOT NULL,
    `visit_in_cycle_seq` int(10) unsigned NOT NULL,
    `purchased` smallint(5) unsigned NOT NULL COMMENT 'l_ij',
    UNIQUE KEY `idvisitor_seq` (`site_id`,`idvisitor`,`visit_seq`),
    KEY `idvisitor_idx` (`site_id`,`idvisitor`)
) ENGINE=InnoDB;

CREATE TABLE `tbl_log_visit` (
    `visit_id` int(10) unsigned NOT NULL AUTO_INCREMENT,
    `idsite` int(10) unsigned NOT NULL,
    `idvisitor` binary(8) NOT NULL,
    `visit_last_action_time` DATETIME,
    `config_id` int(10) unsigned NOT NULL,
    PRIMARY KEY (`visit_id`),
    KEY `index_idsite_config_datetime` (`site_id`,`config_id`,`visit_last_action_time`),
    KEY `index_idsite_datetime` (`site_id`,`visit_last_action_time`),
    KEY `index_idsite_idvisitor` (`site_id`,`visitor_id`)
) ENGINE=InnoDB;

CREATE TABLE `tbl_log_conversion` (
    `visit_id` int(10) unsigned NOT NULL,
    `site_id` int(10) unsigned NOT NULL,
    `visitor_id` binary(8) NOT NULL,
    `idgoal` int(10) NOT NULL,
    `idorder` int(10) NOT NULL,
    PRIMARY KEY (`visit_id`,`idgoal`),
    UNIQUE KEY `unique_idsite_idorder` (`site_id`,`idorder`)
) ENGINE=InnoDB;
Is there any way to improve the performance of this query?
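So, as @Dmitriy mentioned, the root problem is with the derived queries. Basically, with huge data sets, derived tables can cause a lot of trouble, because the underlying indexes on the tables that make up the query are not available to the derived query.
In short, if you write a derived query over a SELECT on tblA and tblB, the indexes of tblA and tblB are not available to the derived query. So if the data sets returned from tblA and tblB are huge, the resulting query will be very slow.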
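I ended up fixing the problem by breaking the derived queries up into separate queries and matching the results in the application layer.
I also got a considerable performance gain by putting an index on the columns that contribute to the GROUP BY clause in one of the queries. (Many thanks to @strawberry!)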
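The application-layer matching described above might look roughly like the following. This is only a sketch under assumptions: the answer does not say which language the application uses (Python is used here for illustration), the function name `merge_purchase_flags` and the row shapes are hypothetical, and the sample rows are made up. It assumes the two derived-table SELECTs are run as separate queries and their rows are passed in as tuples.

```python
from typing import Dict, Iterable, List, Tuple

# Row shapes are assumptions for illustration:
#   new_visits rows:        (visitor_id, visit_id, visit_first_action_time, purchased)
#   existing_visitors rows: (visitor_id, visit_seq, purchased), one row per visitor
NewVisit = Tuple[bytes, int, str, int]
ExistingVisitor = Tuple[bytes, int, int]
Merged = Tuple[bytes, int, str, int, int]


def merge_purchase_flags(
    new_visits: Iterable[NewVisit],
    existing_visitors: Iterable[ExistingVisitor],
) -> List[Merged]:
    """Emulate the LEFT JOIN and IFNULL(..., 0) of the original query in memory."""
    # Index the second result set by visitor_id so each lookup is a dict access.
    existing_purchased: Dict[bytes, int] = {
        visitor_id: purchased
        for visitor_id, _visit_seq, purchased in existing_visitors
    }

    merged = [
        (visitor_id, visit_id, first_action, purchased,
         existing_purchased.get(visitor_id, 0))  # 0 plays the role of IFNULL
        for visitor_id, visit_id, first_action, purchased in new_visits
    ]
    # Same ordering as ORDER BY new_visits.visitor_id, new_visits.visit_id.
    merged.sort(key=lambda row: (row[0], row[1]))
    return merged


# Made-up sample rows standing in for the two query results.
sample_new_visits = [
    (b"\x02", 11, "2014-01-02 10:00:00", 1),
    (b"\x01", 10, "2014-01-01 09:00:00", 0),
]
sample_existing = [(b"\x01", 3, 1)]
sample_rows = merge_purchase_flags(sample_new_visits, sample_existing)
```

Since the first query is limited to 1000 rows, the second query could probably be restricted to just the visitor_ids in that batch, which would keep the in-memory lookup small.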