Improving query performance on a large dataset

ani*_*van 5 mysql

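As part of a daily cron job, I need to run a query that processes a large amount of data. The data concerns visitors to a website, and the job updates it using data we captured earlier.

The query relies on 2 derived tables (SELECT queries in the FROM clause) to do its work: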

SELECT  
  new_visits.visitor_id, new_visits.visit_id, new_visits.visit_first_action_time,
  new_visits.purchased as purchased,  
  ifnull(existing_visitors.purchased, 0) as existing_purchased 
FROM   

    ( SELECT          
        tv.visitor_id, tv.visit_id, tv.visit_first_action_time, 
        if(tc.idgoal=0,1,0) as purchased                       
      FROM 
        tbl_visit tv left outer join tbl_conversion tc         
      ON 
        tv.visit_id = tc.visit_id AND tc.idgoal = 0                       
      WHERE
        tv.idsite= 12 AND tv.visit_id >= 477256              
      ORDER BY tv.visit_id       
      LIMIT 1000 ) new_visits

   LEFT JOIN          

   ( SELECT 
       visitor_id, max(visit_seq) as visit_seq, purchased 
     FROM 
       tbl_last_input_visit where site_id = 12 
     GROUP BY visitor_id ) existing_visitors      

   ON new_visits.visitor_id = existing_visitors.visitor_id 

ORDER BY new_visits.visitor_id, new_visits.visit_id;

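This query worked fine on smaller datasets. As the data grew, however, it became steadily slower, to the point where it now takes about 30 seconds to execute (it took about 1.5 seconds at the start).

The query plan is as follows: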

+----+-------------+------------------------+-------+-----------------------------------------------------------------------------------+---------------+---------+-------------------+---------+---------------------------------+
| id | select_type | table                  | type  | possible_keys                                                                     | key           | key_len | ref               | rows    | Extra                           |
+----+-------------+------------------------+-------+-----------------------------------------------------------------------------------+---------------+---------+-------------------+---------+---------------------------------+
|  1 | PRIMARY     | <derived2>             | ALL   | NULL                                                                              | NULL          | NULL    | NULL              |    1000 | Using temporary; Using filesort |
|  1 | PRIMARY     | <derived3>             | ALL   | NULL                                                                              | NULL          | NULL    | NULL              |  705325 |                                 |
|  3 | DERIVED     | tbl_input_visit        | ref   | visitorid_seq,visitorid_idx                                                       | idvisitor_seq | 4       |                   |  490047 | Using where                     |
|  2 | DERIVED     | tv                     | range | PRIMARY,index_idsite_config_datetime,index_idsite_datetime,index_idsite_idvisitor | PRIMARY       | 4       | NULL              | 4781309 | Using where                     |
|  2 | DERIVED     | tc                     | ref   | PRIMARY                                                                           | PRIMARY       | 8       | tv.idvisit        |       1 | Using index                     |
+----+-------------+------------------------+-------+-----------------------------------------------------------------------------------+---------------+---------+-------------------+---------+---------------------------------+

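One option I have explored so far is creating temporary tables, but the overhead of doing so is very high. I am also aware that, because this query relies on derived tables, MySQL cannot reuse any of the underlying indexes.

Here are the CREATE statements for the tables involved: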

CREATE TABLE `tbl_last_input_visit` (
  `site_id` int(10) unsigned NOT NULL,
  `visitor_id` binary(8) NOT NULL,
  `visit_seq` int(10) unsigned NOT NULL,
  `purchase_cycle_seq` int(10) unsigned NOT NULL,
  `visit_in_cycle_seq` int(10) unsigned NOT NULL,
  `purchased` smallint(5) unsigned NOT NULL COMMENT 'l_ij',
  UNIQUE KEY `idvisitor_seq` (`site_id`,`idvisitor`,`visit_seq`),
  KEY `idvisitor_idx` (`site_id`,`idvisitor`)
) ENGINE=InnoDB

CREATE TABLE `tbl_log_visit` (
  `visit_id` int(10) unsigned NOT NULL AUTO_INCREMENT,
  `idsite` int(10) unsigned NOT NULL,
  `idvisitor` binary(8) NOT NULL,
  `visit_last_action_time` DATETIME,
  `config_id` int(10) unsigned NOT NULL,
  PRIMARY KEY (`visit_id`),
  KEY `index_idsite_config_datetime` (`site_id`,`config_id`,`visit_last_action_time`),
  KEY `index_idsite_datetime` (`site_id`,`visit_last_action_time`),
  KEY `index_idsite_idvisitor` (`site_id`,`visitor_id`)
) ENGINE=InnoDB

CREATE TABLE `tbl_log_conversion` (
  `visit_id` int(10) unsigned NOT NULL,
  `site_id` int(10) unsigned NOT NULL,
  `visitor_id` binary(8) NOT NULL,
  `idgoal` int(10) NOT NULL,
  `idorder` int(10) NOT NULL,
  PRIMARY KEY (`visit_id`,`idgoal`),
  UNIQUE KEY `unique_idsite_idorder` (`site_id`,`idorder`)
) ENGINE=InnoDB

Is there any way to improve the performance of this query?

ani*_*van 1

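So, as @Dmitriy mentioned, the root problem lies in the derived queries. With huge datasets, derived tables can cause a lot of trouble, because the indexes on the tables that make up a derived query are not available to the query that consumes it.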

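In short, if you write a derived query as a SELECT over tblA and tblB, the indexes on tblA and tblB are not available on the derived result. So if the dataset returned from tblA and tblB is large, the resulting query will be very slow.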

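I ended up fixing the problem by breaking the derived queries into separate queries and matching up the results in the application layer. I also got a considerable performance boost by adding an index on the columns that contribute to the GROUP BY clause in one of the queries. (Many thanks to @strawberry!)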
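
For anyone wanting to try the same approach, here is a minimal sketch of what "separate queries matched in the application layer" could look like. It is not my production code. It assumes Python with an in-memory SQLite database standing in for MySQL, cut-down versions of the tables from the question, and a hypothetical index name (idx_site_visitor_seq). Restricting the second query to the visitor IDs in the current batch, and picking the latest row per visitor in Python instead of with GROUP BY, are also assumptions made for the illustration.

```python
import sqlite3


def build_demo_db():
    """Create cut-down stand-ins for the question's tables (SQLite, not MySQL)."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE tbl_visit (visit_id INTEGER PRIMARY KEY, idsite INTEGER, visitor_id BLOB);
        CREATE TABLE tbl_conversion (visit_id INTEGER, idgoal INTEGER, PRIMARY KEY (visit_id, idgoal));
        CREATE TABLE tbl_last_input_visit (site_id INTEGER, visitor_id BLOB, visit_seq INTEGER, purchased INTEGER);
        -- Hypothetical index covering the filter and per-visitor ordering columns.
        CREATE INDEX idx_site_visitor_seq ON tbl_last_input_visit (site_id, visitor_id, visit_seq);
    """)
    conn.executemany("INSERT INTO tbl_visit VALUES (?, ?, ?)",
                     [(1, 12, b"B"), (2, 12, b"A"), (3, 12, b"B"), (4, 99, b"C")])
    conn.executemany("INSERT INTO tbl_conversion VALUES (?, ?)", [(2, 0), (3, 1)])
    conn.executemany("INSERT INTO tbl_last_input_visit VALUES (?, ?, ?, ?)",
                     [(12, b"B", 1, 0), (12, b"B", 2, 1), (12, b"C", 1, 1)])
    return conn


def fetch_new_visits(conn, site_id, min_visit_id, limit=1000):
    """Query 1: next batch of visits, flagged 1 when a goal-0 conversion exists."""
    sql = """
        SELECT tv.visitor_id, tv.visit_id,
               CASE WHEN tc.idgoal = 0 THEN 1 ELSE 0 END AS purchased
        FROM tbl_visit tv
        LEFT JOIN tbl_conversion tc
               ON tv.visit_id = tc.visit_id AND tc.idgoal = 0
        WHERE tv.idsite = ? AND tv.visit_id >= ?
        ORDER BY tv.visit_id
        LIMIT ?
    """
    return conn.execute(sql, (site_id, min_visit_id, limit)).fetchall()


def fetch_last_purchased(conn, site_id, visitor_ids):
    """Query 2: {visitor_id: purchased} from each visitor's highest visit_seq row."""
    if not visitor_ids:
        return {}
    # One placeholder per ID; very large batches may need chunking.
    placeholders = ",".join("?" * len(visitor_ids))
    sql = f"""
        SELECT visitor_id, visit_seq, purchased
        FROM tbl_last_input_visit
        WHERE site_id = ? AND visitor_id IN ({placeholders})
        ORDER BY visitor_id, visit_seq
    """
    latest = {}
    for visitor_id, _seq, purchased in conn.execute(sql, (site_id, *visitor_ids)):
        latest[visitor_id] = purchased  # rows arrive in visit_seq order, so the last one wins
    return latest


def merge_visits(conn, site_id, min_visit_id, limit=1000):
    """Match the two result sets in the application layer instead of joining derived tables."""
    new_visits = fetch_new_visits(conn, site_id, min_visit_id, limit)
    visitor_ids = sorted({row[0] for row in new_visits})
    existing = fetch_last_purchased(conn, site_id, visitor_ids)
    merged = [(visitor_id, visit_id, purchased, existing.get(visitor_id, 0))
              for visitor_id, visit_id, purchased in new_visits]
    merged.sort(key=lambda row: (row[0], row[1]))  # same order as the original ORDER BY
    return merged
```

Calling merge_visits(build_demo_db(), 12, 1) returns tuples of (visitor_id, visit_id, purchased, existing_purchased), with existing_purchased defaulting to 0 for visitors that have no earlier row, which mirrors the ifnull(..., 0) in the original query. Because each query now runs directly against a base table, the optimizer is free to use that table's indexes.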