Updating millions of records on an inner-joined subquery - optimization techniques

Ale*_*ger 7 mysql sql optimization mysql-5.6

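I'm looking for advice on how to better optimize this query.

For each _piece_detail record that:

  1. Contains at least one matching _scan record on (zip, zip_4, zip_delivery_point, serial_number)
  2. Belongs to a company in mailing_groups (through a chain of relationships)
  3. Has either:
    1. A first_scan_date_time that is greater than the MIN(scan_date_time) of the related _scan records
    2. A latest_scan_date_time that is less than the MAX(scan_date_time) of the related _scan records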

I need to:

  1. Set _piece_detail.first_scan_date_time to MIN(_scan.scan_date_time)
  2. Set _piece_detail.latest_scan_date_time to MAX(_scan.scan_date_time)

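Since I'm dealing with millions of records, I'm trying to reduce the number of records I actually have to search through. Here are some facts about the data:

  1. The _piece_detail table is partitioned by job_id, so it seems to make the most sense to run these checks in order of _piece_detail.job_id, _piece_detail.piece_id.
  2. The _scan table now contains over 100,000,000 records and is partitioned by (zip, zip_4, zip_delivery_point, serial_number, scan_date_time), which is the same key used to match _scan with _piece_detail (apart from scan_date_time).
  3. Only about 40% of _piece_detail records belong to a mailing_group, but we don't know which ones until we have run through the full chain of joins.
  4. Only about 30% of _scan records belong to a _piece_detail that has a mailing_group.
  5. There are typically between 0 and 4 _scan records per _piece_detail.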

Now, I'm struggling to find a way to execute this in a reasonable amount of time. I originally started with something like this:

UPDATE _piece_detail
    INNER JOIN (
        SELECT _piece_detail.job_id, _piece_detail.piece_id, MIN(_scan.scan_date_time) as first_scan_date_time, MAX(_scan.scan_date_time) as latest_scan_date_time
        FROM _piece_detail
            INNER JOIN _container_quantity 
                ON _piece_detail.cqt_database_id = _container_quantity.cqt_database_id 
                AND _piece_detail.job_id = _container_quantity.job_id
            INNER JOIN _container_summary 
                ON _container_quantity.container_id = _container_summary.container_id 
                AND _container_summary.job_id = _container_quantity.job_id
            INNER JOIN _mail_piece_unit 
                ON _container_quantity.mpu_id = _mail_piece_unit.mpu_id 
                AND _container_quantity.job_id = _mail_piece_unit.job_id
            INNER JOIN _header 
                ON _header.job_id = _piece_detail.job_id
            INNER JOIN mailing_groups 
                ON _mail_piece_unit.mpu_company = mailing_groups.mpu_company
            INNER JOIN _scan
                ON _scan.zip = _piece_detail.zip 
                AND _scan.zip_4 = _piece_detail.zip_4 
                AND _scan.zip_delivery_point = _piece_detail.zip_delivery_point 
                AND _scan.serial_number = _piece_detail.serial_number 
        GROUP BY _piece_detail.job_id, _piece_detail.piece_id, _scan.zip, _scan.zip_4, _scan.zip_delivery_point, _scan.serial_number
    ) as t1 ON _piece_detail.job_id = t1.job_id AND _piece_detail.piece_id = t1.piece_id 
SET _piece_detail.first_scan_date_time = t1.first_scan_date_time, _piece_detail.latest_scan_date_time = t1.latest_scan_date_time
WHERE _piece_detail.first_scan_date_time > t1.first_scan_date_time 
    OR _piece_detail.latest_scan_date_time < t1.latest_scan_date_time;

I think this may be trying to load too much into memory at once, and it might not be using the indexes properly.

Then I thought I might be able to avoid that huge joined subquery by adding two LEFT JOIN subqueries to get the min/max, like so:

UPDATE _piece_detail
    INNER JOIN _container_quantity 
        ON _piece_detail.cqt_database_id = _container_quantity.cqt_database_id 
        AND _piece_detail.job_id = _container_quantity.job_id
    INNER JOIN _container_summary 
        ON _container_quantity.container_id = _container_summary.container_id 
        AND _container_summary.job_id = _container_quantity.job_id
    INNER JOIN _mail_piece_unit 
        ON _container_quantity.mpu_id = _mail_piece_unit.mpu_id 
        AND _container_quantity.job_id = _mail_piece_unit.job_id
    INNER JOIN _header 
        ON _header.job_id = _piece_detail.job_id
    INNER JOIN mailing_groups 
        ON _mail_piece_unit.mpu_company = mailing_groups.mpu_company
    LEFT JOIN _scan fs ON (fs.zip, fs.zip_4, fs.zip_delivery_point, fs.serial_number) = (
        SELECT zip, zip_4, zip_delivery_point, serial_number
        FROM _scan
        WHERE zip = _piece_detail.zip 
            AND zip_4 = _piece_detail.zip_4 
            AND zip_delivery_point = _piece_detail.zip_delivery_point 
            AND serial_number = _piece_detail.serial_number
        ORDER BY scan_date_time ASC
        LIMIT 1
        )
    LEFT JOIN _scan ls ON (ls.zip, ls.zip_4, ls.zip_delivery_point, ls.serial_number) = (
        SELECT zip, zip_4, zip_delivery_point, serial_number
        FROM _scan
        WHERE zip = _piece_detail.zip 
            AND zip_4 = _piece_detail.zip_4 
            AND zip_delivery_point = _piece_detail.zip_delivery_point 
            AND serial_number = _piece_detail.serial_number
        ORDER BY scan_date_time DESC
        LIMIT 1
        )
SET _piece_detail.first_scan_date_time = fs.scan_date_time, _piece_detail.latest_scan_date_time = ls.scan_date_time
WHERE _piece_detail.first_scan_date_time > fs.scan_date_time 
    OR _piece_detail.latest_scan_date_time < ls.scan_date_time

These are the EXPLAIN outputs when I convert them to SELECT statements:

+----+-------------+---------------------+--------+----------------------------------------------------+---------------+---------+------------------------------------------------------------------------------------------------------------------------+--------+----------------------------------------------+
| id | select_type | table               | type   | possible_keys                                      | key           | key_len | ref                                                                                                                    | rows   | Extra                                        |
+----+-------------+---------------------+--------+----------------------------------------------------+---------------+---------+------------------------------------------------------------------------------------------------------------------------+--------+----------------------------------------------+
|  1 | PRIMARY     | <derived2>          | ALL    | NULL                                               | NULL          | NULL    | NULL                                                                                                                   | 844161 | NULL                                         |
|  1 | PRIMARY     | _piece_detail       | eq_ref | PRIMARY,first_scan_date_time,latest_scan_date_time | PRIMARY       | 18      | t1.job_id,t1.piece_id                                                                                                  |      1 | Using where                                  |
|  2 | DERIVED     | _header             | index  | PRIMARY                                            | date_prepared | 3       | NULL                                                                                                                   |     87 | Using index; Using temporary; Using filesort |
|  2 | DERIVED     | _piece_detail       | ref    | PRIMARY,cqt_database_id,zip                        | PRIMARY       | 10      | odms._header.job_id                                                                                                    |   9703 | NULL                                         |
|  2 | DERIVED     | _container_quantity | eq_ref | unique,mpu_id,job_id,job_id_container_quantity     | unique        | 14      | odms._header.job_id,odms._piece_detail.cqt_database_id                                                                 |      1 | NULL                                         |
|  2 | DERIVED     | _mail_piece_unit    | eq_ref | PRIMARY,company,job_id_mail_piece_unit             | PRIMARY       | 14      | odms._container_quantity.mpu_id,odms._header.job_id                                                                    |      1 | Using where                                  |
|  2 | DERIVED     | mailing_groups      | eq_ref | PRIMARY                                            | PRIMARY       | 27      | odms._mail_piece_unit.mpu_company                                                                                      |      1 | Using index                                  |
|  2 | DERIVED     | _container_summary  | eq_ref | unique,container_id,job_id_container_summary       | unique        | 14      | odms._header.job_id,odms._container_quantity.container_id                                                              |      1 | Using index                                  |
|  2 | DERIVED     | _scan               | ref    | PRIMARY                                            | PRIMARY       | 28      | odms._piece_detail.zip,odms._piece_detail.zip_4,odms._piece_detail.zip_delivery_point,odms._piece_detail.serial_number |      1 | Using index                                  |
+----+-------------+---------------------+--------+----------------------------------------------------+---------------+---------+------------------------------------------------------------------------------------------------------------------------+--------+----------------------------------------------+

+----+--------------------+---------------------+--------+--------------------------------------------------------------------+---------------+---------+------------------------------------------------------------------------------------------------------------------------+-----------+-----------------------------------------------------------------+
| id | select_type        | table               | type   | possible_keys                                                      | key           | key_len | ref                                                                                                                    | rows      | Extra                                                           |
+----+--------------------+---------------------+--------+--------------------------------------------------------------------+---------------+---------+------------------------------------------------------------------------------------------------------------------------+-----------+-----------------------------------------------------------------+
|  1 | PRIMARY            | _header             | index  | PRIMARY                                                            | date_prepared | 3       | NULL                                                                                                                   |        87 | Using index                                                     |
|  1 | PRIMARY            | _piece_detail       | ref    | PRIMARY,cqt_database_id,first_scan_date_time,latest_scan_date_time | PRIMARY       | 10      | odms._header.job_id                                                                                                    |      9703 | NULL                                                            |
|  1 | PRIMARY            | _container_quantity | eq_ref | unique,mpu_id,job_id,job_id_container_quantity                     | unique        | 14      | odms._header.job_id,odms._piece_detail.cqt_database_id                                                                 |         1 | NULL                                                            |
|  1 | PRIMARY            | _mail_piece_unit    | eq_ref | PRIMARY,company,job_id_mail_piece_unit                             | PRIMARY       | 14      | odms._container_quantity.mpu_id,odms._header.job_id                                                                    |         1 | Using where                                                     |
|  1 | PRIMARY            | mailing_groups      | eq_ref | PRIMARY                                                            | PRIMARY       | 27      | odms._mail_piece_unit.mpu_company                                                                                      |         1 | Using index                                                     |
|  1 | PRIMARY            | _container_summary  | eq_ref | unique,container_id,job_id_container_summary                       | unique        | 14      | odms._header.job_id,odms._container_quantity.container_id                                                              |         1 | Using index                                                     |
|  1 | PRIMARY            | fs                  | index  | NULL                                                               | updated       | 1       | NULL                                                                                                                   | 102462928 | Using where; Using index; Using join buffer (Block Nested Loop) |
|  1 | PRIMARY            | ls                  | index  | NULL                                                               | updated       | 1       | NULL                                                                                                                   | 102462928 | Using where; Using index; Using join buffer (Block Nested Loop) |
|  3 | DEPENDENT SUBQUERY | _scan               | ref    | PRIMARY                                                            | PRIMARY       | 28      | odms._piece_detail.zip,odms._piece_detail.zip_4,odms._piece_detail.zip_delivery_point,odms._piece_detail.serial_number |         1 | Using where; Using index; Using filesort                        |
|  2 | DEPENDENT SUBQUERY | _scan               | ref    | PRIMARY                                                            | PRIMARY       | 28      | odms._piece_detail.zip,odms._piece_detail.zip_4,odms._piece_detail.zip_delivery_point,odms._piece_detail.serial_number |         1 | Using where; Using index; Using filesort                        |
+----+--------------------+---------------------+--------+--------------------------------------------------------------------+---------------+---------+------------------------------------------------------------------------------------------------------------------------+-----------+-----------------------------------------------------------------+

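Now, looking at the EXPLAIN output generated by each, I really can't tell which one gives me the best bang for my buck. The first one shows fewer total rows when multiplying the rows column, but the second appears to execute a bit quicker.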
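For reference, the "multiply the rows column" comparison can be worked through directly from the two EXPLAIN tables above. This is only a rough sketch: the rows values are optimizer estimates, so the products indicate the relative order of magnitude and not actual run time, which may be why the measured timings disagree with them.

```sql
-- First plan, DERIVED branch: 87 (_header) * 9703 (_piece_detail),
-- with every other table estimated at 1 row per lookup.
SELECT 87 * 9703 AS plan1_estimated_rows;  -- 844161, the figure shown for <derived2>

-- Second plan: the same 87 * 9703, multiplied by the two full index scans
-- of _scan (fs and ls) at 102462928 rows each.
-- The E0 suffix makes the literals DOUBLE so the product does not overflow BIGINT.
SELECT 87 * 9703 * 102462928E0 * 102462928E0 AS plan2_estimated_rows;  -- roughly 8.9e21
```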

Is there anything I could do to achieve the same results while improving performance by modifying the query structure?

Hit*_*ony 0

Why not use a subquery for each join, including the inner joins?

-- field1..field3 stand for the columns you actually need from each table.
-- MySQL requires every derived table to have an alias.
INNER JOIN (SELECT field1, field2, field3 from _container_quantity order by 1,2,3) AS _container_quantity
    ON _piece_detail.cqt_database_id = _container_quantity.cqt_database_id 
    AND _piece_detail.job_id = _container_quantity.job_id
INNER JOIN (SELECT field1, field2, field3 from _container_summary order by 1,2,3) AS _container_summary
    ON _container_quantity.container_id = _container_summary.container_id 
    AND _container_summary.job_id = _container_quantity.job_id

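By not limiting what you select in those inner joins, you are definitely pulling a lot into memory. By using order by 1,2,3 at the end of each subquery, you create an index on each subquery. Your only index is on header, and you are not joining on _headers....

A couple of suggestions for optimizing this query: either create the indexes you need on each table, or use the subquery join clauses to create the indexes you need manually, on the fly.

Also keep in mind that when you do a left join against a "temporary" table full of aggregates, you are just asking for performance problems.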
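One way to avoid joining against an unindexed derived table of aggregates is to materialize the aggregates into a keyed temporary table and update from that, one job_id at a time. The following is a minimal sketch under stated assumptions, not a tested solution: `@job_id` and `tmp_scan_range` are hypothetical names, the mailing_groups join chain from the question is left out for brevity and would have to be added to the SELECT, and the WHERE clause follows the condition as stated in the question's prose (stored first scan later than the earliest scan, or stored latest scan earlier than the latest scan).

```sql
-- Placeholder value; the job_id values would be looped over in application code.
SET @job_id = 1;

DROP TEMPORARY TABLE IF EXISTS tmp_scan_range;

-- Materialize MIN/MAX per piece for a single job, with a key the UPDATE can join on.
-- Column types are inherited from the SELECT.
CREATE TEMPORARY TABLE tmp_scan_range (
    UNIQUE KEY (job_id, piece_id)
)
SELECT pd.job_id,
       pd.piece_id,
       MIN(s.scan_date_time) AS first_scan_date_time,
       MAX(s.scan_date_time) AS latest_scan_date_time
FROM _piece_detail AS pd
    INNER JOIN _scan AS s
        ON s.zip = pd.zip
        AND s.zip_4 = pd.zip_4
        AND s.zip_delivery_point = pd.zip_delivery_point
        AND s.serial_number = pd.serial_number
WHERE pd.job_id = @job_id
GROUP BY pd.job_id, pd.piece_id;

-- Update only the rows of this job whose stored range is narrower than the scans show.
UPDATE _piece_detail AS pd
    INNER JOIN tmp_scan_range AS t
        ON t.job_id = pd.job_id
        AND t.piece_id = pd.piece_id
SET pd.first_scan_date_time = t.first_scan_date_time,
    pd.latest_scan_date_time = t.latest_scan_date_time
WHERE pd.job_id = @job_id
    AND (pd.first_scan_date_time > t.first_scan_date_time
         OR pd.latest_scan_date_time < t.latest_scan_date_time);
```

Working per job_id keeps each temporary table and each UPDATE small, at the cost of running the pair of statements once per job.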

Contains at least one matching _scan record on (zip, zip_4, zip_delivery_point, serial_number)

Hmm... this is the first thing you want to do, yet none of those fields are indexed?
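Whether those four columns are indexed can be checked directly. In the EXPLAIN output in the question, _scan is accessed through PRIMARY with a ref on exactly these four columns, so an index may already cover them. A minimal sketch for confirming this, where the literal values are placeholders and the column types are assumed:

```sql
-- List the indexes defined on _scan, including the column order of each.
SHOW INDEX FROM _scan;

-- Check which index, if any, the four-column match would use.
-- The literal values are placeholders only.
EXPLAIN
SELECT MIN(scan_date_time), MAX(scan_date_time)
FROM _scan
WHERE zip = '00000'
    AND zip_4 = '0000'
    AND zip_delivery_point = '00'
    AND serial_number = '000000000';
```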