Large INSERT INTO SELECT [..] FROM gets progressively slower

Dan*_*ett 3 mysql performance insert query-performance

I wrote a program that runs the INSERT in batches of 100,000 rows and displays its progress.

The source table contains 2.5 GB of data:

CREATE TABLE wikt.text (
    old_id INT(10) UNSIGNED NOT NULL AUTO_INCREMENT,
    old_text MEDIUMBLOB NOT NULL,
    old_flags TINYBLOB NOT NULL,

    PRIMARY KEY (old_id),
    KEY old_id (old_id)

)   ENGINE=INNODB
    AUTO_INCREMENT=23565544
    DEFAULT CHARSET=binary;

CREATE INDEX old_id ON text (old_id);

This is the target table:

CREATE TABLE domains.dictionary_language (
    text_id     INT(10) UNSIGNED NOT NULL AUTO_INCREMENT,
    english     TINYINT(1) UNSIGNED NOT NULL,

    PRIMARY KEY (text_id),
    KEY         english (english)

)   ENGINE=INNODB
    AUTO_INCREMENT=23565544;

This is the query, which is run in batches of 100k:

INSERT INTO domains.dictionary_language
    SELECT      old_id,
                IF(old_text LIKE '%==English==%', 1, 0)

    FROM        wikt.text

    LIMIT       {batch}, 100000;

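The query keeps getting slower. The first 1 million records were inserted in 8 minutes. After that, records #1.2m - #1.3m alone took 7 minutes. Now #2.3m - #2.4m has just finished in 15 minutes.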

Here is a log of how long each 100k batch took to import.

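As you can see, each of the first 12 batches (1.2 million records) was inserted in < 1 minute. After that, performance degrades, and almost every batch takes longer than the one before it!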

24/11/2013 19:18:40 Ready

24/11/2013 19:18:42 Dictionary import started from Wiktionary
24/11/2013 19:18:42 Records:    3,729,613
24/11/2013 19:18:42 Batches of: 100,000

24/11/2013 19:19:11 Batch 1 finished in 00:00:29.3146767
24/11/2013 19:19:33 Batch 2 finished in 00:00:22.2142706
24/11/2013 19:19:41 Batch 3 finished in 00:00:07.6104353
24/11/2013 19:19:53 Batch 4 finished in 00:00:12.7057267
24/11/2013 19:20:08 Batch 5 finished in 00:00:14.9248537
24/11/2013 19:20:25 Batch 6 finished in 00:00:16.9849715
24/11/2013 19:20:43 Batch 7 finished in 00:00:17.7930177
24/11/2013 19:20:49 Batch 8 finished in 00:00:06.2453572
24/11/2013 19:21:07 Batch 9 finished in 00:00:17.2549869
24/11/2013 19:21:38 Batch 10 finished in 00:00:31.4577993
24/11/2013 19:22:02 Batch 11 finished in 00:00:23.7003556
24/11/2013 19:22:17 Batch 12 finished in 00:00:15.4078813
24/11/2013 19:23:40 Batch 13 finished in 00:01:22.9637452
24/11/2013 19:25:25 Batch 14 finished in 00:01:44.8639979
24/11/2013 19:27:40 Batch 15 finished in 00:02:15.1387295
24/11/2013 19:30:07 Batch 16 finished in 00:02:26.7553939
24/11/2013 19:33:01 Batch 17 finished in 00:02:54.3109701
24/11/2013 19:36:17 Batch 18 finished in 00:03:15.8252006
24/11/2013 19:39:57 Batch 19 finished in 00:03:40.1275906
24/11/2013 19:44:28 Batch 20 finished in 00:04:30.3824650
24/11/2013 19:49:48 Batch 21 finished in 00:05:20.6873423
24/11/2013 19:55:45 Batch 22 finished in 00:05:56.7674059
24/11/2013 20:02:37 Batch 23 finished in 00:06:52.0925703
24/11/2013 20:10:32 Batch 24 finished in 00:07:54.8921622
24/11/2013 20:18:12 Batch 25 finished in 00:07:39.9433072
24/11/2013 20:26:34 Batch 26 finished in 00:08:21.4696824
24/11/2013 20:36:00 Batch 27 finished in 00:09:26.3163915
24/11/2013 20:45:07 Batch 28 finished in 00:09:07.1472950
24/11/2013 20:54:48 Batch 29 finished in 00:09:41.0222326
24/11/2013 21:04:19 Batch 30 finished in 00:09:31.2316726
24/11/2013 21:14:35 Batch 31 finished in 00:10:15.3521962
24/11/2013 21:25:10 Batch 32 finished in 00:10:34.9583176
24/11/2013 21:36:27 Batch 33 finished in 00:11:17.6047568
24/11/2013 21:47:52 Batch 34 finished in 00:11:24.3261412
24/11/2013 21:59:32 Batch 35 finished in 00:11:40.2410515
24/11/2013 22:12:31 Batch 36 finished in 00:12:59.1605654
24/11/2013 22:26:12 Batch 37 finished in 00:13:40.9209540
24/11/2013 22:40:12 Batch 38 finished in 00:13:59.8160347

24/11/2013 22:40:12 Dictionary import finished

Each 100k batch is slower than the last! Why? Here is the EXPLAIN output

The server itself has been tuned with a 4GB buffer pool, etc.

ype*_*eᵀᴹ 11

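Large offsets can have this effect. To serve LIMIT {batch}, 100000, MySQL has to read and discard the first {batch} rows before it returns anything, so every batch does more work than the one before it. I would try removing the offset and using only LIMIT 100000, starting each batch after the highest text_id already inserted: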

INSERT INTO domains.dictionary_language
  (text_id, english)
    SELECT      t.old_id,
                IF(t.old_text LIKE '%==English==%', 1, 0)

    FROM        wikt.text AS t
      JOIN      ( SELECT COALESCE(MAX(text_id), 0) AS offset
                  FROM domains.dictionary_language
                ) AS m
                ON  t.old_id > m.offset

    ORDER BY    t.old_id
    LIMIT       100000;