MySQL: inner join takes 3 minutes

hea*_*ing 5 mysql performance

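I asked a similar question on SO for Postgres, and now I have the same problem with MySQL.

I have two tables:

Table A: 1MM rows, AsOfDate, Id, BId (foreign key to table B)

Table B: 50k rows, Id, Flag, ValidFrom, ValidTo

Table A contains multiple records per day between 2011/01/01 and 2011/12/31, spread across 100 BIds. Table B contains multiple non-overlapping (between ValidFrom and ValidTo) records for those 100 BIds.

The task of the join is to return the flag that was active for a BId on the given AsOfDate.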

select 
    a.AsOfDate, b.Flag 
from 
    A a inner Join B b on 
        a.BId = b.Id and b.ValidFrom <= a.AsOfDate and b.ValidTo >= a.AsOfDate
where
    a.AsOfDate >= 20110101 and a.AsOfDate <= 20111231

On a very high-end server (3GHz+) with 64GB of RAM, this query takes more than 3 minutes.

+-------+-------------------------+
| Table | Create Table            |
+-------+-------------------------+
| a     | CREATE TABLE `a` (
  `asofdate` int(4) NOT NULL,
  `bid` int(4) NOT NULL,
  KEY `asofdate_bid` (`asofdate`,`bid`),
  KEY `bid` (`bid`),
  KEY `bid_asofdate` (`bid`,`asofdate`)
) ENGINE=InnoDB DEFAULT CHARSET=latin1 |
+-------+-------------------------+

+-------+-------------------------+
| Table | Create Table            |
+-------+-------------------------+
| b     | CREATE TABLE `b` (
  `key` int(4) NOT NULL,
  `id` int(4) NOT NULL,
  `flag` char(1) NOT NULL,
  `validfrom` int(4) NOT NULL,
  `validto` int(4) NOT NULL,
  KEY `id` (`id`),
  KEY `validfrom` (`validfrom`),
  KEY `validfrom_id` (`validfrom`,`id`),
  KEY `id_validfrom` (`id`,`validfrom`)
) ENGINE=InnoDB DEFAULT CHARSET=latin1 |
+-------+-------------------------+

Here is the EXPLAIN:

mysql> explain select count(1) from a a inner join b b on a.bid = b.id and b.validfrom <= a.asofdate and b.validto >= a.asofdate where a.asofdate >= 20120101 and a.asofdate <= 20121231;

+----+-------------+-------+------+----------------------------------------+--------------+---------+----------+-------+--------------------------+
| id | select_type | table | type | possible_keys                          | key          | key_len | ref      | rows  | Extra                    |
+----+-------------+-------+------+----------------------------------------+--------------+---------+----------+-------+--------------------------+
|  1 | SIMPLE      | b     | ALL  | id,validfrom,validfrom_id,id_validfrom | NULL         | NULL    | NULL     | 50510 |                          |
|  1 | SIMPLE      | a     | ref  | asofdate_bid,bid,bid_asofdate          | bid_asofdate | 4       | foo.b.id |  1433 | Using where; Using index |
+----+-------------+-------+------+----------------------------------------+--------------+---------+----------+-------+--------------------------+

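SQL Server Express and Postgres take about 300 ms to execute the above query. I am deciding on a multi-TB installation, and at the moment this is not looking good for MySQL (my preferred DB)!

Execution plans for the suggested queries

Moving the range conditions out of the join clause (3 minutes):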

mysql> EXPLAIN SELECT count(1) FROM a a 
    -> INNER JOIN b b ON a.bid = b.id 
    -> WHERE (a.asofdate >= 20120101 and a.asofdate <= 20121231) 
    ->  AND (b.validfrom <= a.asofdate AND b.validto >= a.asofdate);
+----+-------------+-------+------+----------------------------------------+--------------+---------+----------+-------+--------------------------+
| id | select_type | table | type | possible_keys                          | key          | key_len | ref      | rows  | Extra                    |
+----+-------------+-------+------+----------------------------------------+--------------+---------+----------+-------+--------------------------+
|  1 | SIMPLE      | b     | ALL  | id,validfrom,validfrom_id,id_validfrom | NULL         | NULL    | NULL     | 50510 |                          |
|  1 | SIMPLE      | a     | ref  | asofdate_bid,bid,bid_asofdate          | bid_asofdate | 4       | foo.b.id |  1433 | Using where; Using index |
+----+-------------+-------+------+----------------------------------------+--------------+---------+----------+-------+--------------------------+
2 rows in set (0.02 sec)

Using STRAIGHT_JOIN actually changes the query plan and pushes the time up to 6 minutes:

mysql> EXPLAIN SELECT count(1) FROM a a  STRAIGHT_JOIN b b ON a.bid = b.id  WHERE (a.asofdate >= 20120101 and a.asofdate <= 20121231)   AND (b.validfrom <= a.asofdate AND b.validto >= a.asofdate);
+----+-------------+-------+-------+----------------------------------------+--------------+---------+-----------+--------+--------------------------+
| id | select_type | table | type  | possible_keys                          | key          | key_len | ref       | rows   | Extra                    |
+----+-------------+-------+-------+----------------------------------------+--------------+---------+-----------+--------+--------------------------+
|  1 | SIMPLE      | a     | range | asofdate_bid,bid,bid_asofdate          | asofdate_bid | 4       | NULL      | 500296 | Using where; Using index |
|  1 | SIMPLE      | b     | ref   | id,validfrom,validfrom_id,id_validfrom | id           | 4       | foo.a.bid |    255 | Using where              |
+----+-------------+-------+-------+----------------------------------------+--------------+---------+-----------+--------+--------------------------+

Rol*_*DBA 5

Here is your original query:

select 
    a.AsOfDate, b.Flag 
from 
    A a inner Join B b on 
        a.BId = b.Id and b.ValidFrom <= a.AsOfDate and b.ValidTo >= a.AsOfDate
where
    a.AsOfDate >= 20110101 and a.AsOfDate <= 20111231

In this case I would suggest refactoring your query:

select 
    a.AsOfDate, b.Flag 
from
    (
        select * from A
        WHERE AsOfDate >= 20110101
        AND AsOfDate <= 20111231
    ) a INNER JOIN B b ON a.bid=b.id
    AND b.validfrom <= a.asofdate
    AND b.validto   >= a.asofdate
;

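This way, the date range on the A side (20110101 - 20111231) is processed first, before the JOIN. The other benefit of the refactored query is that the JOIN of A and B involves a smaller subset of A.

If you are uncomfortable with the refactored query, here is another suggestion: swap the range-based conditions between the WHERE and JOIN clauses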

select 
    a.AsOfDate, b.Flag 
from 
    A a inner Join B b on 
        a.BId = b.Id and a.AsOfDate >= 20110101 and a.AsOfDate <= 20111231
where
    b.ValidFrom <= a.AsOfDate and b.ValidTo >= a.AsOfDate

Give it a try!!!
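
One way to check whether a rewrite actually changes anything is to prefix it with EXPLAIN and compare the result with the plan posted in the question. A minimal sketch for the subquery version, reusing the table names and date range from above:

```sql
-- Plan for the subquery rewrite. Compare the table order and the
-- `type`, `key`, and `rows` columns with the EXPLAIN in the question.
EXPLAIN
SELECT a.AsOfDate, b.Flag
FROM (
    SELECT * FROM A
    WHERE AsOfDate >= 20110101
      AND AsOfDate <= 20111231
) a
INNER JOIN B b
    ON  a.BId = b.Id
    AND b.ValidFrom <= a.AsOfDate
    AND b.ValidTo   >= a.AsOfDate;
```

If the plan still shows a full scan of b (type ALL, key NULL), the rewrite has not changed how the join is executed.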


Der*_*ney 2

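My guess is that your join condition is confusing the MySQL optimizer; as the EXPLAIN shows, it is reading the entire b table. What does this give you: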

EXPLAIN SELECT count(1) FROM a a 
INNER JOIN b b ON a.bid = b.id 
WHERE (a.asofdate >= 20120101 and a.asofdate <= 20121231) 
 AND (b.validfrom <= a.asofdate AND b.validto >= a.asofdate);

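Side note: you should not need KEY (bid) on table A, because KEY bid_asofdate (bid, asofdate) already covers lookups on bid, and given the way InnoDB handles indexes, the extra key just takes up more space than necessary.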
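
As a sketch, dropping the redundant index would look like this, assuming the index names from the SHOW CREATE TABLE output in the question:

```sql
-- `bid` is a left prefix of `bid_asofdate` (bid, asofdate),
-- so the single-column index adds nothing for lookups on bid.
ALTER TABLE a DROP INDEX bid;
```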

Some further ramblings on indexes. Why is there no primary key defined on either table? I would update your b table like this:

CREATE TABLE `b` (
  `key` int(4) NOT NULL PRIMARY KEY,
  `id` int(4) NOT NULL,
  `flag` char(1) NOT NULL,
  `validfrom` int(4) NOT NULL,
  `validto` int(4) NOT NULL,
  KEY `validfrom_id` (`validfrom`,`id`),
  KEY `id_validfrom_validto` (`id`,`validfrom`, `validto`)
) ENGINE=InnoDB DEFAULT CHARSET=latin1

That assumes id is not actually the primary key and that key is actually useful :)
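
If recreating the table is inconvenient, roughly the same layout can be reached in place. This is only a sketch: it assumes the existing index names from the question and that the values in `key` are unique, otherwise adding the primary key will fail.

```sql
-- Bring the existing table b in line with the definition above.
ALTER TABLE b
    ADD PRIMARY KEY (`key`),         -- assumes `key` has no duplicates
    DROP INDEX `id`,                 -- covered by the new composite index
    DROP INDEX `validfrom`,          -- left prefix of validfrom_id
    DROP INDEX `id_validfrom`,       -- replaced by the wider index below
    ADD INDEX `id_validfrom_validto` (`id`, `validfrom`, `validto`);
```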

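  • @headsling I have always found it best to set up tests that are as close to the real scenario as possible. This is especially true where performance is concerned. (4 upvotes)
  • Totally agree with @dezso. I would not bother spending time on a problem that involves tables without primary keys. For InnoDB, the choice of clustered index (usually the primary key) is crucial to the performance and execution plans of many queries. (4 upvotes)
  • @headsling If I were you, I would try the same thing with PKs instead of simply disagreeing. Only then will you know whether they are irrelevant. (If everyone else seems to be driving on the wrong side of the highway, what would you suspect?) (4 upvotes)
  • I doubt the MySQL optimizer would be confused by the `join` in this case (only 2 tables, `inner join`, no `OR`); unless there is a serious bug in the optimizer, it should produce the same plan regardless of the order and placement of the conditions... (2 upvotes)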