Performance problems with groupwise MAX in a JOIN clause

Hen*_*los 5 mysql innodb index mysql-5.7

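Question

My application has some assets that are updated asynchronously from time to time.

The example I'll use here is Vehicles. There are two tables:

  • Vehicles holds information about the vehicle itself
  • VehicleUpdates holds information about every update that has happened to that vehicle.

The relevant parts of the table structures are: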

CREATE TABLE `Vehicles` (
  `id` varchar(50) NOT NULL,
  `organizationId` varchar(50) NOT NULL,
  `plate` char(7) NOT NULL,
  `vehicleInfo` json DEFAULT NULL,
  `createdAt` timestamp NOT NULL DEFAULT CURRENT_TIMESTAMP,
  `updatedAt` timestamp NULL DEFAULT NULL ON UPDATE CURRENT_TIMESTAMP,
  PRIMARY KEY (`id`),
  UNIQUE KEY `unq_Vehicles_orgId_plate_idx` (`organizationId`,`plate`) USING BTREE,
  KEY `Vehicles_createdAt_idx` (`createdAt`)
);

CREATE TABLE `VehicleUpdates` (
  `id` varchar(50) NOT NULL,
  `organizationId` varchar(50) NOT NULL,
  `vehiclePlate` char(7) NOT NULL,
  `status` varchar(15) NOT NULL,
  `createdAt` timestamp NOT NULL DEFAULT CURRENT_TIMESTAMP,
  `updatedAt` timestamp NULL DEFAULT NULL ON UPDATE CURRENT_TIMESTAMP,
  PRIMARY KEY (`id`),
  KEY `VehicleUpdates_orgId_vhclPlt_createdAt_idx` (`organizationId`,`vehiclePlate`,`createdAt`) USING BTREE
);

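Now I have a new requirement: I have to return the latest update information along with the vehicle information itself.

Groupwise MAX solutions

After some digging, I found this blog post. I then tried the suggested "uncorrelated subquery" approach, since it was described as the best one:

Uncorrelated subquery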

SELECT  vu1.*
    FROM VehicleUpdates AS vu1
    JOIN
      ( SELECT  vehiclePlate, organizationId, MAX(createdAt) AS createdAt
            FROM  VehicleUpdates
            GROUP BY organizationId, vehiclePlate
      ) AS vu2 USING (organizationId, vehiclePlate, createdAt);

This query takes 275 ms on average in my production database.

I considered that too slow, so I decided to try the "LEFT JOIN" approach:

Dud: LEFT JOIN

SELECT  vu1.*
    FROM  VehicleUpdates AS vu1
    LEFT JOIN  VehicleUpdates AS vu2 ON vu1.organizationId = vu2.organizationId and vu1.vehiclePlate = vu2.vehiclePlate
      AND  vu2.createdAt > vu1.createdAt
    WHERE  vu2.id IS NULL;

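This one performs better, averaging 40 ms. Good enough for me.

Then I needed to run this query as part of a query on the Vehicles table.

Current results

The following query returns what I need: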

SELECT  v.*, vu1.*
FROM  Vehicles AS v
LEFT JOIN VehicleUpdates AS vu1 
    ON v.plate = vu1.vehiclePlate 
        AND v.organizationId = vu1.organizationId
LEFT JOIN  VehicleUpdates AS vu2 
    ON vu1.organizationId = vu2.organizationId 
        AND vu1.vehiclePlate = vu2.vehiclePlate
        AND  vu2.createdAt > vu1.createdAt
WHERE  vu2.id IS NULL;

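The problem is that it takes 20 s (!) to run. Big problem!

However, I never run a full table scan in production. The query is always restricted to a single organizationId and is paginated, so it returns at most 100 rows per page. So I ran the following query: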

SELECT  v.*, vu1.*
FROM  Vehicles AS v
LEFT JOIN VehicleUpdates AS vu1 
    ON v.plate = vu1.vehiclePlate 
        AND v.organizationId = vu1.organizationId
LEFT JOIN  VehicleUpdates AS vu2 
    ON vu1.organizationId = vu2.organizationId 
        AND vu1.vehiclePlate = vu2.vehiclePlate
        AND  vu2.createdAt > vu1.createdAt
WHERE vu2.id IS NULL
    and v.organizationId = '<some organization ID>'
LIMIT 100;

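Now it takes from 750 ms to 11 s to run, depending on the number of vehicles associated with the organization. Still not good enough.

Running EXPLAIN on the query above gives me: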

"select_type" | "table" | "type" | "possible_keys"                                        | "key"                                      | "key_len" | "ref"                               | "rows" | "filtered" | "Extra"
SIMPLE        | v       | ref    | unq_Vehicles_orgId_plate_idx,Vehicles_orgId_status_idx | unq_Vehicles_orgId_plate_idx               | "202"     | const                               | 30     | 100        |
SIMPLE        | vu1     | ALL    |                                                        |                                            |           |                                     | 263171 | 100        | Using where; Using join buffer (Block Nested Loop)
SIMPLE        | vu2     | ref    | VehicleUpdates_orgId_vhclPlt_createdAt_idx             | VehicleUpdates_orgId_vhclPlt_createdAt_idx | "173"     | vu1.organizationId,vu1.vehiclePlate | 10     | 10         | Using where; Not exists; Using index

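What struck me is that vu1 is doing a full table scan, even though the leftmost table, Vehicles, is filtered by the indexed column organizationId, which is also indexed in VehicleUpdates.

So I decided to try the "uncorrelated subquery" again and ran: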

SELECT  v.*, vu.*
FROM  Vehicles AS v
LEFT JOIN (
    SELECT  vu1.*
        FROM VehicleUpdates AS vu1
        JOIN
          ( SELECT  vehiclePlate, organizationId, MAX(createdAt) AS createdAt
                FROM  VehicleUpdates
                GROUP BY organizationId, vehiclePlate
          ) AS vu2 USING (organizationId, vehiclePlate, createdAt)
) AS vu 
    ON vu.organizationId = v.organizationId 
        AND vu.vehiclePlate = v.plate
WHERE v.organizationId = '<SOME ORGANIZATION ID>'
LIMIT 100;

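This time the execution time ranges from 1.4 s to 13 s, depending on how many entries the Vehicles table has for the given organizationId. Unacceptable for my application.

Running EXPLAIN gives me: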

| "select_type" | "table"        | "type" | "possible_keys"                            | "key"                                      | "key_len" | "ref"                                             | "rows" | "filtered" | "Extra"
| PRIMARY       | v              | ALL    |                                            |                                            |           |                                                   | 14456  | 100        |
| PRIMARY       | <derived3>     | ALL    |                                            |                                            |           |                                                   | 29289  | 100        | Using where
| PRIMARY       | vu1            | ref    | VehicleUpdates_orgId_vhclPlt_createdAt_idx | VehicleUpdates_orgId_vhclPlt_createdAt_idx | "327"     | vu2.organizationId,vu2.vehiclePlate,vu2.createdAt | 1      | 100        | Using where
| DERIVED       | VehicleUpdates | range  | VehicleUpdates_orgId_vhclPlt_createdAt_idx | VehicleUpdates_orgId_vhclPlt_createdAt_idx | "323"     |                                                   | 29289  | 100        | Using index for group-by

Current results - UPDATED

I noticed that adding explicit organizationId clauses improves performance.

LEFT JOIN

Running:

SELECT  v.*, vu1.*
FROM  Vehicles AS v
LEFT JOIN VehicleUpdates AS vu1 
    ON v.plate = vu1.vehiclePlate 
        AND v.organizationId = vu1.organizationId
        AND vu1.organizationId = '<SOME ORGANIZATION ID>' -- <--------
LEFT JOIN  VehicleUpdates AS vu2 
    ON vu1.organizationId = vu2.organizationId 
        AND vu1.vehiclePlate = vu2.vehiclePlate
        AND vu2.createdAt > vu1.createdAt
WHERE vu2.id IS NULL
    and v.organizationId = '<SOME ORGANIZATION ID>' -- <-----------
LIMIT 100;

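I get execution times ranging from 65 ms (acceptable) to 2.5 s (unacceptable).

Uncorrelated subquery

Placing an organizationId = '<SOME ORGANIZATION ID>' clause in both the "main" query and the outer subquery of the join: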

SELECT  v.*, vu.*
FROM  Vehicles AS v
LEFT JOIN (
    SELECT  vu1.*
        FROM VehicleUpdates AS vu1
        JOIN
          ( SELECT  vehiclePlate, organizationId, MAX(createdAt) AS createdAt
                FROM  VehicleUpdates
                GROUP BY organizationId, vehiclePlate
          ) AS vu2 ON vu1.organizationId = vu2.organizationId
                and vu1.vehiclePlate = vu2.vehiclePlate
                and vu1.createdAt = vu2.createdAt
        WHERE vu1.organizationId = '<SOME ORGANIZATION ID>' -- <--------
    ) AS vu 
        ON vu.organizationId = v.organizationId 
            AND vu.vehiclePlate = v.plate
where
    v.organizationId = '<SOME ORGANIZATION ID>' -- <---------
LIMIT 100;

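I get execution times ranging from 450 ms (unacceptable) to 900 ms (unacceptable).

Placing an organizationId = '<SOME ORGANIZATION ID>' clause in both the "main" query and the inner subquery of the join: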

SELECT  v.*, vu.*
FROM  Vehicles AS v
LEFT JOIN (
    SELECT  vu1.*
        FROM VehicleUpdates AS vu1
        JOIN
          ( SELECT  vehiclePlate, organizationId, MAX(createdAt) AS createdAt
                FROM  VehicleUpdates
                WHERE organizationId = '<SOME ORGANIZATION ID>' -- <--------
                GROUP BY organizationId, vehiclePlate
          ) AS vu2 ON vu1.organizationId = vu2.organizationId
                and vu1.vehiclePlate = vu2.vehiclePlate
                and vu1.createdAt = vu2.createdAt
    ) AS vu 
        ON vu.organizationId = v.organizationId 
            AND vu.vehiclePlate = v.plate
where
    v.organizationId = '<SOME ORGANIZATION ID>' -- <---------
LIMIT 100;

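I get execution times ranging from 225 ms (acceptable) to 500 ms (unacceptable).

Is there a better way to handle this kind of query?

Database information

  • MySQL
  • Version: 5.7.23-log (Amazon RDS)
  • Engine: InnoDB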

Hen*_*los 2

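I feel so dumb! I just found the problem.

For some reason, Vehicles and VehicleUpdates had different character sets (utf8mb4 vs. utf8).

That's why the EXPLAIN result for the "uncorrelated subquery" approach showed a full table scan in one of its steps: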

| "select_type" | "table"        | "type" | "possible_keys"                            | "key"                                      | "key_len" | "ref"                                             | "rows" | "filtered" | "Extra"
| PRIMARY       | v              | ALL    |                                            |                                            |           |                                                   | 14456  | 100        |
| PRIMARY       | <derived3>     | ALL    |                                            |                                            |           |                                                   | 29289  | 100        | Using where
| PRIMARY       | vu1            | ref    | VehicleUpdates_orgId_vhclPlt_createdAt_idx | VehicleUpdates_orgId_vhclPlt_createdAt_idx | "327"     | vu2.organizationId,vu2.vehiclePlate,vu2.createdAt | 1      | 100        | Using where
| DERIVED       | VehicleUpdates | range  | VehicleUpdates_orgId_vhclPlt_createdAt_idx | VehicleUpdates_orgId_vhclPlt_createdAt_idx | "323"     |                                                   | 29289  | 100        | Using index for group-by
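
A mismatch like this can be checked directly rather than inferred from the plan. The sketch below is one way to do it, assuming both tables live in the current schema; the literal 'some-org-id' is a placeholder. The first statement lists the character set and collation of the join columns. The EXPLAIN / SHOW WARNINGS pair asks MySQL for the rewritten query, where a charset conversion on a join column typically shows up as a convert(... using utf8mb4) call.

```sql
-- Compare charset and collation of the columns used in the join.
SELECT TABLE_NAME, COLUMN_NAME, CHARACTER_SET_NAME, COLLATION_NAME
FROM information_schema.COLUMNS
WHERE TABLE_SCHEMA = DATABASE()
  AND ((TABLE_NAME = 'Vehicles' AND COLUMN_NAME IN ('organizationId', 'plate'))
    OR (TABLE_NAME = 'VehicleUpdates' AND COLUMN_NAME IN ('organizationId', 'vehiclePlate')));

-- Ask the optimizer for the rewritten statement ...
EXPLAIN
SELECT v.id, vu.id
FROM Vehicles AS v
LEFT JOIN VehicleUpdates AS vu
    ON vu.organizationId = v.organizationId
   AND vu.vehiclePlate = v.plate
WHERE v.organizationId = 'some-org-id';  -- placeholder value

-- ... and read it from the Note that EXPLAIN leaves behind.
SHOW WARNINGS;
```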

After converting VehicleUpdates to utf8mb4, the EXPLAIN result is:

| "select_type" | "table"        | "type" | "possible_keys"                            | "key"                                      | "key_len" | "ref"                                           | "rows" | "filtered" | "Extra"
| PRIMARY       | v              | ref    | Vehicles_orgId_status_idx                  | Vehicles_orgId_status_idx                  | "202"     | const                                           | 188    | 100        |
| PRIMARY       | <derived2>     | ref    | <auto_key1>                                | <auto_key1>                                | "230"     | v.plate,v.organizationId                        | 10     | 100        |
| PRIMARY       | vu1            | ref    | VehicleUpdates_orgId_vhclPlt_createdAt_idx | VehicleUpdates_orgId_vhclPlt_createdAt_idx | "234"     | v.organizationId,vu2.vehiclePlate,vu2.createdAt | 1      | 100        | Using where
| DERIVED       | VehicleUpdates | ref    | VehicleUpdates_orgId_vhclPlt_createdAt_idx | VehicleUpdates_orgId_vhclPlt_createdAt_idx | "202"     | const                                           | 24090  | 100        | Using where; Using index
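
The conversion itself can be done with a single statement. This is a sketch, not necessarily the exact command used here: the collation utf8mb4_general_ci is an assumption (it is the MySQL 5.7 default for utf8mb4) and should match whatever Vehicles uses. CONVERT TO rewrites every character column and rebuilds the table, so on a large table it is worth trying on a copy first.

```sql
-- Check what Vehicles currently uses, so the target collation matches it.
SHOW TABLE STATUS LIKE 'Vehicles';

-- Convert all character columns of VehicleUpdates and set the table default.
-- Assumption: utf8mb4_general_ci is the collation Vehicles uses.
ALTER TABLE VehicleUpdates
    CONVERT TO CHARACTER SET utf8mb4 COLLATE utf8mb4_general_ci;
```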

Likewise, the execution plan for the "LEFT JOIN" approach changed from:

| "select_type" | "table" | "type" | "possible_keys"                                        | "key"                                      | "key_len" | "ref"                               | "rows" | "filtered" | "Extra"
| SIMPLE        | v       | ref    | unq_Vehicles_orgId_plate_idx,Vehicles_orgId_status_idx | unq_Vehicles_orgId_plate_idx               | "202"     | const                               | 30     | 100        |
| SIMPLE        | vu1     | ALL    |                                                        |                                            |           |                                     | 263171 | 100        | Using where; Using join buffer (Block Nested Loop)
| SIMPLE        | vu2     | ref    | VehicleUpdates_orgId_vhclPlt_createdAt_idx             | VehicleUpdates_orgId_vhclPlt_createdAt_idx | "173"     | vu1.organizationId,vu1.vehiclePlate | 10     | 10         | Using where; Not exists; Using index

to:

| "select_type" | "table" | "type" | "possible_keys"                            | "key"                                      | "key_len" | "ref"                               | "rows" | "filtered" | "Extra"
| SIMPLE        | v       | ref    | Vehicles_orgId_status_idx                  | Vehicles_orgId_status_idx                  | "202"     | const                               | 188    | 100        |
| SIMPLE        | vu1     | ref    | VehicleUpdates_orgId_vhclPlt_createdAt_idx | VehicleUpdates_orgId_vhclPlt_createdAt_idx | "230"     | v.organizationId,v.plate            | 9      | 100        |
| SIMPLE        | vu2     | ref    | VehicleUpdates_orgId_vhclPlt_createdAt_idx | VehicleUpdates_orgId_vhclPlt_createdAt_idx | "230"     | vu1.organizationId,vu1.vehiclePlate | 9      | 10         | Using where; Not exists; Using index
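
The key_len column in these two LEFT JOIN plans reflects the change as well. As a rough check, assuming the column definitions shown in the question (organizationId varchar(50), vehiclePlate char(7), both NOT NULL), the two-column index prefix needs up to 3 bytes per character under utf8 and 4 under utf8mb4, plus 2 length bytes for the varchar:

```sql
SELECT 50 * 3 + 2 + 7 * 3 AS key_len_utf8,     -- 173, as in the vu2 row before the conversion
       50 * 4 + 2 + 7 * 4 AS key_len_utf8mb4;  -- 230, as in the vu1 and vu2 rows after it
```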

So the performance of the different queries is now:

LEFT JOIN

Always runs in under 50 ms.

Uncorrelated subquery without WHERE in the inner query:

Runs in 300 ms on average.

Uncorrelated subquery with WHERE in the inner query:

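Also always runs in under 50 ms.

I decided to stick with the "LEFT JOIN" approach because it lets me create a view representing the inner query, which simplifies the queries that return vehicles.

I can't do that with the "uncorrelated subquery" because it needs the WHERE organizationId = '<ORGANIZATION ID>' clause in the inner query, so a view wouldn't be as efficient.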
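
For illustration, such a view could look like the sketch below. The view name LatestVehicleUpdates and the literal 'some-org-id' are made up for this example; the view body is the "LEFT JOIN" query from the question. Because the view contains no aggregation, MySQL should be able to merge it into the outer query instead of materializing it, which is what keeps the organization filter effective.

```sql
-- Hypothetical view: the newest update(s) per (organizationId, vehiclePlate).
-- Ties on createdAt yield more than one row for that vehicle.
CREATE VIEW LatestVehicleUpdates AS
SELECT vu1.*
FROM VehicleUpdates AS vu1
LEFT JOIN VehicleUpdates AS vu2
    ON vu1.organizationId = vu2.organizationId
   AND vu1.vehiclePlate = vu2.vehiclePlate
   AND vu2.createdAt > vu1.createdAt
WHERE vu2.id IS NULL;

-- The vehicle query then only needs the organization filter on Vehicles.
SELECT v.*, lu.*
FROM Vehicles AS v
LEFT JOIN LatestVehicleUpdates AS lu
    ON lu.organizationId = v.organizationId
   AND lu.vehiclePlate = v.plate
WHERE v.organizationId = 'some-org-id'  -- placeholder value
LIMIT 100;
```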