Indexing/optimization for a query with IN, JOIN, GROUP BY, and ORDER BY

Par*_*lar 7 mysql performance query-performance

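I am working on a query that needs IN, BETWEEN, GROUP BY, JOIN, and ORDER BY all at once. I am struggling with its performance, so I need help choosing indexes, or changing the table structure if indexes do not help.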

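Some notes

  1. Both tables below have millions of rows.
  2. Users can filter the list by name, age, gender, etc.
  3. Users can sort the list by metrics such as age, visits_count, etc.
  4. The list needs to be paginated.

Table structure

Table 1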

CREATE TABLE `table_1` (
  `visitor_id` varchar(32) CHARACTER SET ascii NOT NULL,
  `name` varchar(200) NOT NULL,
  `gender` varchar(1) NOT NULL DEFAULT 'M',
  `mobile_number` int(10) unsigned DEFAULT NULL,
  `age` tinyint(1) unsigned NOT NULL DEFAULT '1',
  `visits_count` mediumint(5) unsigned NOT NULL DEFAULT '0',
  PRIMARY KEY (`visitor_id`),
  KEY `indx_t1_test` (`visitor_id`,`visits_count`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8

Table 2

CREATE TABLE `table_2` (
  `company_id` bigint(20) unsigned NOT NULL,
  `visitor_id` varchar(32) CHARACTER SET ascii NOT NULL,
  `time_duration` mediumint(5) unsigned NOT NULL DEFAULT '0',
  `visited_on` date NOT NULL,
  PRIMARY KEY (`company_id`,`visitor_id`,`visited_on`),
  KEY `indx_t2_test` (`visited_on`,`company_id`,`visitor_id`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8

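The most basic data I want to retrieve

I want to get 20 (the pagination part) unique visitors (the GROUP BY / DISTINCT part) who visited a specific group of companies (the IN part) during a selected time period (the BETWEEN part), ordered by age (the ORDER BY part).

Query 1

The first query I wrote for this is: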

SELECT
    t1.visitor_id
FROM table_1 AS t1
INNER JOIN table_2 AS t2 ON t2.visitor_id = t1.visitor_id
WHERE
    t2.company_id IN (528,211,1275,521,1299,493,492,852,868,869,1235,486,485,1238,855,1237,651,538,1241,1240,548,543,1247,1253,490,468,582,583,569,477,488,802,1294,518,1274,476,545,1267,556,479,1266,1265,541,1189,1263,1152,1260,478,1257,885,1139,1256,804,708,547,561,1239,1142,1226,1148,1230,529,1223,1192,1191,874,830,822,818,817,794,718,487,709,706,705,669,513,455) AND
    t2.visited_on BETWEEN '2015-01-01' AND '2017-01-31'
GROUP BY t1.visitor_id
ORDER BY t1.`visits_count` DESC
LIMIT 20;

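When I run this query for any single company, it returns data fast enough (performance is good when the number of matching rows is small).

The problem appears when the number of companies in the IN part grows (I need to support up to 100 companies in this part of the query): the query then takes about 36 seconds to return the result.

EXPLAIN output for this query (posted as two screenshots, not reproduced here).

Query 2

The second query I could think of for the same scenario looks like this: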

SELECT
(
    SELECT  
        t2.visitor_id
    FROM table_2 AS t2
    WHERE 
        t2.company_id IN (528,211,1275,521,1299,493,492,852,868,869,1235,486,485,1238,855,1237,651,538,1241,1240,548,543,1247,1253,490,468,582,583,569,477,488,802,1294,518,1274,476,545,1267,556,479,1266,1265,541,1189,1263,1152,1260,478,1257,885,1139,1256,804,708,547,561,1239,1142,1226,1148,1230,529,1223,1192,1191,874,830,822,818,817,794,718,487,709,706,705,669,513,455)
        AND t2.visitor_id = t1.`visitor_id`
        AND t2.visited_on BETWEEN '2015-01-01' AND '2017-01-31'
    LIMIT 1
) AS visitor_id
FROM `table_1` AS t1
HAVING visitor_id IS NOT NULL
ORDER BY t1.`visits_count` DESC
LIMIT 0, 20

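This query behaves the opposite way from the first one. If I run it for a company with very few visitors (a single company in the IN part, with only 3-4 visitors), performance is very poor (about 38 seconds). When the IN part contains many companies it returns results faster than for a single company (about 13 seconds), but that is still not usable.

EXPLAIN output for this query (posted as a screenshot, not reproduced here).

Query 3

To eliminate the IN part of the query, I created a temporary table, added the company ids to it, and then used a JOIN: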

SELECT
    DISTINCT
    t1.visitor_id
FROM `table_1` AS t1
INNER JOIN `table_2` AS t2 ON t1.`visitor_id` = t2.visitor_id
INNER JOIN temp_table AS t3 ON t3.company_id = t2.company_id
ORDER BY t1.`visits_count` DESC
LIMIT 0, 20;

This query also takes up to 22 seconds. I need this list to return in 2-3 seconds.
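The definition of temp_table used in Query 3 is not shown. A minimal sketch of how such a table might be created and filled, assuming a single company_id column typed to match table_2.company_id (the column type, the primary key, and the shortened id list are assumptions for illustration):

```sql
-- Hypothetical definition: one row per company id to filter on.
CREATE TEMPORARY TABLE temp_table (
    company_id BIGINT UNSIGNED NOT NULL,
    PRIMARY KEY (company_id)
) ENGINE=InnoDB;

-- Load the ids that previously appeared in the IN (...) list (shortened here).
INSERT INTO temp_table (company_id)
VALUES (528), (211), (1275), (521), (1299);
```

Note that Query 3 as written has no condition on visited_on, so it is not strictly equivalent to Queries 1 and 2.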

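Additional information

  • innodb_buffer_pool_size is 12 GB
  • RAM is 30 GB
  • I am using an AWS RDS db.r3.xlarge instance
  • SHOW TABLE STATUS output (posted as a screenshot, not reproduced here)

    1. The query SELECT COUNT(*) FROM table_2 WHERE company_id IN (...) AND visited_on BETWEEN '2015-01-01' AND '2017-01-31' returns 2660123.

    2. The query is slow only the first time. If I run the same query again, it is much faster (0.2 seconds). However, if I change the limit to LIMIT 20, 20, it again takes 24 seconds on the first run, and the same query is faster on the second run. This is probably due to innodb_buffer_pool_size.

    3. The output of EXPLAIN FORMAT=JSON SELECT ...; is as follows.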

      {
      "query_block": {
      "select_id": 1,
      "ordering_operation": {
        "using_filesort": true,
        "grouping_operation": {
          "using_temporary_table": true,
          "using_filesort": false,
          "nested_loop": [
            {
              "table": {
                "table_name": "t2",
                "access_type": "range",
                "possible_keys": [
                  "PRIMARY",
                  "indx_t2_test"
                ],
                "key": "PRIMARY",
                "used_key_parts": [
                  "company_id"
                ],
                "key_length": "8",
                "rows": 17301,
                "filtered": 100,
                "using_index": true,
                "attached_condition": "((`db`.`t2`.`company_id` in (528,211,1275,521,1299,493,492,852,868,869,1235,486,485,1238,855,1237,651,538,1241,1240,548,543,1247,1253,490,468,582,583,569,477,488,802,1294,518,1274,476,545,1267,556,479,1266,1265,541,1189,1263,1152,1260,478,1257,885,1139,1256,804,708,547,561,1239,1142,1226,1148,1230,529,1223,1192,1191,874,830,822,818,817,794,718,487,709,706,705,669,513,455)) and (`db`.`t2`.`visited_on` between '2015-01-01' and '2017-01-31'))"
              }
            },
            {
              "table": {
                "table_name": "t1",
                "access_type": "eq_ref",
                "possible_keys": [
                  "PRIMARY",
                  "indx_t1_test"
                ],
                "key": "PRIMARY",
                "used_key_parts": [
                  "visitor_id"
                ],
                "key_length": "34",
                "ref": [
                  "db.t2.visitor_id"
                ],
                "rows": 1,
                "filtered": 100
              }
            }
          ]
        }
      }
      }
      }
      

Output for the query suggested by Rick James:

SELECT
    t2.visitor_id
FROM (
    SELECT
        DISTINCT visitor_id
    FROM table_2
    WHERE 
        company_id IN (528,211,1275,521,1299,493,492,852, 868,
                        869,1235,486,485,1238,855,1237,651,538,1241,1240, 548,
                        543,1247,1253,490,468,582,583,569,477,488,802,1294, 518,
                        1274,476,545,1267,556,479,1266,1265,541,1189,1263, 1152,
                        1260,478,1257,885,1139,1256,804,708,547,561,1239, 1142,
                        1226,1148,1230,529,1223,1192,1191,874,830,822,818, 817,
                        794,718,487,709,706,705,669,513,455)
        AND visited_on BETWEEN '2015-01-01' AND '2017-01-31'
) AS t2
INNER JOIN table_1 AS t1 ON t2.visitor_id = t1.visitor_id
ORDER BY t1.`visits_count` DESC
LIMIT 20;

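EXPLAIN output for this query (posted as a screenshot, not reproduced here).

This query takes 58 seconds.

EXPLAIN output for the inner subquery alone (posted as screenshots, not reproduced here).

Query: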

SELECT
    COUNT(DISTINCT company_id, visited_on, visitor_id), 
    COUNT(DISTINCT company_id, LEFT(visited_on, 7), visitor_id), 
    COUNT(*) 
FROM table_2;

Result:

  • COUNT(DISTINCT company_id, visited_on, visitor_id) = 7607938
  • COUNT(DISTINCT company_id, LEFT(visited_on, 7), visitor_id) = 5194480
  • COUNT(*) = 7607938

Note that this output is from the latest data, so the COUNT(*) row count may have increased.

Ric*_*mes 2

age int(3) unsigned: this allows you to store ages up to 4 billion and takes 4 bytes. Change it to TINYINT UNSIGNED (1 byte).
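A sketch of that change, assuming the column should keep the NOT NULL DEFAULT 1 attributes shown in the question's table definition:

```sql
-- TINYINT UNSIGNED stores 0..255 in a single byte, which is ample for an age.
ALTER TABLE table_1
    MODIFY COLUMN age TINYINT UNSIGNED NOT NULL DEFAULT 1;
```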

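Names in ASCII? US only? Even so, that disallows some unusual names.

I am puzzled by the PRIMARY KEY of t2. Since a PK is unique, it does not allow recording more than one visit by a person to a company. If that restriction is OK, add the following (in case the optimizer decides that the date range is the best filter):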

INDEX(visited_on, company_id, visitor_id)

If my hunch is right, change the PK and add the index:

PRIMARY KEY(`company_id`, `visitor_id`, visited_on),
INDEX(visited_on, company_id, visitor_id)
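On an existing table, the key definitions above could be applied with a single ALTER TABLE. This is only a sketch, and it assumes the table does not already have these keys; the index name indx_t2_test is borrowed from the question's table definition, which already lists both keys, so the statement would be needed only on a table that lacks them:

```sql
-- Changing the primary key rebuilds the InnoDB table,
-- so expect this to take a while on millions of rows.
ALTER TABLE table_2
    DROP PRIMARY KEY,
    ADD PRIMARY KEY (company_id, visitor_id, visited_on),
    ADD INDEX indx_t2_test (visited_on, company_id, visitor_id);
```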

Then check your various queries.
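One way to do that check is to compare the plans MySQL produces when each index is hinted in turn. The sketch below uses the inner query from the question with a shortened id list; the hints are meant for experimentation only, and indx_t2_test is the index name from the question's table definition:

```sql
-- Plan when the date-leading secondary index is used.
EXPLAIN
SELECT DISTINCT visitor_id
FROM table_2 FORCE INDEX (indx_t2_test)
WHERE company_id IN (528, 211, 1275)
  AND visited_on BETWEEN '2015-01-01' AND '2017-01-31';

-- Plan when the company-leading primary key is used.
EXPLAIN
SELECT DISTINCT visitor_id
FROM table_2 FORCE INDEX (PRIMARY)
WHERE company_id IN (528, 211, 1275)
  AND visited_on BETWEEN '2015-01-01' AND '2017-01-31';
```

Comparing the key and rows columns of the two plans, and then timing both variants without EXPLAIN, should indicate which index suits the data better.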