How to optimize COUNT and ORDER BY queries over millions of rows

Irf*_*gwb 14 mysql database database-administration

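I need help optimizing ORDER BY and COUNT queries. I have tables with millions (approx. 3 million) of rows.

I have to join 4 tables and fetch the records. When I run the simple query it takes only milliseconds to complete, but as soon as I try to count or order by while left-joining the tables, it hangs indefinitely.

Please see the cases below.

DB server configuration: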

CPU Number of virtual cores: 4
Memory(RAM): 16 GiB
Network Performance: High

Rows in each table:

tbl_customers - #Rows: 20 million
tbl_customers_address - #Rows: 25 million
tbl_shop_setting - #Rows: 50k
aio_customer_tracking - #Rows: 5k

Table schemas:

CREATE TABLE `tbl_customers` (
    `id` BIGINT(20) UNSIGNED NOT NULL AUTO_INCREMENT,
    `shopify_customer_id` BIGINT(20) UNSIGNED NOT NULL,
    `shop_id` BIGINT(20) UNSIGNED NOT NULL,
    `email` VARCHAR(225) NULL DEFAULT NULL COLLATE 'latin1_swedish_ci',
    `accepts_marketing` TINYINT(1) NULL DEFAULT NULL,
    `first_name` VARCHAR(50) NULL DEFAULT NULL COLLATE 'latin1_swedish_ci',
    `last_name` VARCHAR(50) NULL DEFAULT NULL COLLATE 'latin1_swedish_ci',
    `last_order_id` BIGINT(20) NULL DEFAULT NULL,
    `total_spent` DECIMAL(12,2) NULL DEFAULT NULL,
    `phone` VARCHAR(20) NULL DEFAULT NULL COLLATE 'latin1_swedish_ci',
    `verified_email` TINYINT(4) NULL DEFAULT NULL,
    `updated_at` DATETIME NULL DEFAULT NULL,
    `created_at` DATETIME NULL DEFAULT NULL,
    `date_updated` DATETIME NULL DEFAULT NULL,
    `date_created` DATETIME NULL DEFAULT NULL,
    PRIMARY KEY (`id`),
    UNIQUE INDEX `shopify_customer_id_unique` (`shopify_customer_id`),
    INDEX `email` (`email`),
    INDEX `shopify_customer_id` (`shopify_customer_id`),
    INDEX `shop_id` (`shop_id`)
)
COLLATE='utf8mb4_general_ci'
ENGINE=InnoDB;


CREATE TABLE `tbl_customers_address` (
    `id` BIGINT(20) NOT NULL AUTO_INCREMENT,
    `customer_id` BIGINT(20) NULL DEFAULT NULL,
    `shopify_address_id` BIGINT(20) NULL DEFAULT NULL,
    `shopify_customer_id` BIGINT(20) NULL DEFAULT NULL,
    `first_name` VARCHAR(50) NULL DEFAULT NULL,
    `last_name` VARCHAR(50) NULL DEFAULT NULL,
    `company` VARCHAR(50) NULL DEFAULT NULL,
    `address1` VARCHAR(250) NULL DEFAULT NULL,
    `address2` VARCHAR(250) NULL DEFAULT NULL,
    `city` VARCHAR(50) NULL DEFAULT NULL,
    `province` VARCHAR(50) NULL DEFAULT NULL,
    `country` VARCHAR(50) NULL DEFAULT NULL,
    `zip` VARCHAR(15) NULL DEFAULT NULL,
    `phone` VARCHAR(20) NULL DEFAULT NULL,
    `name` VARCHAR(50) NULL DEFAULT NULL,
    `province_code` VARCHAR(5) NULL DEFAULT NULL,
    `country_code` VARCHAR(5) NULL DEFAULT NULL,
    `country_name` VARCHAR(50) NULL DEFAULT NULL,
    `longitude` VARCHAR(250) NULL DEFAULT NULL,
    `latitude` VARCHAR(250) NULL DEFAULT NULL,
    `default` TINYINT(1) NULL DEFAULT NULL,
    `is_geo_fetched` TINYINT(1) NOT NULL DEFAULT '0',
    PRIMARY KEY (`id`),
    INDEX `customer_id` (`customer_id`),
    INDEX `shopify_address_id` (`shopify_address_id`),
    INDEX `shopify_customer_id` (`shopify_customer_id`)
)
COLLATE='latin1_swedish_ci'
ENGINE=InnoDB;

CREATE TABLE `tbl_shop_setting` (
    `id` INT(11) NOT NULL AUTO_INCREMENT,
    `shop_name` VARCHAR(300) NOT NULL COLLATE 'latin1_swedish_ci',
    PRIMARY KEY (`id`)
)
COLLATE='utf8mb4_general_ci'
ENGINE=InnoDB;


CREATE TABLE `aio_customer_tracking` (
    `id` BIGINT(20) UNSIGNED NOT NULL AUTO_INCREMENT,
    `shopify_customer_id` BIGINT(20) UNSIGNED NOT NULL,
    `email` VARCHAR(255) NULL DEFAULT NULL,
    `shop_id` BIGINT(20) UNSIGNED NOT NULL,
    `domain` VARCHAR(255) NULL DEFAULT NULL,
    `web_session_count` INT(11) NOT NULL,
    `last_seen_date` DATETIME NULL DEFAULT NULL,
    `last_contact_date` DATETIME NULL DEFAULT NULL,
    `last_email_open` DATETIME NULL DEFAULT NULL,
    `created_date` DATETIME NOT NULL,
    `is_geo_fetched` TINYINT(1) NOT NULL DEFAULT '0',
    PRIMARY KEY (`id`),
    INDEX `shopify_customer_id` (`shopify_customer_id`),
    INDEX `email` (`email`),
    INDEX `shopify_customer_id_shop_id` (`shopify_customer_id`, `shop_id`),
    INDEX `last_seen_date` (`last_seen_date`)
)
COLLATE='latin1_swedish_ci'
ENGINE=InnoDB;

Query cases, running and not running:

1. Running: The query below fetches the records by joining all 4 tables. It takes only 0.300 ms.

SELECT `c`.first_name,`c`.last_name,`c`.email, `t`.`last_seen_date`, `t`.`last_contact_date`, `ssh`.`shop_name`, ca.`company`, ca.`address1`, ca.`address2`, ca.`city`, ca.`province`, ca.`country`, ca.`zip`, ca.`province_code`, ca.`country_code`
FROM `tbl_customers` AS `c`
JOIN `tbl_shop_setting` AS `ssh` ON c.shop_id = ssh.id 
LEFT JOIN (SELECT shopify_customer_id, last_seen_date, last_contact_date FROM aio_customer_tracking GROUP BY shopify_customer_id) as t ON t.shopify_customer_id = c.shopify_customer_id
LEFT JOIN `tbl_customers_address` as ca ON (c.shopify_customer_id = ca.shopify_customer_id AND ca.default = 1)
GROUP BY c.shopify_customer_id
LIMIT 20

2. Not running: Simply trying to get the count of these rows makes the query hang. I waited 10 minutes and it was still running.

SELECT 
     COUNT(DISTINCT c.shopify_customer_id)   -- what makes #2 different
FROM `tbl_customers` AS `c`
JOIN `tbl_shop_setting` AS `ssh` ON c.shop_id = ssh.id 
LEFT JOIN (SELECT shopify_customer_id, last_seen_date, last_contact_date FROM aio_customer_tracking GROUP BY shopify_customer_id) as t ON t.shopify_customer_id = c.shopify_customer_id
LEFT JOIN `tbl_customers_address` as ca ON (c.shopify_customer_id = ca.shopify_customer_id AND ca.default = 1)
GROUP BY c.shopify_customer_id
LIMIT 20


3. Not running: Here we simply add one ORDER BY clause to query #1 and it hangs. I waited 10 minutes and it was still running. I read some articles on query optimization and tried indexing, RIGHT JOIN, etc., but it still does not work.

SELECT `c`.first_name,`c`.last_name,`c`.email, `t`.`last_seen_date`, `t`.`last_contact_date`, `ssh`.`shop_name`, ca.`company`, ca.`address1`, ca.`address2`, ca.`city`, ca.`province`, ca.`country`, ca.`zip`, ca.`province_code`, ca.`country_code`
FROM `tbl_customers` AS `c`
JOIN `tbl_shop_setting` AS `ssh` ON c.shop_id = ssh.id 
LEFT JOIN (SELECT shopify_customer_id, last_seen_date, last_contact_date FROM aio_customer_tracking GROUP BY shopify_customer_id) as t ON t.shopify_customer_id = c.shopify_customer_id
LEFT JOIN `tbl_customers_address` as ca ON (c.shopify_customer_id = ca.shopify_customer_id AND ca.default = 1)
GROUP BY c.shopify_customer_id
  ORDER BY `t`.`last_seen_date`    -- what makes #3 different
LIMIT 20

EXPLAIN for query #1: (image not included)

EXPLAIN for query #2: (image not included)

EXPLAIN for query #3: (image not included)

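Any suggestions for optimizing the queries or the table structure are welcome.

What I am trying to do:

The tbl_customers table holds customer information, the tbl_customers_address table holds the customers' addresses (one customer may have multiple addresses), and the aio_customer_tracking table holds the customers' visit records, where last_seen_date is the visit date.

Now, I simply want to fetch and count the customers, with their addresses and visit information. Also, I may order by any column from any of these 3 tables. In my example I am ordering by last_seen_date (the default order). I hope this explanation helps in understanding what I am trying to do.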

Ric*_*mes 7

In query #1, but not in the other two, the optimizer can use

UNIQUE INDEX `shopify_customer_id_unique` (`shopify_customer_id`)

to shortcut the query for

GROUP BY c.shopify_customer_id
LIMIT 20

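That is because it can stop after reading 20 entries of the index. The query is still not ultra-fast, because the derived table (subquery t) hits about 51K rows.

Query #2 may be slow because the optimizer fails to notice and remove the redundant DISTINCT. Instead, it may conclude that it cannot stop after 20.

Query #3 must go through all of table c to collect every shopify_customer_id group. That is because the ORDER BY prevents short-circuiting to the LIMIT 20.

The columns in a GROUP BY must include all the non-aggregated columns in the SELECT, except any that are uniquely determined by the GROUP BY columns. Since you have said that a single shopify_customer_id can have multiple addresses, fetching ca.address1 is not proper in combination with GROUP BY shopify_customer_id. Similarly, the subquery seems improper with respect to last_seen_date, last_contact_date.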
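One way to make the derived table deterministic is to aggregate the date columns explicitly. The sketch below assumes that the most recent visit and contact dates are the ones wanted; the question never says so, so MAX() is an assumption, not the asker's stated intent.

```sql
-- Derived table t rewritten so that every selected column is either
-- the GROUP BY key or an aggregate.
-- MAX() assumes "most recent" is what is wanted (not confirmed by the question).
SELECT shopify_customer_id,
       MAX(last_seen_date)    AS last_seen_date,
       MAX(last_contact_date) AS last_contact_date
FROM aio_customer_tracking
GROUP BY shopify_customer_id;
```

The same concern applies to the address join: `ca.default = 1` yields exactly one row per customer only if at most one address per customer is flagged as default, which the schema shown does not enforce.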

In aio_customer_tracking, this change (to a "covering" index) may help a little. Change

INDEX (`shopify_customer_id`)

to

INDEX (`shopify_customer_id`, `last_seen_date`, `last_contact_date`)
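As a statement, that index change might look roughly like the following. The name of the new index is made up for illustration; the dropped index name is the one in the posted schema. Try it on a copy of the table first.

```sql
-- Replace the single-column index with a covering index for subquery t.
-- `shopify_customer_id_seen_contact` is a hypothetical name.
ALTER TABLE aio_customer_tracking
    DROP INDEX `shopify_customer_id`,
    ADD INDEX `shopify_customer_id_seen_contact`
        (`shopify_customer_id`, `last_seen_date`, `last_contact_date`);
```

Dropping the old index should be safe for lookups by shopify_customer_id, since that column is the leading column of the new index.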

Dissecting the goals

"Now, I simply want to ... count the customers"

To count the customers, do this, but don't try to combine it with the "fetching":

SELECT COUNT(*) FROM tbl_customers;

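"Now, I simply want to fetch ... the customers ..."

"tbl_customers - #Rows: 20 million."

Surely you don't want to fetch 20 million rows! I don't want to think about how to attempt that. Please clarify. And I won't accept paginating through that many rows. Perhaps there is a WHERE clause? The WHERE clause is (usually) the most important part of optimization!

"Now, I simply want to fetch the customers with their addresses and visit information."

Assuming the WHERE filters down to a "few" customers, then JOINing to the other tables to get "any" address and "any" visit information may be problematic and/or inefficient. Asking for the "first" or "last" instead of "any" won't be any easier, but it may be more meaningful.

May I suggest that your UI first find a few customers, and then, if the user wants, go to another page with all the addresses and all the visits? Or can the visits run into the hundreds or more?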
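The second page of such a UI could be served by two simple lookups, sketched below. The customer id is a placeholder value, and the column lists are only a guess at what the page would show; both tables already have an index on shopify_customer_id in the posted schema.

```sql
-- Hypothetical detail page for one customer picked on the first page.
SET @customer := 1234567890;   -- placeholder shopify_customer_id

-- All addresses of that customer.
SELECT company, address1, address2, city, province, country, zip, `default`
FROM tbl_customers_address
WHERE shopify_customer_id = @customer;

-- All visits of that customer, newest first.
SELECT last_seen_date, last_contact_date, web_session_count
FROM aio_customer_tracking
WHERE shopify_customer_id = @customer
ORDER BY last_seen_date DESC;
```

Each query touches only the rows of one customer, so neither needs a GROUP BY or a join against the 20-million-row table.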

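"Also, I may order by any column from any of these 3 tables. In my example I am ordering by last_seen_date (the default order)."

Let's focus on optimizing the WHERE, then tack last_seen_date onto the end of any index.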
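As an illustration only: the question states no filter, so the sketch below assumes one on shop_id. It also only works when the filter column and the sort column live in the same table, which is why it starts from aio_customer_tracking. The index name and the shop id value are hypothetical.

```sql
-- Hypothetical filter column first, sort column last.
ALTER TABLE aio_customer_tracking
    ADD INDEX `shop_id_last_seen` (`shop_id`, `last_seen_date`);

-- Pick the 20 newest visits for one shop via the index,
-- then look up the matching customers.
SELECT c.first_name, c.last_name, c.email, t.last_seen_date
FROM (
    SELECT shopify_customer_id, last_seen_date
    FROM aio_customer_tracking
    WHERE shop_id = 42                 -- hypothetical filter value
    ORDER BY last_seen_date DESC
    LIMIT 20
) AS t
JOIN tbl_customers AS c
    ON c.shopify_customer_id = t.shopify_customer_id
ORDER BY t.last_seen_date DESC;
```

Note that this is not equivalent to query #3: customers without any tracking row are left out, and a customer with several recent visits can appear more than once.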


小智 4

shopify_customer_id is unique in tbl_customers, so why does the second query use DISTINCT and GROUP BY on the shopify_customer_id column?

Please get rid of them.
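A minimal sketch of what the count could reduce to. It relies on shopify_customer_id being unique in tbl_customers, as the schema declares, and on the fact that a LEFT JOIN never removes rows from the left table, so the two LEFT JOINs cannot change the number of distinct customers.

```sql
-- Count of customers that belong to an existing shop.
-- No DISTINCT, GROUP BY, LIMIT, or LEFT JOINs are needed for the total.
SELECT COUNT(*)
FROM tbl_customers AS c
JOIN tbl_shop_setting AS ssh ON ssh.id = c.shop_id;
```

If every customer's shop_id is known to match a row in tbl_shop_setting, the inner join can be dropped as well. Also note that query #2 as written, with GROUP BY c.shopify_customer_id, would return one row per customer, each containing 1, rather than a single total.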