Performance implications of allowing an alias to be used in the HAVING clause

Tim*_*sen 8 mysql sql sql-server having

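I made a bit of a fool of myself earlier today on this question. The question was about SQL Server, and the correct answer involved adding a HAVING clause. My initial mistake was to think that an alias from the SELECT list could be used in the HAVING clause, which SQL Server does not allow. I made this error because I assumed SQL Server had the same rules as MySQL, which does allow an alias in the HAVING clause.

This got me curious. I poked around on Stack Overflow and elsewhere and found plenty of material explaining why these rules are enforced on the two respective RDBMSs. But nowhere did I find an explanation of the performance implications of allowing or disallowing an alias in the HAVING clause.

To give a concrete example, I will duplicate the query from the question mentioned above: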

SELECT students.camID, campus.camName, COUNT(students.stuID) as studentCount
FROM students
JOIN campus
    ON campus.camID = students.camID
GROUP BY students.camID, campus.camName
HAVING COUNT(students.stuID) > 3
ORDER BY studentCount

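What would be the performance implications of using the alias in the HAVING clause instead of re-specifying the COUNT? This question can be answered directly for MySQL, and hopefully someone can give insight into what would happen in SQL Server if it were to support an alias in the HAVING clause.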

This is a rare instance where it may be fine to tag a SQL question with both MySQL and SQL Server, so enjoy this moment in the sun.

Dre*_*rew 4

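This is narrowly focused on that particular query, with the sample data loaded below. It also addresses some other queries, such as the count(distinct ...) mentioned by others.

The alias in the HAVING appears to either slightly outperform or substantially outperform its alternative (depending on the query).

This uses a pre-existing table with about 5 million rows in it, created quickly via this answer of mine, which takes 3 to 5 minutes.

Resulting structure: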

CREATE TABLE `ratings` (
  `id` int(11) NOT NULL AUTO_INCREMENT,
  `thing` int(11) NOT NULL,
  PRIMARY KEY (`id`)
) ENGINE=InnoDB AUTO_INCREMENT=5046214 DEFAULT CHARSET=utf8;

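Here the table uses InnoDB instead. That produces the expected InnoDB gap anomaly in the ids because of the range-reservation inserts. Worth mentioning, but it makes no difference. 4.7 million rows.

Modify the table to get near Tim's assumed schema.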

rename table ratings to students; -- not exactly instantaneous (a COPY)
alter table students add column camId int; -- get it near Tim's schema
-- don't add the `camId` index yet

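The following will take a while. Run it again and again in chunks, or else your connection may time out. The timeout would come from updating all 5 million rows in one statement with no LIMIT clause. Note that the statement below does have a LIMIT clause.

So we do it in iterations of half a million rows. Each one sets the column to a random number between 1 and 20.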

update students set camId=floor(rand()*20+1) where camId is null limit 500000; -- well that took a while (no surprise)

Keep running the above until no camId is null.

I ran it about 10 times (the whole thing takes 7 to 10 minutes).
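The repeated manual runs described above could also be scripted. The following is only a minimal sketch, assuming a MySQL stored procedure is acceptable in your environment. The procedure name `fill_camId` and the variable `affected` are made up for illustration, and `DELIMITER` is a directive of the mysql command-line client rather than SQL itself.

```sql
DELIMITER $$

-- hypothetical helper, not part of the original answer
CREATE PROCEDURE fill_camId()
BEGIN
    DECLARE affected INT DEFAULT 1;

    -- repeat the 500,000-row batch until a pass changes nothing
    WHILE affected > 0 DO
        UPDATE students
        SET camId = FLOOR(RAND() * 20 + 1)
        WHERE camId IS NULL
        LIMIT 500000;

        -- ROW_COUNT() reports how many rows the UPDATE just changed
        SET affected = ROW_COUNT();
    END WHILE;
END$$

DELIMITER ;

CALL fill_camId();

-- should return 0 once every row has a camId
SELECT COUNT(*) AS remaining FROM students WHERE camId IS NULL;
```

Each UPDATE still touches at most 500,000 rows, but the single CALL stays open for the whole run, so a client-side timeout may still apply. In that case, running the UPDATE by hand as described is the safer route.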

select camId,count(*) from students
group by camId order by 1 ;

1   235641
2   236060
3   236249
4   235736
5   236333
6   235540
7   235870
8   236815
9   235950
10  235594
11  236504
12  236483
13  235656
14  236264
15  236050
16  236176
17  236097
18  235239
19  235556
20  234779

select count(*) from students;
-- 4.7 Million rows

Create a useful index (after the inserts, of course).

create index `ix_stu_cam` on students(camId); -- takes 45 seconds

ANALYZE TABLE students; -- update the stats: http://dev.mysql.com/doc/refman/5.7/en/analyze-table.html
-- the above is fine, takes 1 second

Create the campus table.

create table campus
(   camID int auto_increment primary key,
    camName varchar(100) not null
);
insert campus(camName) values
('one'),('2'),('3'),('4'),('5'),
('6'),('7'),('8'),('9'),('ten'),
('etc'),('etc'),('etc'),('etc'),('etc'),
('etc'),('etc'),('etc'),('etc'),('twenty');
-- ok 20 of them

Run the two queries:

SELECT students.camID, campus.camName, COUNT(students.id) as studentCount 
FROM students 
JOIN campus 
    ON campus.camID = students.camID 
GROUP BY students.camID, campus.camName 
HAVING COUNT(students.id) > 3 
ORDER BY studentCount; 
-- run it many many times, back to back, 5.50 seconds, 20 rows of output

SELECT students.camID, campus.camName, COUNT(students.id) as studentCount 
FROM students 
JOIN campus 
    ON campus.camID = students.camID 
GROUP BY students.camID, campus.camName 
HAVING studentCount > 3 
ORDER BY studentCount; 
-- run it many many times, back to back, 5.50 seconds, 20 rows of output

So the times are identical. I ran each a dozen times.
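For anyone repeating the comparison, one way to collect per-statement durations in a single session is MySQL's session profiling. This is a sketch only: it is not how the timings above were taken, and `profiling` / `SHOW PROFILES` are deprecated, though still available in MySQL 5.7.

```sql
-- enable profiling for the current session only
SET profiling = 1;

-- variant 1: aggregate repeated in HAVING
SELECT students.camID, campus.camName, COUNT(students.id) AS studentCount
FROM students
JOIN campus
    ON campus.camID = students.camID
GROUP BY students.camID, campus.camName
HAVING COUNT(students.id) > 3
ORDER BY studentCount;

-- variant 2: alias in HAVING
SELECT students.camID, campus.camName, COUNT(students.id) AS studentCount
FROM students
JOIN campus
    ON campus.camID = students.camID
GROUP BY students.camID, campus.camName
HAVING studentCount > 3
ORDER BY studentCount;

-- one row per recent statement: Query_ID, Duration (seconds), Query
SHOW PROFILES;

SET profiling = 0;
```

Running each variant several times before calling `SHOW PROFILES` gives a small sample of durations per variant rather than a single reading.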

The EXPLAIN output is the same for both:

+----+-------------+----------+------+---------------+------------+---------+----------------------+--------+---------------------------------+
| id | select_type | table    | type | possible_keys | key        | key_len | ref                  | rows   | Extra                           |
+----+-------------+----------+------+---------------+------------+---------+----------------------+--------+---------------------------------+
|  1 | SIMPLE      | campus   | ALL  | PRIMARY       | NULL       | NULL    | NULL                 |     20 | Using temporary; Using filesort |
|  1 | SIMPLE      | students | ref  | ix_stu_cam    | ix_stu_cam | 5       | bigtest.campus.camID | 123766 | Using index                     |
+----+-------------+----------+------+---------------+------------+---------+----------------------+--------+---------------------------------+

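Using the AVG() function, I get about a 12% performance improvement with the alias in the HAVING (with identical EXPLAIN output) from the following two queries.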

SELECT students.camID, campus.camName, avg(students.id) as studentAvg 
FROM students 
JOIN campus 
    ON campus.camID = students.camID 
GROUP BY students.camID, campus.camName 
HAVING avg(students.id) > 2200000 
ORDER BY students.camID; 
-- avg time 7.5

-- same query with the alias in HAVING (prefix either query with EXPLAIN to compare plans)

SELECT students.camID, campus.camName, avg(students.id) as studentAvg 
FROM students 
JOIN campus 
    ON campus.camID = students.camID 
GROUP BY students.camID, campus.camName 
HAVING studentAvg > 2200000
ORDER BY students.camID;
-- avg time 6.5

And lastly, the DISTINCT:

SELECT students.camID, count(distinct students.id) as studentDistinct 
FROM students 
JOIN campus 
    ON campus.camID = students.camID 
GROUP BY students.camID 
HAVING count(distinct students.id) > 1000000 
ORDER BY students.camID; -- 10.6   10.84   12.1   11.49   10.1   9.97   10.27   11.53   9.84 9.98
-- 9.9

 SELECT students.camID, count(distinct students.id) as studentDistinct 
 FROM students 
 JOIN campus 
    ON campus.camID = students.camID 
 GROUP BY students.camID 
 HAVING studentDistinct > 1000000 
 ORDER BY students.camID; -- 6.81    6.55   6.75   6.31   7.11 6.36   6.55
-- 6.45

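The alias in the HAVING consistently runs about 35% faster, with the same EXPLAIN output for both queries (shown below). So identical EXPLAIN output has now twice failed to mean identical performance; treat it as a general clue only.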

+----+-------------+----------+-------+---------------+------------+---------+----------------------+--------+----------------------------------------------+
| id | select_type | table    | type  | possible_keys | key        | key_len | ref                  | rows   | Extra                                        |
+----+-------------+----------+-------+---------------+------------+---------+----------------------+--------+----------------------------------------------+
|  1 | SIMPLE      | campus   | index | PRIMARY       | PRIMARY    | 4       | NULL                 |     20 | Using index; Using temporary; Using filesort |
|  1 | SIMPLE      | students | ref   | ix_stu_cam    | ix_stu_cam | 5       | bigtest.campus.camID | 123766 | Using index                                  |
+----+-------------+----------+-------+---------------+------------+---------+----------------------+--------+----------------------------------------------+

The optimizer seems to favor the alias in the HAVING at the moment, especially for the DISTINCT.
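For SQL Server, where the alias cannot be referenced in HAVING, a common way to write the aggregate only once is to wrap the grouped query in a derived table and filter on the alias from outside. The sketch below uses the tables from this answer; the derived-table name `t` is arbitrary. It was not part of the timings above, so no performance claim is made for it.

```sql
-- the aggregate is written once, inside the derived table
SELECT t.camID, t.camName, t.studentCount
FROM (
    SELECT students.camID, campus.camName, COUNT(students.id) AS studentCount
    FROM students
    JOIN campus
        ON campus.camID = students.camID
    GROUP BY students.camID, campus.camName
) AS t
-- the alias is visible here because it is a column of the derived table
WHERE t.studentCount > 3
ORDER BY t.studentCount;
```

The same statement runs on MySQL, which makes it a convenient form when one query has to work on both systems.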