Mar*_*ssa 4 mysql aggregate group-by
Suppose the following table t1:
==================
| tag  | val  |   <-- for simplicity, val is non-NULL
==================
| a1   | v1   |
| a1   | v2   |
| a1   | v3   |
| a1   | v4   |
| a1   | v5   |
| a2   | v6   |
| a2   | v7   |
| a2   | v8   |
| a2   | v9   |
| ...  | ...  |
==================
If you run the following script in MySQL:
SELECT `tag`, AVG(`val`) FROM `t1` GROUP BY `tag`
you get the average of val for each group of the tag column:
==================
| tag  | AVG() |
==================
| a1   | avg 1 |
| a2   | avg 2 |
| a3   | avg 3 |
| a4   | avg 4 |
| ...  | ...   |
==================
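Besides AVG(), MySQL has several other built-in functions for computing aggregate values (e.g. SUM(), MAX(), COUNT(), and STD()), which are used in the same way as in the script above. However, there is no built-in function for the median.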
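As a small illustration of "used in the same way", the other aggregates named above can replace AVG() in the same GROUP BY query. This is only a sketch against the t1 table from the question; the column aliases are made up for readability and val is assumed to be numeric.

```sql
-- Same shape as the AVG() query: one output row per tag.
-- Aliases (total, largest, n, std_dev) are illustrative names only.
SELECT `tag`,
       SUM(`val`)   AS total,
       MAX(`val`)   AS largest,
       COUNT(`val`) AS n,
       STD(`val`)   AS std_dev
FROM `t1`
GROUP BY `tag`;
```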
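This question has come up several times on SE; however, most of those questions deal with the case without GROUP BY. The only one with GROUP BY seems to be "MySql: Count median grouped by day"; however, the script there seems overly complicated.
What would be a simple way (if one exists) to compute this median?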
An excellent article that complements the accepted answer: http://danielsetzermann.com/howto/how-to-calculate-the-median-per-group-with-mysql/
This query can answer your question (median with GROUP BY):
SELECT tag, AVG(val) as median
FROM
(
    SELECT tag, val,
        (SELECT count(*) FROM median t2 WHERE t2.tag = t3.tag) as ct,
        seq,
        (SELECT count(*) FROM median t2 WHERE t2.tag < t3.tag) as delta
    FROM (SELECT tag, val, @rownum := @rownum + 1 as seq
          FROM (SELECT * FROM median ORDER BY tag, val) t1
          ORDER BY tag, seq
         ) t3 CROSS JOIN (SELECT @rownum := 0) x
    HAVING (ct%2 = 0 and seq-delta between floor((ct+1)/2) and floor((ct+1)/2) +1)
        or (ct%2 <> 0 and seq-delta = (ct+1)/2)
) T
GROUP BY tag
ORDER BY tag;
I tried it on this dataset (mostly taken from here):
+------+------+
| tag | val |
+------+------+
| 1 | 3 |
| 1 | 13 |
... (see the explanation below)
| 3 | 12 |
| 3 | 43 |
| 3 | 15 |
+------+------+
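To reproduce the test, the table can be set up along the following lines. This is a sketch: the column types are an assumption (the post shows only the data), and the rows are copied from the full listing in the step-by-step walk-through further down, since the table above is abbreviated.

```sql
-- Assumed schema: the post does not show the CREATE TABLE statement.
CREATE TABLE median (
    tag INT,
    val INT
);

-- Rows copied from the walk-through listing (15 + 14 + 3 = 32 rows).
INSERT INTO median (tag, val) VALUES
    (1, 3), (1, 5), (1, 7), (1, 12), (1, 13), (1, 14), (1, 21), (1, 23),
    (1, 23), (1, 23), (1, 23), (1, 29), (1, 39), (1, 40), (1, 56),
    (2, 3), (2, 5), (2, 7), (2, 12), (2, 13), (2, 14), (2, 21), (2, 23),
    (2, 23), (2, 23), (2, 23), (2, 29), (2, 40), (2, 56),
    (3, 12), (3, 15), (3, 43);
```

The insertion order does not matter, because the query sorts the rows by tag and val itself.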
The result is:
+------+---------+
| tag | median |
+------+---------+
| 1 | 23.0000 |
| 2 | 22.0000 |
| 3 | 15.0000 |
+------+---------+
The innermost subquery is evaluated first; the order of evaluation is (1) (2) (3) (4).
-- (4) compute the average (of 2 rows or 1 row)
SELECT tag, AVG(val) as median
FROM
(
    -- (3) select the rows needed to compute the median
    SELECT tag, val,
        (SELECT count(*) FROM median t2      -- number of rows for the current tag value, as ct
         WHERE t2.tag = t3.tag) as ct,
        seq,
        (SELECT count(*) FROM median t2      -- number of rows before the current tag value, as delta,
         WHERE t2.tag < t3.tag) as delta     -- used to compute the starting row number of a tag
    FROM (
        -- (2) number the rows, ordered by tag and seq
        SELECT tag, val,
            @rownum := @rownum + 1 as seq    -- @rownum generates a running sequence, incremented by 1 per row
        FROM (
            -- (1) sort the dataset by tag and val
            SELECT * FROM median
            ORDER BY tag, val) t1
        -- (2) continues here
        ORDER BY tag, seq
    ) t3 CROSS JOIN (SELECT @rownum := 0) x  -- used to set @rownum to 0 (adds no data)
    -- (3) continues here
    HAVING (ct%2 = 0                         -- when ct is even, keep the two rows around the middle
            and seq-delta between floor((ct+1)/2)
                              and floor((ct+1)/2) +1)
        or (ct%2 <> 0                        -- when ct is odd, keep the single row in the middle
            and seq-delta = floor((ct+1)/2))
) T
-- (4) continues here
GROUP BY tag
ORDER BY tag;
The dataset:
after (1) after (2) processing (3)
+------+------+
| tag | val | ct delta seq seq-delta
+------+------+
| 1 | 3 | 15 0 1 1 ct : odd ct%2 <> 0
| 1 | 5 | 15 0 2 2 floor((ct+1)/2) : 8
| 1 | 7 | 15 0 3 3
| 1 | 12 | 15 0 4 4
| 1 | 13 | 15 0 5 5
| 1 | 14 | 15 0 6 6
| 1 | 21 | 15 0 7 7
| 1 | 23 | 15 0 8 8 ---> keep this line
| 1 | 23 | 15 0 9 9
| 1 | 23 | 15 0 10 10
| 1 | 23 | 15 0 11 11
| 1 | 29 | 15 0 12 12
| 1 | 39 | 15 0 13 13
| 1 | 40 | 15 0 14 14
| 1 | 56 | 15 0 15 15
| 2 | 3 | 14 15 16 1 ct : even (ct%2 = 0 )
| 2 | 5 | 14 15 17 2 floor((ct+1)/2) : 7
| 2 | 7 | 14 15 18 3 floor((ct+1)/2)+1 : 8
| 2 | 12 | 14 15 19 4
| 2 | 13 | 14 15 20 5
| 2 | 14 | 14 15 21 6
| 2 | 21 | 14 15 22 7 ---> keep this line
| 2 | 23 | 14 15 23 8 ---> keep this line
| 2 | 23 | 14 15 24 9
| 2 | 23 | 14 15 25 10
| 2 | 23 | 14 15 26 11
| 2 | 29 | 14 15 27 12
| 2 | 40 | 14 15 28 13
| 2 | 56 | 14 15 29 14
| 3 | 12 | 3 29 30 1 ct : odd ct%2 <> 0
| 3 | 15 | 3 29 31 2 ---> keep floor((ct+1)/2) : 2
| 3 | 43 | 3 29 32 3
+------+------+
The dataset after (3):
+------+------+------+------+-------+
| tag | val | ct | seq | delta |
+------+------+------+------+-------+
| 1 | 23 | 15 | 8 | 0 |
| 2 | 21 | 14 | 22 | 15 |
| 2 | 23 | 14 | 23 | 15 |
| 3 | 15 | 3 | 31 | 29 |
+------+------+------+------+-------+
The outer query then computes avg(val), grouped by tag.
Hope this helps.
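For readers on MySQL 8.0 or later, the same row-selection idea can be sketched with window functions instead of user variables. This is an alternative technique, not the query from the answer above: ROW_NUMBER() partitioned by tag plays the role of seq-delta, and COUNT(*) partitioned by tag plays the role of ct. The aliases rn and ranked are illustrative names, and val is assumed to be non-NULL, as in the question.

```sql
-- Assumes MySQL 8.0+ (window functions) and the test table median(tag, val).
SELECT tag, AVG(val) AS median
FROM (
    SELECT tag, val,
           ROW_NUMBER() OVER (PARTITION BY tag ORDER BY val) AS rn,  -- position within the tag (like seq-delta)
           COUNT(*)     OVER (PARTITION BY tag)              AS ct   -- number of rows for the tag
    FROM median
) ranked
-- Odd ct: both expressions give the single middle position.
-- Even ct: they give the two positions around the middle.
WHERE rn IN (FLOOR((ct + 1) / 2), FLOOR((ct + 2) / 2))
GROUP BY tag
ORDER BY tag;
```

For ct = 15 both expressions evaluate to 8, for ct = 14 they evaluate to 7 and 8, and for ct = 3 both evaluate to 2, which are the same rows marked "keep" in the walk-through above.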
But what about computing the median when NULL values are present? See EDIT2 below.
DELIMITER //
CREATE FUNCTION median(pTag int)
RETURNS real
READS SQL DATA
DETERMINISTIC
BEGIN
    DECLARE r real; -- result
    SELECT AVG(val) INTO r
    FROM
    (
        SELECT val,
            (SELECT count(*) FROM median WHERE tag = pTag) as ct,
            seq
        FROM (SELECT val, @rownum := @rownum + 1 as seq
              FROM (SELECT * FROM median WHERE tag = pTag ORDER BY val ) t1
              ORDER BY seq
             ) t3
        CROSS JOIN (SELECT @rownum := 0) x
        HAVING (ct%2 = 0 and seq between floor((ct+1)/2) and floor((ct+1)/2) +1)
            or (ct%2 <> 0 and seq = (ct+1)/2)
    ) T;
    return r;
END//
DELIMITER ;
But the function is then called once for every row:
SELECT tag, median(tag) FROM median; -- my test table is 'median' too...
This query is "better", since the function is called only once per distinct tag:
select tag, median(tag)
from (select distinct tag from median) t;
That is the best I can do! Hope it helps!
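To leave NULL values out of the source data, use a WHERE clause, WHERE val IS NOT NULL, in the 2 subqueries that count the rows and in the subquery that fetches the data.
It should be placed at the deepest level, so that it is applied as early as possible during the execution of the query.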
DELIMITER //
CREATE FUNCTION median(pTag int)
RETURNS real
READS SQL DATA
DETERMINISTIC
BEGIN
    DECLARE r real; -- result
    SELECT AVG(val) INTO r
    FROM
    (
        SELECT val,
            (SELECT count(*) FROM median WHERE tag = pTag and val is not null) as ct,
            seq
        FROM (SELECT val, @rownum := @rownum + 1 as seq
              FROM (SELECT * FROM median
                    CROSS JOIN (SELECT @rownum := 0) x -- INIT @rownum here
                    WHERE tag = pTag and val is not null ORDER BY val
                   ) t1
              ORDER BY seq
             ) t3
        HAVING (ct%2 = 0 and seq between floor((ct+1)/2) and floor((ct+1)/2) +1)
            or (ct%2 <> 0 and seq = (ct+1)/2)
    ) T;
    return r;
END//
DELIMITER ;
The same applies to the query.
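As a sketch of what that could look like, here is the GROUP BY query from the beginning of this answer with the val IS NOT NULL filter added in the three places described: the two counting subqueries and the innermost subquery that fetches the data. Nothing else is changed. Note that this form relies on assigning user variables inside a SELECT, whose evaluation order MySQL does not guarantee, so treat it as an illustration rather than a verified implementation.

```sql
SELECT tag, AVG(val) AS median
FROM (
    SELECT tag, val,
           (SELECT COUNT(*) FROM median t2
             WHERE t2.tag = t3.tag AND t2.val IS NOT NULL) AS ct,     -- non-NULL rows for this tag
           seq,
           (SELECT COUNT(*) FROM median t2
             WHERE t2.tag < t3.tag AND t2.val IS NOT NULL) AS delta   -- non-NULL rows before this tag
    FROM (SELECT tag, val, @rownum := @rownum + 1 AS seq
          FROM (SELECT * FROM median
                WHERE val IS NOT NULL            -- filter at the deepest level
                ORDER BY tag, val) t1
          ORDER BY tag, seq
         ) t3 CROSS JOIN (SELECT @rownum := 0) x
    HAVING (ct % 2 = 0 AND seq - delta BETWEEN FLOOR((ct + 1) / 2) AND FLOOR((ct + 1) / 2) + 1)
        OR (ct % 2 <> 0 AND seq - delta = FLOOR((ct + 1) / 2))
) T
GROUP BY tag
ORDER BY tag;
```

One difference from the function is worth keeping in mind: a tag whose values are all NULL has no rows left after filtering, so it does not appear in the output of this query at all, whereas the function returns NULL for it.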
Tested again with 2 more datasets:
| 4 | NULL |
| 4 | 10 |
| 4 | 15 |
| 4 | 20 |
| 5 | NULL |
| 5 | NULL |
| 5 | NULL |
+------+------+
39 rows in set (0.00 sec)
+------+--------------+
| tag | median2(tag) |
+------+--------------+
| 1 | 23 |
| 2 | 22 |
| 3 | 15 |
| 4 | 15 |
| 5 | NULL |
+------+--------------+
5 rows in set (0.08 sec)