Median with GROUP BY

Mar*_*ssa 4 mysql aggregate group-by

Assume the following table t1:

==================
| tag | val |  <-- for simplicity, val is non-NULL
==================
| a1 | v1 |
| a1 | v2 |
| a1 | v3 |
| a1 | v4 |
| a1 | v5 |
| a2 | v6 |
| a2 | v7 |
| a2 | v8 |
| a2 | v9 |
| ... | ... |
==================

If you run the following statement in MySQL:

SELECT `tag`, AVG(`val`) FROM `t1` GROUP BY `tag`

you get the average of val for each value of the tag column:

==================
| tag | AVG() |
==================
| a1 | avg 1 |
| a2 | avg 2 |
| a3 | avg 3 |
| a4 | avg 4 |
| ... | ... |
==================

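Besides AVG(), MySQL has several other built-in functions for computing aggregate values (e.g. SUM(), MAX(), COUNT(), and STD()), and they are used the same way as in the statement above. However, there is no built-in function for the median.

This question has come up several times on SE; however, most of those questions deal with the case without GROUP BY. The only one with GROUP BY seems to be "MySql: Count median grouped by day"; however, the script there seems overly complicated.

What would be a simple way (if there is one) to compute this median?

Follow-up

An excellent article that complements the accepted answer:
http://danielsetzermann.com/howto/how-to-calculate-the-median-per-group-with-mysql/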

Pat*_*che 8

This query can answer your question: median value with GROUP BY

            SELECT tag, AVG(val) as median
            FROM 
            (
              SELECT tag, val,
                  (SELECT count(*) FROM median t2 WHERE t2.tag = t3.tag) as ct,
                  seq,
                  (SELECT count(*) FROM median t2 WHERE t2.tag < t3.tag) as delta
                FROM (SELECT tag, val, @rownum := @rownum + 1 as seq
                      FROM (SELECT * FROM median ORDER BY tag, val) t1 
                      ORDER BY tag, seq
                    ) t3 CROSS JOIN (SELECT @rownum := 0) x
                HAVING (ct%2 = 0 and seq-delta between floor((ct+1)/2) and floor((ct+1)/2) +1)
                  or (ct%2 <> 0 and seq-delta = (ct+1)/2)
            ) T
            GROUP BY tag
            ORDER BY tag;

I tried it on this data set (mostly taken from here):

            +------+------+
            | tag  | val  |
            +------+------+
            |    1 |    3 |
            |    1 |   13 |
            |  ... |  ... |   (the full data set is listed in the explanation below)

            |    3 |   12 |
            |    3 |   43 |
            |    3 |   15 |
            +------+------+
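
To reproduce the test, the table could be created and filled along the following lines. This is only a sketch: the column types (INT) are an assumption, since the answer shows just the column names tag and val, and the values are the ones listed in the explanation further down.

```sql
-- Assumed schema: only the column names tag and val come from the answer.
CREATE TABLE median (
  tag INT NOT NULL,
  val INT NULL
);

-- tag 1: 15 rows, tag 2: 14 rows, tag 3: 3 rows (32 rows in total)
INSERT INTO median (tag, val) VALUES
  (1, 3),  (1, 13), (1, 7),  (1, 5),  (1, 12),
  (1, 14), (1, 21), (1, 23), (1, 23), (1, 23),
  (1, 23), (1, 29), (1, 39), (1, 40), (1, 56),
  (2, 3),  (2, 5),  (2, 7),  (2, 12), (2, 13),
  (2, 14), (2, 21), (2, 23), (2, 23), (2, 23),
  (2, 23), (2, 29), (2, 40), (2, 56),
  (3, 12), (3, 43), (3, 15);
```

The insertion order does not matter, because the query sorts the rows by tag and val itself.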

The result is:

            +------+---------+
            | tag  | median  |
            +------+---------+
            |    1 | 23.0000 |
            |    2 | 22.0000 |
            |    3 | 15.0000 |
            +------+---------+

Explanation

The innermost subquery is evaluated first, so the steps run in the order (1), (2), (3), (4).

-- (4) compute the average (over 2 rows or 1 row)

    SELECT tag, AVG(val) as median                          
      FROM 
      (

-- (3) pick the rows needed to compute the median

        SELECT tag, val,                                       
           (SELECT count(*) FROM median t2                    -- ct: number of rows for the current tag value
              WHERE t2.tag = t3.tag) as ct,
           seq,
           (SELECT count(*) FROM median t2                    -- delta: number of rows before the current tag value,
              WHERE t2.tag < t3.tag) as delta                 --        i.e. the offset of the tag's first row
         FROM (

-- (2) number the rows and sort the data set by tag and seq

                SELECT tag, val,                            
                    @rownum := @rownum + 1 as seq       -- @rownum yields a running sequence 1, 2, 3, ...
              FROM (

-- (1) sort the data set by tag and val

                    SELECT * FROM median           
                    ORDER BY tag, val) t1 

-- (2) continued

              ORDER BY tag, seq
            ) t3 CROSS JOIN (SELECT @rownum := 0) x            -- initializes @rownum to 0 (contributes no data)

-- (3) continued

         HAVING (ct%2 = 0                                      -- when ct is even, keep the two rows around the middle
                  and seq-delta between floor((ct+1)/2) 
                                and floor((ct+1)/2) +1)
           or (ct%2 <> 0                                       -- when ct is odd, keep the single row in the middle
                  and seq-delta = floor((ct+1)/2))
      ) T

-- (4) continued

      GROUP BY tag
      ORDER BY tag;

Data set:

        after (1)     after (2)           processing (3)   
    +------+------+                   
    | tag  | val  |  ct  delta  seq       seq-delta
    +------+------+                   
    |    1 |    3 |  15    0     1        1         ct : odd ct%2 <> 0  
    |    1 |    5 |  15    0     2        2         floor((ct+1)/2) : 8
    |    1 |    7 |  15    0     3        3         
    |    1 |   12 |  15    0     4        4         
    |    1 |   13 |  15    0     5        5
    |    1 |   14 |  15    0     6        6
    |    1 |   21 |  15    0     7        7
    |    1 |   23 |  15    0     8        8 ---> keep this line
    |    1 |   23 |  15    0     9        9 
    |    1 |   23 |  15    0     10       10
    |    1 |   23 |  15    0     11       11
    |    1 |   29 |  15    0     12       12
    |    1 |   39 |  15    0     13       13
    |    1 |   40 |  15    0     14       14
    |    1 |   56 |  15    0     15       15

    |    2 |    3 |  14    15    16        1         ct : even (ct%2 = 0  )
    |    2 |    5 |  14    15    17        2         floor((ct+1)/2) : 7
    |    2 |    7 |  14    15    18        3         floor((ct+1)/2)+1 : 8
    |    2 |   12 |  14    15    19        4
    |    2 |   13 |  14    15    20        5
    |    2 |   14 |  14    15    21        6
    |    2 |   21 |  14    15    22        7 ---> keep this line
    |    2 |   23 |  14    15    23        8 ---> keep this line
    |    2 |   23 |  14    15    24        9
    |    2 |   23 |  14    15    25        10
    |    2 |   23 |  14    15    26        11
    |    2 |   29 |  14    15    27        12
    |    2 |   40 |  14    15    28        13
    |    2 |   56 |  14    15    29        14

    |    3 |   12 |  3     29    30        1                  ct : odd ct%2 <> 0 
    |    3 |   15 |  3     29    31        2 ---> keep        floor((ct+1)/2) : 2
    |    3 |   43 |  3     29    32        3
    +------+------+

Data set after (3):

    +------+------+------+------+-------+
    | tag  | val  | ct   | seq  | delta |
    +------+------+------+------+-------+
    |    1 |   23 |   15 |    8 |     0 |
    |    2 |   21 |   14 |   22 |    15 |
    |    2 |   23 |   14 |   23 |    15 |
    |    3 |   15 |    3 |   31 |    29 |
    +------+------+------+------+-------+

The outer query then computes avg(val) grouped by tag.
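
For comparison, on MySQL 8.0 or later the same row-picking rule can be sketched with window functions instead of @rownum. This is not part of the original answer; it assumes the same table median(tag, val), and the alias ranked is made up for the illustration.

```sql
-- Sketch only: requires MySQL 8.0+ (window functions).
SELECT tag, AVG(val) AS median
FROM (
  SELECT tag, val,
         -- position of the row inside its tag, ordered by val (1, 2, 3, ...)
         ROW_NUMBER() OVER (PARTITION BY tag ORDER BY val) AS seq,
         -- number of non-NULL rows for the tag
         COUNT(*)     OVER (PARTITION BY tag)              AS ct
  FROM median
  WHERE val IS NOT NULL
) ranked
-- odd ct: both expressions give the middle row;
-- even ct: they give the two rows around the middle
WHERE seq IN (FLOOR((ct + 1) / 2), FLOOR((ct + 2) / 2))
GROUP BY tag
ORDER BY tag;
```

Because seq restarts at 1 for every tag, no delta offset is needed here. A tag whose values are all NULL produces no rows at all, so it is missing from the result rather than reported with a NULL median.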

Hope this helps.

But what about computing the median when NULL values are present? See EDIT2 below.

Alternative: using a function

    DELIMITER //
    CREATE FUNCTION median(pTag int)
        RETURNS real
           READS SQL DATA
           DETERMINISTIC
           BEGIN
              DECLARE r real; -- result
    SELECT AVG(val) INTO r
    FROM 
    (
      SELECT val,
           (SELECT count(*) FROM median WHERE tag = pTag) as ct,
           seq
        FROM (SELECT val, @rownum := @rownum + 1 as seq
              FROM (SELECT * FROM median WHERE tag = pTag ORDER BY val ) t1 
              ORDER BY seq
            ) t3 
            CROSS JOIN (SELECT @rownum := 0) x
        HAVING (ct%2 = 0 and seq between floor((ct+1)/2) and floor((ct+1)/2) +1)
          or (ct%2 <> 0 and seq = (ct+1)/2)
    ) T;
    return r;
    END//
    DELIMITER ;

However, the function is then called once for every row:

SELECT tag, median(tag) FROM median; -- my test table is 'median' too...

This query would be "better", since it calls the function only once per distinct tag:

select tag, median(tag) 
  from (select distinct tag from median) t;

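That's the best I can do! Hope it helps!

EDIT2: about NULL values in the data (column val in the example)

Exclude NULL values from the source data with a WHERE clause: add WHERE val IS NOT NULL to the two subqueries that count rows and to the subquery that fetches the data.

EDIT3 (LAST EDIT): moving the initialization of @rownum

It should be placed at the deepest level, so that it is initialized as early as possible during query execution.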

DELIMITER //
CREATE FUNCTION median(pTag int)
    RETURNS real
       READS SQL DATA
       DETERMINISTIC
       BEGIN
          DECLARE r real; -- result
SELECT AVG(val) INTO r
FROM 
(
  SELECT val,
       (SELECT count(*) FROM median WHERE tag = pTag and val is not null)     as ct,
       seq
    FROM (SELECT val, @rownum := @rownum + 1 as seq
          FROM (SELECT * FROM median 
                       CROSS JOIN (SELECT @rownum := 0) x -- INIT @rownum here
             WHERE tag = pTag and val is not null ORDER BY val 
          ) t1 
          ORDER BY seq
       ) t3     
    HAVING (ct%2 = 0 and seq between floor((ct+1)/2) and floor((ct+1)/2) +1)
      or (ct%2 <> 0 and seq = (ct+1)/2)
) T;
return r;
END//
DELIMITER ;

The same changes apply to the query.
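
Applied to the GROUP BY query, the two edits might look like the sketch below. It is an untested adaptation, not the author's own text: the NULL filter is added to both counting subqueries and to the innermost data subquery, and @rownum is initialized at the deepest level. It relies on the same evaluation order of user variables as the original query, which MySQL does not guarantee.

```sql
-- Untested adaptation of the GROUP BY query for data containing NULLs.
SELECT tag, AVG(val) AS median
FROM
(
  SELECT tag, val,
      (SELECT count(*) FROM median t2
        WHERE t2.tag = t3.tag AND t2.val IS NOT NULL) AS ct,     -- non-NULL rows of this tag
      seq,
      (SELECT count(*) FROM median t2
        WHERE t2.tag < t3.tag AND t2.val IS NOT NULL) AS delta   -- non-NULL rows of earlier tags
    FROM (SELECT tag, val, @rownum := @rownum + 1 AS seq
          FROM (SELECT m.tag, m.val
                  FROM median m
                 CROSS JOIN (SELECT @rownum := 0) x              -- init @rownum here
                 WHERE m.val IS NOT NULL
                 ORDER BY m.tag, m.val) t1
          ORDER BY tag, seq
        ) t3
    HAVING (ct%2 = 0 AND seq-delta BETWEEN floor((ct+1)/2) AND floor((ct+1)/2) + 1)
        OR (ct%2 <> 0 AND seq-delta = floor((ct+1)/2))
) T
GROUP BY tag
ORDER BY tag;
```

Note that delta has to count only non-NULL rows as well, because seq numbers only the rows that survive the filter. Unlike the function, this form returns no row for a tag whose values are all NULL.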

Tested with 2 more data sets:

|    4 | NULL |
|    4 |   10 |
|    4 |   15 |
|    4 |   20 |
|    5 | NULL |
|    5 | NULL |
|    5 | NULL |
+------+------+

39 rows in set (0.00 sec)

+------+--------------+
| tag  | median2(tag) |
+------+--------------+
|    1 |           23 |
|    2 |           22 |
|    3 |           15 |
|    4 |           15 |
|    5 |         NULL |
+------+--------------+
5 rows in set (0.08 sec)
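
A test like this one could be set up roughly as follows. The answer does not show the statements it actually ran, so both are assumptions: the INSERT simply loads the seven extra rows listed above, and median2 is assumed to be the EDIT3 function created under that name.

```sql
-- Extra rows for tags 4 and 5, as listed in the test data above.
INSERT INTO median (tag, val) VALUES
  (4, NULL), (4, 10), (4, 15), (4, 20),
  (5, NULL), (5, NULL), (5, NULL);

-- median2 is an assumed name for the NULL-aware function from EDIT3.
SELECT tag, median2(tag)
  FROM (SELECT DISTINCT tag FROM median) t
 ORDER BY tag;
```

For tag 5 every val is NULL, so the function averages over an empty set and AVG() yields NULL, which matches the last row of the result.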