Is there a better way to calculate the median (not the average)?

Ghi*_*que 14 sql postgresql aggregate-functions

Suppose I have the following table definition:

CREATE TABLE x (i serial primary key, value integer not null);

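I want to compute the MEDIAN of value (not the AVG). The median is the value that splits a set into two subsets containing the same number of elements. If the number of elements is even, the median is the average of the largest value in the lower half and the smallest value in the upper half. (See Wikipedia for details.)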

Here is how I currently compute the MEDIAN, but I suspect there must be a better way:

SELECT AVG(values_around_median) AS median
  FROM (
    SELECT
       DISTINCT(CASE WHEN FIRST_VALUE(above) OVER w2 THEN MIN(value) OVER w3 ELSE MAX(value) OVER w2 END)
        AS values_around_median
      FROM (
        SELECT LAST_VALUE(value) OVER w AS value,
               SUM(COUNT(*)) OVER w > (SELECT count(*)/2 FROM x) AS above
          FROM x
          GROUP BY value
          WINDOW w AS (ORDER BY value)
          ORDER BY value
        ) AS find_if_values_are_above_or_below_median
      WINDOW w2 AS (PARTITION BY above ORDER BY value DESC),
             w3 AS (PARTITION BY above ORDER BY value ASC)
    ) AS find_values_around_median

Any ideas?

Luk*_*der 23

Yes. As of PostgreSQL 9.4 you can use the newly introduced inverse distribution function PERCENTILE_CONT(), an ordered-set aggregate function specified in the SQL standard.

WITH t(value) AS (
  SELECT 1   UNION ALL
  SELECT 2   UNION ALL
  SELECT 100 
)
SELECT
  percentile_cont(0.5) WITHIN GROUP (ORDER BY value)
FROM
  t;

An emulation of MEDIAN() via PERCENTILE_CONT() is also documented here.
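
To illustrate how this behaves for odd and even row counts, here is a minimal sketch that computes one median per group. The grp column and the sample values are made up for this example; only percentile_cont() itself comes from the answer above.

```sql
-- Hypothetical sample data: group 'a' has an odd number of rows,
-- group 'b' an even number.
WITH t(grp, value) AS (
  VALUES ('a', 1), ('a', 2), ('a', 100),
         ('b', 10), ('b', 20)
)
SELECT grp,
       percentile_cont(0.5) WITHIN GROUP (ORDER BY value) AS median
FROM   t
GROUP  BY grp
ORDER  BY grp;
-- a → 2   (the middle of 1, 2, 100)
-- b → 15  (interpolated halfway between 10 and 20)
```

For the even-sized group the function interpolates between the two middle values, which matches the definition of the median given in the question.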


Sco*_*ley 15

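There is indeed a simpler way. In Postgres you can define your own aggregate functions. A while back I posted functions for median, as well as mode and range, to the PostgreSQL snippets library.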

http://wiki.postgresql.org/wiki/Aggregate_Median
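
As a rough idea of what such a user-defined aggregate can look like, the sketch below collects the inputs into an array and picks the middle element(s) in a final function. It is not necessarily the code from the wiki page: the names median_final_sketch and median_sketch are hypothetical, and skipping NULL values is an assumption about the desired behavior.

```sql
-- Final function (hypothetical name): receives all collected values
-- and returns the average of the one or two middle elements.
CREATE FUNCTION median_final_sketch(vals numeric[])
RETURNS numeric
LANGUAGE sql IMMUTABLE AS
$$
  SELECT avg(v)
  FROM  (
    SELECT v,
           row_number() OVER (ORDER BY v) AS rn,  -- position in sorted order
           count(*)     OVER ()           AS ct   -- number of non-NULL values
    FROM   unnest(vals) AS u(v)
    WHERE  v IS NOT NULL                          -- assumption: NULLs are skipped
  ) s
  -- Integer division: both expressions give the same row for an odd ct
  -- and the two middle rows for an even ct.
  WHERE rn IN ((ct + 1) / 2, (ct + 2) / 2);
$$;

-- Aggregate (hypothetical name): array_append accumulates the inputs,
-- the final function above reduces them to the median.
CREATE AGGREGATE median_sketch(numeric) (
  SFUNC     = array_append,
  STYPE     = numeric[],
  FINALFUNC = median_final_sketch,
  INITCOND  = '{}'
);
```

With these in place, a call such as SELECT median_sketch(value) FROM x; should work for the table in the question. Accumulating every row into an array can get slow on large tables, so this is meant as an illustration of the mechanism rather than a tuned implementation.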


Erw*_*ter 7

A simpler query:

WITH y AS (
   SELECT value, row_number() OVER (ORDER BY value) AS rn
   FROM   x
   WHERE  value IS NOT NULL
   )
, c AS (SELECT count(*) AS ct FROM y) 
SELECT CASE WHEN c.ct%2 = 0 THEN
          round((SELECT avg(value) FROM y WHERE y.rn IN (c.ct/2, c.ct/2+1)), 3)
       ELSE
                (SELECT     value  FROM y WHERE y.rn = (c.ct+1)/2)
       END AS median
FROM   c;

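Major points

  • NULL values are ignored.
  • The core feature is the row_number() window function, which has been available since version 8.4.
  • The final SELECT picks one row for an odd number of rows and two rows for an even number. In the even case avg() of the two rows is taken and rounded to 3 decimal places. The result is numeric.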

Testing shows the new version to be 4x faster than the query in the question (and it yields the correct result):

CREATE TEMP TABLE x (value int);
INSERT INTO x SELECT generate_series(1,10000);
INSERT INTO x VALUES (NULL),(NULL),(NULL),(3);
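
On PostgreSQL 9.4 or later, the result on this test data could be cross-checked against percentile_cont(). This is a sketch that assumes the temporary table x created above; percentile_cont() ignores NULL values in its sorted input.

```sql
-- Cross-check: there are 10001 non-NULL values (1..10000 plus a second 3),
-- so the median is the 5001st value in sorted order.
SELECT percentile_cont(0.5) WITHIN GROUP (ORDER BY value) AS median
FROM   x;
-- → 5000
```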