How can I efficiently compute the median of a sequence of numbers in Google BigQuery?

Man*_*wal 7 median google-bigquery

I need to efficiently compute the median of a sequence of numbers in Google BigQuery. Is that possible?

Pen*_*m10 12

Yes, you can use the PERCENTILE_CONT window function.

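It returns a value computed by linear interpolation between the values of the group, after ordering them according to the ORDER BY clause.

Its argument, the percentile, must be between 0 and 1.

In legacy SQL, which the query below uses, this window function requires ORDER BY in the OVER clause.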

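So an example query looks like this (MAX() is only there to collapse each group to a single row; it plays no part in the median calculation, so don't let it confuse you):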

SELECT room,
       MAX(median) AS temperature
FROM
  (SELECT room,
          PERCENTILE_CONT(0.5) OVER (PARTITION BY room
                                     ORDER BY temperature) AS median
   FROM
     (SELECT 1 AS room, 11 AS temperature),
     (SELECT 1 AS room, 12 AS temperature),
     (SELECT 1 AS room, 14 AS temperature),
     (SELECT 1 AS room, 19 AS temperature),
     (SELECT 1 AS room, 13 AS temperature),
     (SELECT 2 AS room, 20 AS temperature),
     (SELECT 2 AS room, 21 AS temperature),
     (SELECT 2 AS room, 29 AS temperature),
     (SELECT 3 AS room, 30 AS temperature))
GROUP BY room

This returns:

+------+-------------+
| room | temperature |
+------+-------------+
|    1 |          13 |
|    2 |          21 |
|    3 |          30 |
+------+-------------+

  • Can we have a clearer, more concise query? I can't make sense of the one above. (2 upvotes)
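
One way to tidy up the query above, sketched in standard SQL rather than legacy SQL. It assumes the standard SQL form of PERCENTILE_CONT, which takes the value expression as its first argument and is called without ORDER BY in the OVER clause. The `readings` name is made up for this example.

```sql
#standardSQL
-- Hypothetical sample data: the same rows as the query above.
WITH readings AS (
  SELECT 1 AS room, 11 AS temperature UNION ALL
  SELECT 1, 12 UNION ALL
  SELECT 1, 14 UNION ALL
  SELECT 1, 19 UNION ALL
  SELECT 1, 13 UNION ALL
  SELECT 2, 20 UNION ALL
  SELECT 2, 21 UNION ALL
  SELECT 2, 29 UNION ALL
  SELECT 3, 30
)
-- The window function repeats the median on every row of a room,
-- so DISTINCT collapses the output to one row per room.
SELECT DISTINCT
  room,
  PERCENTILE_CONT(temperature, 0.5) OVER (PARTITION BY room) AS median
FROM readings
ORDER BY room
```

Here DISTINCT plays the role that MAX() with GROUP BY plays in the legacy query. For this data the medians should again be 13, 21 and 30, returned as floating-point values.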

Mos*_*sky 7

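An alternative for when you don't need an absolutely exact result and an approximation is good enough: you can use a combination of the NTH and QUANTILES aggregation functions. The advantage of this approach is that it is more scalable than the analytic window function; the downside is that it gives an approximate result.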

SELECT room,
       NTH(51, QUANTILES(temperature, 101)) AS temperature
FROM
  (SELECT 1 AS room, 11 AS temperature),
  (SELECT 1 AS room, 12 AS temperature),
  (SELECT 1 AS room, 14 AS temperature),
  (SELECT 1 AS room, 19 AS temperature),
  (SELECT 1 AS room, 13 AS temperature),
  (SELECT 2 AS room, 20 AS temperature),
  (SELECT 2 AS room, 21 AS temperature),
  (SELECT 2 AS room, 29 AS temperature),
  (SELECT 3 AS room, 30 AS temperature)
GROUP BY room

This returns:

room temperature 
1    13  
2    21  
3    30
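
The index is easy to get wrong, so here is a sketch of the same idea in standard SQL. It assumes APPROX_QUANTILES(x, 100), which returns an array of 101 boundaries with the approximate minimum first and the approximate maximum last. Array offsets are zero-based, so the middle element sits at OFFSET(50), whereas legacy NTH counts from 1, which puts the middle of 101 values at position 51. The `readings` name is illustrative only.

```sql
#standardSQL
-- Hypothetical sample data: the same rows as the legacy query above.
WITH readings AS (
  SELECT 1 AS room, 11 AS temperature UNION ALL
  SELECT 1, 12 UNION ALL
  SELECT 1, 14 UNION ALL
  SELECT 1, 19 UNION ALL
  SELECT 1, 13 UNION ALL
  SELECT 2, 20 UNION ALL
  SELECT 2, 21 UNION ALL
  SELECT 2, 29 UNION ALL
  SELECT 3, 30
)
-- 100 quantiles -> 101 boundaries; offset 50 is the approximate median.
SELECT
  room,
  APPROX_QUANTILES(temperature, 100)[OFFSET(50)] AS approx_median
FROM readings
GROUP BY room
ORDER BY room
```

Because this is a plain aggregate, it needs only GROUP BY and no window or outer query.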


Fel*_*ffa 5

2018 update, with more metrics:

BigQuery SQL: average, geometric mean, removing outliers, median


For my own future reference, here are the queries against the taxi data:

Approximate quantiles:

SELECT MONTH(pickup_datetime) month, NTH(51, QUANTILES(tip_amount,101)) median
FROM [nyc-tlc:green.trips_2015]
WHERE tip_amount > 0
GROUP BY 1
ORDER BY 1

This gives the same results as PERCENTILE_DISC:

SELECT month, FIRST(median) median
FROM (
  SELECT MONTH(pickup_datetime) month, tip_amount, PERCENTILE_DISC(0.5) OVER(PARTITION BY month ORDER BY tip_amount) median
  FROM [nyc-tlc:green.trips_2015]
  WHERE tip_amount > 0
)
GROUP BY 1
ORDER BY 1

Standard SQL:

#StandardSQL
SELECT DATE_TRUNC(DATE(pickup_datetime), MONTH) month, APPROX_QUANTILES(tip_amount,1000)[OFFSET(500)] median
FROM `nyc-tlc.green.trips_2015`
WHERE tip_amount > 0
GROUP BY 1
ORDER BY 1
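
When an exact value is needed in standard SQL, a possible counterpart to the legacy PERCENTILE_DISC query is sketched below. It assumes the standard SQL signature PERCENTILE_DISC(value, percentile) OVER (...) and reuses the table and columns from the queries above. The `trip_month` alias is my own choice for the example.

```sql
#standardSQL
-- The inner query derives the month once so the window can partition by it.
SELECT DISTINCT
  trip_month,
  PERCENTILE_DISC(tip_amount, 0.5) OVER (PARTITION BY trip_month) AS median
FROM (
  SELECT
    DATE_TRUNC(DATE(pickup_datetime), MONTH) AS trip_month,
    tip_amount
  FROM `nyc-tlc.green.trips_2015`
  WHERE tip_amount > 0
)
ORDER BY trip_month
```

As with the legacy window-function version, this has to process each partition in full, so it is likely to scale less well than the APPROX_QUANTILES query.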