Using BigQuery to find outliers by combining a standard deviation result with a WHERE clause

Mic*_*hri 1 statistics standard-deviation google-bigquery

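Standard deviation analysis is a useful way to find outliers. Is there a way to incorporate the result of this query (which finds the value four standard deviations above the mean)...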

SELECT (AVG(weight_pounds) + STDDEV(weight_pounds) * 4) as high FROM [publicdata:samples.natality];

Result = 12.721342001626912

...into another query that reports which states and dates had babies born weighing more than 4 standard deviations above the mean?

SELECT state, year, month, COUNT(*) AS outlier_count
FROM [publicdata:samples.natality]
WHERE
  (weight_pounds > 12.721342001626912)
AND
  (state != '' AND state IS NOT NULL)
GROUP BY state, year, month
ORDER BY outlier_count DESC;

Result:

Row  state  year  month  outlier_count
1    MD     1990  12     22
2    NY     1989  10     17
3    CA     1991  9      14

Essentially, it would be great to combine this into a single query.

Jor*_*ani 6

You can abuse JOIN for this (and performance will suffer as a result):

SELECT n.state, n.year, n.month, COUNT(*) AS outlier_count
FROM (
  -- Every row gets the same constant key so it can match the one-row threshold table.
  SELECT state, year, month, weight_pounds, 1 as key
  FROM [publicdata:samples.natality]) as n
JOIN (
  -- One row: the cutoff at mean + 4 standard deviations.
  SELECT (AVG(weight_pounds) + STDDEV(weight_pounds) * 4) as giant_baby,
          1 as key
  FROM [publicdata:samples.natality]) as o
ON n.key = o.key
WHERE
  (n.weight_pounds > o.giant_baby)
AND
  (n.state != '' AND n.state IS NOT NULL)
GROUP BY n.state, n.year, n.month
ORDER BY outlier_count DESC;
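
The query above is written in legacy SQL (note the `[publicdata:samples.natality]` table syntax). If you are able to use BigQuery Standard SQL instead, the same idea can probably be written without the dummy key by cross-joining a one-row subquery. This is only a sketch under assumptions: the table path `bigquery-public-data.samples.natality` and the names `threshold` and `t` are mine, not from the question, and the threshold may differ very slightly from the number above depending on which standard deviation variant each dialect's `STDDEV` uses.

```sql
-- Sketch in BigQuery Standard SQL (assumption: Standard SQL is enabled for the query).
WITH threshold AS (
  -- One-row result holding the cutoff: mean + 4 standard deviations.
  SELECT AVG(weight_pounds) + 4 * STDDEV(weight_pounds) AS giant_baby
  FROM `bigquery-public-data.samples.natality`  -- assumed Standard SQL path for the same public table
)
SELECT n.state, n.year, n.month, COUNT(*) AS outlier_count
FROM `bigquery-public-data.samples.natality` AS n
CROSS JOIN threshold AS t  -- attaches the single cutoff value to every row
WHERE n.weight_pounds > t.giant_baby
  AND n.state != ''
  AND n.state IS NOT NULL
GROUP BY n.state, n.year, n.month
ORDER BY outlier_count DESC;
```

Because `threshold` has exactly one row, the `CROSS JOIN` plays the same role as joining on the constant `key` in the legacy version.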

  • I think this is correct... but correct or not, +1 for the "giant_baby" alias, which still has me giggling as I type this. (4 upvotes)