Max*_*f23 4 sql window-functions google-bigquery
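I'm working on a project using the BigQuery USA Facts COVID-19 open dataset. The data looks like this:

I'm trying to create a query that gives me the percent change (up or down) in 7-day COVID case data by county. The end result would be the county, the date, and the percent change in the 7-day moving average of COVID cases. Ultimately, this would let me show which counties are relatively stable and which are seeing increases, i.e. the hot spots.

I'm new to LAG and OVER, so I'm fairly sure I'm just missing some basic ORDER BY or GROUP BY in the CTEs.

What's odd is that when I select just one county (WHERE county_name = "X"), I do get the 7-day moving average: each day gets a sensible percentage telling me whether cases are going up or down. The problem is that when I don't restrict the query to one county, I can't figure out what I need to do or change to still get the same values. I end up with values that make no sense. I'm fairly sure it's because I'm using the window functions incorrectly.

Here's my code: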

WITH
a AS (SELECT long.*,
deaths-lag(deaths) over (order by date) as deaths_increase,
confirmed_cases - lag(confirmed_cases) over (order by date) as cases_increase,
FROM `bigquery-public-data.covid19_usafacts.summary` as long
where date >= cast('2020-05-03' as date)
)
,b as (
SELECT
a.*,
AVG(a.deaths_increase) OVER(ORDER BY a.date ROWS BETWEEN 6 PRECEDING AND CURRENT ROW) AS seven_day_avg_deaths,
AVG(a.cases_increase) OVER(ORDER BY a.date ROWS BETWEEN 6 PRECEDING AND CURRENT ROW) AS seven_day_avg_cases
FROM a
order by a.county_name
)

select
b.county_name,
b.county_fips_code,
b.confirmed_cases,
b.cases_increase,
b.deaths,
b.state,
b.seven_day_avg_cases,
b.date,

(b.seven_day_avg_cases - lag(b.seven_day_avg_cases) OVER( ORDER BY b.date)) / b.seven_day_avg_cases * 100 as seven_day_percent_change

from b

where seven_day_avg_cases > 0

order by date desc

Below is for BigQuery Standard SQL.
You should add PARTITION BY county_name to ALL of the OVER(...) clauses in your query.
After that, your query might look like this:
#standardSQL
WITH a AS (
SELECT long.*,
deaths-lag(deaths) OVER(PARTITION BY county_name ORDER BY DATE) AS deaths_increase,
confirmed_cases - LAG(confirmed_cases) OVER (PARTITION BY county_name ORDER BY DATE) AS cases_increase,
FROM `bigquery-public-data.covid19_usafacts.summary` AS long
WHERE DATE >= CAST('2020-05-03' AS DATE)
), b AS (
SELECT a.*,
AVG(a.deaths_increase) OVER(PARTITION BY county_name ORDER BY a.date ROWS BETWEEN 6 PRECEDING AND CURRENT ROW) AS seven_day_avg_deaths,
AVG(a.cases_increase) OVER(PARTITION BY county_name ORDER BY a.date ROWS BETWEEN 6 PRECEDING AND CURRENT ROW) AS seven_day_avg_cases
FROM a
)
SELECT
b.county_name,
b.county_fips_code,
b.confirmed_cases,
b.cases_increase,
b.deaths,
b.state,
b.seven_day_avg_cases,
b.date,
(b.seven_day_avg_cases - LAG(b.seven_day_avg_cases) OVER(PARTITION BY county_name ORDER BY b.date)) / b.seven_day_avg_cases * 100 AS seven_day_percent_change
FROM b
WHERE seven_day_avg_cases > 0
ORDER BY DATE DESC, county_name
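To see why the missing PARTITION BY produces meaningless values, here is a minimal sketch on made-up data. The toy CTE, its two counties A and B, and all numbers are hypothetical and only stand in for the real table; the point is that an unpartitioned LAG orders every county's rows into one sequence, so the "previous row" can belong to a different county.

```sql
#standardSQL
-- Hypothetical toy data: two counties, cumulative case counts for three days.
WITH toy AS (
  SELECT * FROM UNNEST([
    STRUCT('A' AS county_name, DATE '2020-05-03' AS date, 10 AS confirmed_cases),
    STRUCT('A', DATE '2020-05-04', 12),
    STRUCT('A', DATE '2020-05-05', 15),
    STRUCT('B', DATE '2020-05-03', 100),
    STRUCT('B', DATE '2020-05-04', 130),
    STRUCT('B', DATE '2020-05-05', 170)
  ])
)
SELECT
  county_name,
  date,
  confirmed_cases,
  -- No PARTITION BY: both counties share one window, so the previous row
  -- may come from the other county (and ties on date are ordered arbitrarily).
  confirmed_cases - LAG(confirmed_cases) OVER (ORDER BY date) AS increase_unpartitioned,
  -- PARTITION BY county_name: the previous row is the same county's previous date.
  confirmed_cases - LAG(confirmed_cases) OVER (PARTITION BY county_name ORDER BY date) AS increase_partitioned
FROM toy
ORDER BY county_name, date
```

In the partitioned column, county A gets NULL, 2, 3 and county B gets NULL, 30, 40, which are the true daily increases. The unpartitioned column mixes the two series and is not even deterministic, because two rows share each date. One caveat, stated as an assumption about the real table rather than something verified here: county names repeat across US states, so if the data contains same-named counties, partitioning by state together with county_name may be needed to keep their series apart.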
Note: this obviously assumes that your original query really does work for a single county.
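Another weak point in the query: ORDER BY a.date ROWS BETWEEN 6 PRECEDING AND CURRENT ROW defines a window of 7 consecutive rows, not 7 days. That is only correct when every day is present in the data for each county, which is most likely the case here. A more robust option is ORDER BY UNIX_DATE(a.date) RANGE BETWEEN 6 PRECEDING AND CURRENT ROW, which guarantees a 7-day window even if some days are missing or filtered out for whatever reason.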
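The difference between the two frames can be illustrated with a small sketch on made-up data. The toy CTE and its values are hypothetical: a single county with four reported days, where 2020-05-05 through 2020-05-09 are deliberately missing.

```sql
#standardSQL
-- Hypothetical toy data: one county, daily increases on four dates,
-- with a five-day gap between 2020-05-04 and 2020-05-10.
WITH toy AS (
  SELECT * FROM UNNEST([
    STRUCT(DATE '2020-05-03' AS date, 10 AS cases_increase),
    STRUCT(DATE '2020-05-04', 20),
    STRUCT(DATE '2020-05-10', 30),
    STRUCT(DATE '2020-05-11', 40)
  ])
)
SELECT
  date,
  cases_increase,
  -- ROWS frame: the current row plus up to 6 earlier rows, whatever their dates.
  AVG(cases_increase) OVER (
    ORDER BY date
    ROWS BETWEEN 6 PRECEDING AND CURRENT ROW) AS rows_avg,
  -- RANGE frame over the day number: only rows dated within the last 7 calendar days.
  AVG(cases_increase) OVER (
    ORDER BY UNIX_DATE(date)
    RANGE BETWEEN 6 PRECEDING AND CURRENT ROW) AS range_avg
FROM toy
ORDER BY date
```

For 2020-05-10 the ROWS frame averages 10, 20 and 30 (giving 20), while the RANGE frame only sees 2020-05-04 and 2020-05-10 (giving 25). For 2020-05-11 the results are 25 and 35 respectively. Note that the RANGE version still averages over the rows that exist inside the 7-day span; a missing day is skipped, not counted as zero.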