Gui*_*lle 4 user-defined-functions google-bigquery
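We have been trying, without success, to loop over data in BigQuery (Standard SQL).
I am not sure whether the problem is that SQL does not support this, that we have misunderstood the problem, or that our approach to doing it in BigQuery is wrong.
In any case, say we have a table of events, where each event is described by a user id and a date (the same user id may have many events on the same date):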
id STRING
dt DATE
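One thing we want to know is how many distinct users generated events in a given period. This is fairly trivial: a COUNT on the table with the period as a constraint in the WHERE clause. For example, for a four-month period: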
SELECT
COUNT(DISTINCT id) AS total
FROM
`events`
WHERE
dt BETWEEN DATE_ADD(CURRENT_DATE(), INTERVAL -4 MONTH)
AND CURRENT_DATE()
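The problem arises when we want the same figure, over the same fixed period, for other days (or weeks) in the past: yesterday, the day before yesterday, and so on, back to, say, 3 months ago. The variable here is CURRENT_DATE(), which moves back by one day (or by any other step), while the interval stays the same (4 months in our case). We expect something like this (with a step of one day):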
2017-07-14 2017-03-14 1760333
2017-07-13 2017-03-13 1856333
2017-07-12 2017-03-12 2031993
...
2017-04-14 2017-01-14 1999352
This is just a loop over the same table for each day, week, etc., counting the distinct users with events in that period. But we cannot do a "loop" in BigQuery.
One approach we considered was a JOIN followed by a COUNT over the GROUP BY intervals (using the HAVING clause to simulate the period from a given day back 4 months), but this is very inefficient and never finishes given the table's size (around 254 million records and 173 GB as of today, and growing every day).
Another approach we considered was a UDF: we would feed a list of date intervals to the function, and the function would run the naive counting query for every interval, returning the interval and its count. But UDFs in BigQuery cannot access tables from inside the UDF, so we would have to feed the whole table to the UDF somehow, which we haven't tried and which doesn't seem reasonable.
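So we have not found a solution that iterates over the same data and computes results over (as you can see, overlapping) parts of it inside BigQuery. The only solution we have is to do this outside BigQuery (which in the end provides the looping).
Is there a way, or can anyone think of a way, to do all of this inside BigQuery? Our goal is to expose it as a view inside BigQuery, so that it does not depend on an external system that has to be triggered at whatever frequency we set (days, weeks, etc.).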
Below is an example of this technique for BigQuery Standard SQL:
#standardSQL
SELECT
DAY,
COUNT(CASE WHEN period = 7 THEN id END) AS days_07,
COUNT(CASE WHEN period = 14 THEN id END) AS days_14,
COUNT(CASE WHEN period = 30 THEN id END) AS days_30
FROM (
SELECT
dates.day AS DAY,
periods.period AS period,
id
FROM yourTable AS activity
CROSS JOIN (SELECT DAY FROM yourTable GROUP BY DAY) AS dates
CROSS JOIN (SELECT period FROM (SELECT 7 AS period UNION ALL
SELECT 14 AS period UNION ALL SELECT 30 AS period)) AS periods
WHERE dates.day >= activity.day
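-- Keep activity from the last `period` days. CAST(... AS INT64) rounds to the
-- nearest integer, so testing the division against 0 covers only about half the window.
AND DATE_DIFF(dates.day, activity.day, DAY) < periods.period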
GROUP BY 1,2,3
)
GROUP BY DAY
-- ORDER BY DAY
You can play with and test this example using the dummy data below:
#standardSQL
WITH data AS (
SELECT
DAY, CAST(10 * RAND() AS INT64) AS id
FROM UNNEST(GENERATE_DATE_ARRAY('2017-01-01', '2017-07-13')) AS DAY
)
SELECT
DAY,
COUNT(DISTINCT CASE WHEN period = 7 THEN id END) AS days_07,
COUNT(DISTINCT CASE WHEN period = 14 THEN id END) AS days_14,
COUNT(DISTINCT CASE WHEN period = 30 THEN id END) AS days_30
FROM (
SELECT
dates.day AS DAY,
periods.period AS period,
id
FROM data AS activity
CROSS JOIN (SELECT DAY FROM data GROUP BY DAY) AS dates
CROSS JOIN (SELECT period FROM (SELECT 7 AS period UNION ALL
SELECT 14 AS period UNION ALL SELECT 30 AS period)) AS periods
WHERE dates.day >= activity.day
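-- Keep activity from the last `period` days. CAST(... AS INT64) rounds to the
-- nearest integer, so testing the division against 0 covers only about half the window.
AND DATE_DIFF(dates.day, activity.day, DAY) < periods.period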
GROUP BY 1,2,3
)
GROUP BY DAY
ORDER BY DAY
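For the `events` table from the question, the same idea could be adapted roughly as follows. This is only a sketch under assumptions: the anchor days are generated with GENERATE_DATE_ARRAY for the last 3 months, the window is the 4 months from the question's example, and the names `anchors`, `daily_ids`, `period_end` and `period_start` are made up for illustration. It is still a range join, so it may remain expensive on a table of that size; collapsing the data to one row per (dt, id) first is meant to shrink the join input, not to guarantee that the query finishes.

```sql
#standardSQL
-- Sketch only: `events` (id STRING, dt DATE) is the table from the question.
WITH anchors AS (
  -- One row per day for which a count is wanted (last 3 months, assumed).
  SELECT day
  FROM UNNEST(GENERATE_DATE_ARRAY(
    DATE_SUB(CURRENT_DATE(), INTERVAL 3 MONTH), CURRENT_DATE())) AS day
),
daily_ids AS (
  -- One row per (dt, id), limited to dates that any window can reach.
  SELECT DISTINCT dt, id
  FROM `events`
  WHERE dt >= DATE_SUB(DATE_SUB(CURRENT_DATE(), INTERVAL 3 MONTH), INTERVAL 4 MONTH)
)
SELECT
  a.day AS period_end,
  DATE_SUB(a.day, INTERVAL 4 MONTH) AS period_start,
  COUNT(DISTINCT e.id) AS total
FROM anchors AS a
JOIN daily_ids AS e
  ON e.dt BETWEEN DATE_SUB(a.day, INTERVAL 4 MONTH) AND a.day
GROUP BY period_end, period_start
ORDER BY period_end DESC
```

Because the query only reads `events` and uses CURRENT_DATE(), it could in principle be saved as a view, which is what the question asks for.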