In Google BigQuery, I have a table with the following schema:
startTime:STRING, visitorId:STRING, category:STRING
Sample contents:
startTime visitorId category
------------------- --------- --------
2013-11-27 00:00:00 A X
2013-11-27 05:00:00 A X
2013-11-27 07:00:00 B X
2013-11-28 08:00:00 C X
I would like to get the following result:
day category runningCountOfDistinctVisitors
--------- -------- ------------------------------
2013-11-27 X 2
2013-11-28 X 3
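To make the target precise, here is a minimal Python sketch of the computation I am after. It is not BigQuery code, and the function name `running_distinct_visitors` is made up for illustration. For each category it walks the days in order and reports how many distinct visitors have been seen up to and including that day.

```python
from collections import defaultdict

def running_distinct_visitors(rows):
    """rows: iterable of (startTime, visitorId, category) strings.
    Returns [(day, category, running_count)] sorted by category, then day."""
    # Group visitor IDs by category and day (the first 10 characters of startTime).
    by_cat_day = defaultdict(lambda: defaultdict(set))
    for start_time, visitor_id, category in rows:
        by_cat_day[category][start_time[:10]].add(visitor_id)

    result = []
    for category in sorted(by_cat_day):
        seen = set()  # all visitors seen so far in this category
        for day in sorted(by_cat_day[category]):
            seen |= by_cat_day[category][day]
            result.append((day, category, len(seen)))
    return result

rows = [
    ("2013-11-27 00:00:00", "A", "X"),
    ("2013-11-27 05:00:00", "A", "X"),
    ("2013-11-27 07:00:00", "B", "X"),
    ("2013-11-28 08:00:00", "C", "X"),
]
for row in running_distinct_visitors(rows):
    print(*row)
# 2013-11-27 X 2
# 2013-11-28 X 3
```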
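I have tried the following query, but it does not seem to work (it has been running for more than 3 hours on a table of 1.2 million rows and still has not finished):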
SELECT left(a.startTime,10) as day,
a.category,
count(distinct a.visitorId) as runningCountOfDistinctVisitors
FROM [MyDataset.MyTable] a
LEFT JOIN EACH [MyDataset.MyTable] b ON a.category = b.category
WHERE left(b.startTime,10) < left(a.startTime,10)
GROUP EACH BY a.category, day
ORDER BY a.category, day
I also tried using window (partition) functions, but they do not seem to support count distinct.
Try this:
ts:TIMESTAMP, visitor:STRING, category:STRING
ts visitor category
----------------------- ------- --------
2013-11-27 00:00:00 UTC A X
2013-11-27 00:00:00 UTC A X
2013-11-27 00:00:00 UTC B X
2013-11-28 00:00:00 UTC C X
2013-11-27 00:00:00 UTC A Y
2013-11-28 00:00:00 UTC B Y
2013-11-29 00:00:00 UTC C Y
Query:
SELECT
  day,
  category,
  SUM(cd) OVER (PARTITION BY category ORDER BY day) AS running_total
FROM (
  SELECT DATE(ts) AS day, category, COUNT(DISTINCT visitor) AS cd
  FROM [test.runningtotal]
  GROUP BY day, category
)
This produces:
day category running_total
---------- -------- -------------
2013-11-27 X 2
2013-11-28 X 3
2013-11-27 Y 1
2013-11-28 Y 2
2013-11-29 Y 3
I have not tested it on a large dataset, but it is probably faster than the JOIN solution.
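One caveat worth checking against real data: a running sum of per-day distinct counts equals a running count of distinct visitors only if no visitor shows up on more than one day within a category, which happens to hold for the sample rows above. The Python sketch below is an illustration, not BigQuery code, and all function names in it are hypothetical. It compares that approach with a variant that counts each visitor only on the first day they appear.

```python
from collections import defaultdict

def _running(daily):
    """daily maps (category, day) -> set of visitors; returns running sums per category."""
    out = []
    totals = defaultdict(int)
    for category, day in sorted(daily):  # ISO dates sort chronologically as strings
        totals[category] += len(daily[(category, day)])
        out.append((day, category, totals[category]))
    return out

def running_sum_of_daily_distinct(rows):
    # Mirrors the query above: distinct visitors per (day, category), then a running sum.
    daily = defaultdict(set)
    for day, visitor, category in rows:
        daily[(category, day)].add(visitor)
    return _running(daily)

def running_first_appearance(rows):
    # Count each visitor once, on the earliest day they appear in a category.
    first_day = {}
    for day, visitor, category in rows:
        key = (category, visitor)
        if key not in first_day or day < first_day[key]:
            first_day[key] = day
    daily = defaultdict(set)
    for (category, visitor), day in first_day.items():
        daily[(category, day)].add(visitor)
    return _running(daily)

# Hypothetical data: visitor A returns on the second day.
repeat_rows = [
    ("2013-11-27", "A", "X"),
    ("2013-11-28", "A", "X"),
    ("2013-11-28", "B", "X"),
]
print(running_sum_of_daily_distinct(repeat_rows))
# [('2013-11-27', 'X', 1), ('2013-11-28', 'X', 3)]  (A is counted twice)
print(running_first_appearance(repeat_rows))
# [('2013-11-27', 'X', 1), ('2013-11-28', 'X', 2)]
```

In SQL terms, the second variant would correspond to computing each visitor's first day per category in an inner query (grouping by category and visitor and taking the minimum date), counting rows per first day, and then applying the same running sum. Note that days on which no new visitor appears would be missing from that output unless they are joined back in.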