I have a table with one row for each visit to an endpoint. The table looks like this:
user_id STRING
endpoint_id STRING
created_at TIMESTAMP
Sample data:
user-1, endpoint-1, 2016-01-01 01:01:01 UTC
user-2, endpoint-1, 2016-01-01 01:01:01 UTC
user-1, endpoint-2, 2016-01-02 01:01:01 UTC
user-1, endpoint-1, 2016-01-02 01:01:01 UTC
user-1, endpoint-1, 2016-01-03 01:01:01 UTC
How can I get the first-visit row for each user and endpoint?
What is the best way to structure such a query?
Expected result:
user-1, endpoint-1, 2016-01-01 01:01:01 UTC
user-2, endpoint-1, 2016-01-01 01:01:01 UTC
user-1, endpoint-2, 2016-01-02 01:01:01 UTC
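Here is what I came up with, but this query does not work on large amounts of data. I use a window function to group the duplicate user/endpoint rows together: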
SELECT
user_id,
endpoint_id,
created_at
FROM (
SELECT
user_id,
endpoint_id,
created_at,
FIRST_VALUE(created_at) OVER (PARTITION BY user_id, endpoint_id ORDER BY created_at DESC) AS first_created_at
FROM
[visits]
)
WHERE
created_at = first_created_at
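"How can I get the first-visit row for each user and endpoint?"
In the query from your question, the DESC should be removed from ORDER BY created_at DESC; otherwise it returns the last visit, not the first.
"What is the best way to structure such a query?"
Another option is to use ROW_NUMBER(), as below: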
SELECT
user_id,
endpoint_id,
created_at
FROM (
SELECT
user_id,
endpoint_id,
created_at,
ROW_NUMBER() OVER(PARTITION BY user_id, endpoint_id ORDER BY created_at) AS first_created
FROM [visits]
)
WHERE first_created = 1
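"...but this query does not work on large amounts of data"
That really depends. Resources Exceeded can happen if your user_id, endpoint_id partitions are large enough (because ORDER BY requires all rows of a partition to be on the same node).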
If that is your case, you can use the trick below.
Step 1 - use a JOIN
SELECT tab1.user_id AS user_id, tab1.endpoint_id AS endpoint_id, tab1.created_at AS created_at
FROM [visits] AS tab1
INNER JOIN (
SELECT user_id, endpoint_id, MIN(created_at) AS min_time
FROM [visits]
GROUP BY user_id, endpoint_id
) AS tab2
ON tab1.user_id = tab2.user_id
AND tab1.endpoint_id = tab2.endpoint_id
AND tab1.created_at = tab2.min_time
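To check the Step 1 query against the sample data from the question, here is a minimal sketch that runs the same JOIN in an in-memory SQLite database from Python. SQLite is only a stand-in for BigQuery here, so the table is named visits without brackets and timestamps are stored as text; the function name first_visits is made up for this illustration, and the final ORDER BY is added only to make the output order deterministic.

```python
import sqlite3

# Sample rows from the question. Timestamps are kept as ISO-style text,
# which sorts correctly under MIN() in SQLite (assumption of this sketch).
ROWS = [
    ("user-1", "endpoint-1", "2016-01-01 01:01:01"),
    ("user-2", "endpoint-1", "2016-01-01 01:01:01"),
    ("user-1", "endpoint-2", "2016-01-02 01:01:01"),
    ("user-1", "endpoint-1", "2016-01-02 01:01:01"),
    ("user-1", "endpoint-1", "2016-01-03 01:01:01"),
]

# Same shape as the Step 1 query: join each row to the per-group minimum.
FIRST_VISIT_SQL = """
SELECT tab1.user_id, tab1.endpoint_id, tab1.created_at
FROM visits AS tab1
INNER JOIN (
  SELECT user_id, endpoint_id, MIN(created_at) AS min_time
  FROM visits
  GROUP BY user_id, endpoint_id
) AS tab2
ON tab1.user_id = tab2.user_id
AND tab1.endpoint_id = tab2.endpoint_id
AND tab1.created_at = tab2.min_time
ORDER BY tab1.created_at, tab1.user_id
"""


def first_visits(rows):
    """Load rows into a temporary visits table and return the first visits."""
    conn = sqlite3.connect(":memory:")
    try:
        conn.execute(
            "CREATE TABLE visits (user_id TEXT, endpoint_id TEXT, created_at TEXT)"
        )
        conn.executemany("INSERT INTO visits VALUES (?, ?, ?)", rows)
        return conn.execute(FIRST_VISIT_SQL).fetchall()
    finally:
        conn.close()


if __name__ == "__main__":
    for row in first_visits(ROWS):
        print(row)
```

On the sample data this should return the three rows listed under "Expected result" in the question. It says nothing about BigQuery performance; it only confirms the logic of the query.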
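Step 2 - there is one more thing to take care of here: duplicate entries for the same user/endpoint. If several rows share the same earliest created_at, the JOIN returns all of them, so you still need to extract only one row per partition. See the final query below: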
SELECT user_id, endpoint_id, created_at
FROM (
SELECT user_id, endpoint_id, created_at,
ROW_NUMBER() OVER (PARTITION BY user_id, endpoint_id) AS rn
FROM (
SELECT tab1.user_id AS user_id, tab1.endpoint_id AS endpoint_id, tab1.created_at AS created_at
FROM [visits] AS tab1
INNER JOIN (
SELECT user_id, endpoint_id, MIN(created_at) AS min_time
FROM [visits]
GROUP BY user_id, endpoint_id
) AS tab2
ON tab1.user_id = tab2.user_id
AND tab1.endpoint_id = tab2.endpoint_id
AND tab1.created_at = tab2.min_time
)
)
WHERE rn = 1
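The effect of Step 2 can be seen on a small made-up example where the first visit was logged twice. The sketch below again uses an in-memory SQLite database from Python as a stand-in for BigQuery (it assumes a SQLite build with window function support, version 3.25 or later); the data and the helper name run_query are hypothetical.

```python
import sqlite3

# Hypothetical data: the earliest visit appears twice with the same timestamp.
ROWS_WITH_TIE = [
    ("user-1", "endpoint-1", "2016-01-01 01:01:01"),
    ("user-1", "endpoint-1", "2016-01-01 01:01:01"),
    ("user-1", "endpoint-1", "2016-01-02 01:01:01"),
]

# Step 1: join each row to the per-group minimum (keeps tied rows).
JOIN_SQL = """
SELECT tab1.user_id AS user_id, tab1.endpoint_id AS endpoint_id,
       tab1.created_at AS created_at
FROM visits AS tab1
INNER JOIN (
  SELECT user_id, endpoint_id, MIN(created_at) AS min_time
  FROM visits
  GROUP BY user_id, endpoint_id
) AS tab2
ON tab1.user_id = tab2.user_id
AND tab1.endpoint_id = tab2.endpoint_id
AND tab1.created_at = tab2.min_time
"""

# Step 2: number the surviving rows per partition and keep only one.
DEDUP_SQL = f"""
SELECT user_id, endpoint_id, created_at
FROM (
  SELECT user_id, endpoint_id, created_at,
         ROW_NUMBER() OVER (PARTITION BY user_id, endpoint_id) AS rn
  FROM ({JOIN_SQL})
)
WHERE rn = 1
"""


def run_query(sql, rows):
    """Load rows into a temporary visits table and run sql against it."""
    conn = sqlite3.connect(":memory:")
    try:
        conn.execute(
            "CREATE TABLE visits (user_id TEXT, endpoint_id TEXT, created_at TEXT)"
        )
        conn.executemany("INSERT INTO visits VALUES (?, ?, ?)", rows)
        return conn.execute(sql).fetchall()
    finally:
        conn.close()


if __name__ == "__main__":
    print(len(run_query(JOIN_SQL, ROWS_WITH_TIE)), "row(s) after the JOIN alone")
    print(len(run_query(DEDUP_SQL, ROWS_WITH_TIE)), "row(s) after ROW_NUMBER")
```

The window here only has to number the few rows that survived the JOIN and needs no ORDER BY, which is presumably why it is much cheaper than ordering every row of a large partition.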
Of course, there is the obvious and simplest case - if those three fields are the only fields in the [visits] table:
SELECT user_id, endpoint_id, MIN(created_at) AS created_at
FROM [visits]
GROUP BY user_id, endpoint_id
You can now use QUALIFY for a more concise solution:
select
user_id,
endpoint_id,
created_at,
from visits  -- qualify is standard SQL, where the legacy [visits] bracket form is not a valid table reference
where true
qualify ROW_NUMBER() OVER(PARTITION BY user_id, endpoint_id ORDER BY created_at) = 1
I have another solution that avoids window functions (which I think are slow in BQ) as well as subqueries (which add complexity):
select
group_column
,array_agg(t order by time_column asc limit 1)[safe_offset(0)] AS first_row
from table AS t
group by 1
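array_agg returns an array containing a struct of the first row of each group. That struct is extracted from the array with [safe_offset(0)]. You can then use first_row.column_1 to extract individual fields from the struct, or you can wrap a select statement around it to extract all columns from the struct: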
select first_row.* from (
select
group_column
,array_agg(t order by time_column asc limit 1)[safe_offset(0)] AS first_row
from table AS t
group by 1
)