First row of each group

Pio*_*ski 6 google-bigquery

I have a table containing a row for each visit to an endpoint. The table looks like this:

user_id STRING
endpoint_id STRING
created_at TIMESTAMP

Sample data:

user-1, endpoint-1, 2016-01-01 01:01:01 UTC
user-2, endpoint-1, 2016-01-01 01:01:01 UTC
user-1, endpoint-2, 2016-01-02 01:01:01 UTC
user-1, endpoint-1, 2016-01-02 01:01:01 UTC
user-1, endpoint-1, 2016-01-03 01:01:01 UTC

How can I get the first visit row for each user and resource?

What is the best way to construct this kind of query?

Expected result:

user-1, endpoint-1, 2016-01-01 01:01:01 UTC
user-2, endpoint-1, 2016-01-01 01:01:01 UTC
user-1, endpoint-2, 2016-01-02 01:01:01 UTC

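This is what I came up with, but this query does not work for large amounts of data. I use a window function to group duplicate user/resource rows together: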

SELECT
    user_id,
    endpoint_id,
    created_at
FROM (
    SELECT 
        user_id, 
        endpoint_id, 
        created_at,
        FIRST_VALUE(created_at) OVER (PARTITION BY user_id, endpoint_id ORDER BY created_at DESC) AS first_created_at
    FROM 
        [visits]
    )
WHERE
    created_at = first_created_at

Mik*_*ant 8

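How can I get the first visit row for each user and resource?

In the query from your question, you should remove DESC from ORDER BY created_at DESC; otherwise it returns the last visit, not the first.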
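For reference, here is a sketch of the question's FIRST_VALUE query with only that change applied. It is written in standard SQL, so the table is referenced as `visits` rather than [visits]; that table path is an assumption, so substitute your own. If several rows in a partition share the earliest created_at, all of them are returned.

```sql
SELECT
    user_id,
    endpoint_id,
    created_at
FROM (
    SELECT
        user_id,
        endpoint_id,
        created_at,
        -- Ascending order (the default), so FIRST_VALUE is the earliest
        -- created_at within each (user_id, endpoint_id) partition.
        FIRST_VALUE(created_at) OVER (
            PARTITION BY user_id, endpoint_id
            ORDER BY created_at
        ) AS first_created_at
    FROM
        visits  -- assumed table path
)
WHERE
    created_at = first_created_at
```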

What is the best way to construct this kind of query?

Another option is to use ROW_NUMBER(), as below:

SELECT
  user_id,
  endpoint_id,
  created_at
FROM (
  SELECT 
      user_id, 
      endpoint_id, 
      created_at,
      ROW_NUMBER() OVER(PARTITION BY user_id, endpoint_id ORDER BY created_at) AS first_created
  FROM [visits]
)
WHERE first_created = 1

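...but this query does not work for large amounts of data

It really depends. Resources Exceeded can happen if your user_id, endpoint_id partitions are large enough (because ORDER BY requires all rows of a partition to be on the same node).

If that is your case, you can use the trick below.

Step 1 - use a JOIN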

SELECT tab1.user_id AS user_id, tab1.endpoint_id AS endpoint_id, tab1.created_at AS created_at 
FROM [visits] AS tab1
INNER JOIN (
  SELECT user_id, endpoint_id, MIN(created_at) AS min_time 
  FROM [visits] 
  GROUP BY user_id, endpoint_id
) AS tab2
ON  tab1.user_id = tab2.user_id 
AND tab1.endpoint_id = tab2.endpoint_id 
AND tab1.created_at = tab2.min_time  

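Step 2 - there is one more thing to watch for here: duplicate entries for the same user/resource with the same earliest created_at. In that case the JOIN returns all of them, so you still need to extract only one row per partition. See the final query below.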

SELECT user_id, endpoint_id, created_at
FROM (
  SELECT user_id, endpoint_id, created_at, 
    ROW_NUMBER() OVER (PARTITION BY user_id, endpoint_id) AS rn 
  FROM (
    SELECT tab1.user_id AS user_id, tab1.endpoint_id AS endpoint_id, tab1.created_at AS created_at 
    FROM [visits]  AS tab1
    INNER JOIN (
      SELECT user_id, endpoint_id, MIN(created_at) AS min_time 
      FROM [visits]  
      GROUP BY user_id, endpoint_id
    ) AS tab2
    ON  tab1.user_id = tab2.user_id 
    AND tab1.endpoint_id = tab2.endpoint_id 
    AND tab1.created_at = tab2.min_time
  )
)
WHERE rn = 1  

Of course, there is the obvious and simplest case: if those three fields are the only fields in the [visits] table

SELECT user_id, endpoint_id, MIN(created_at) AS created_at 
FROM [visits]
GROUP BY user_id, endpoint_id


Dav*_*sip 7

You can now use qualify for a more concise solution:

  select 
      user_id, 
      endpoint_id, 
      created_at,
  from visits
  where true
  qualify ROW_NUMBER() OVER(PARTITION BY user_id, endpoint_id ORDER BY created_at) = 1


Joh*_*y V 6

I have another solution that avoids window functions (which I think are slow in BQ) as well as subqueries (which add complexity):

select
   group_column
   ,array_agg(t order by time_column asc limit 1)[safe_offset(0)] AS first_row
from table AS t
group by 1

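array_agg returns an array containing a struct for the first row of each group. That struct is extracted from the array with [safe_offset(0)]. You can then use first_row.column_1 to pull individual columns out of the struct. Alternatively, you can wrap a select statement around it to extract the columns from the struct: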

select first_row.* from (
  select
     group_column
     ,array_agg(t order by time_column asc limit 1)[safe_offset(0)] AS first_row
  from table AS t
  group by 1
)
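Applied to the table from the question, this pattern might look like the sketch below. It assumes standard SQL and a table reachable as `visits` (an assumed path), and it groups by both user_id and endpoint_id, since the question wants one row per user and resource.

```sql
SELECT first_row.*
FROM (
  SELECT
    user_id,
    endpoint_id,
    -- Keep only the earliest row of each group as a struct, then unwrap
    -- the one-element array; SAFE_OFFSET returns NULL instead of raising
    -- an error if the array is empty.
    ARRAY_AGG(t ORDER BY created_at ASC LIMIT 1)[SAFE_OFFSET(0)] AS first_row
  FROM visits AS t  -- assumed table path
  GROUP BY user_id, endpoint_id
)
```

On the sample data in the question, this should yield the three rows listed under the expected result, though the output order is not guaranteed without an explicit ORDER BY.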