Scalable solution to get the latest row for each ID in BigQuery

S.M*_* sh 11 sql google-bigquery

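I have a large table with a field ID and another field collection_time. I want to select the latest record for each ID. Unfortunately, the combination (ID, collection_time) is not unique in my data. I want just one of the records with the maximum collection_time. I tried two solutions, but neither worked for me:

First: using the query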

SELECT * FROM
(SELECT *, ROW_NUMBER() OVER (PARTITION BY ID ORDER BY collection_time DESC) AS rn
FROM mytable) WHERE rn = 1

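This results in a Resources exceeded error, which I guess is because of the ORDER BY in the query.

Second: using a join between the table and the latest time per ID: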

(SELECT tab1.* 
FROM mytable AS tab1
INNER JOIN EACH 
(SELECT ID, MAX(collection_time) AS second_time 
FROM mytable GROUP EACH BY ID) AS tab2
ON tab1.ID=tab2.ID AND tab1.collection_time=tab2.second_time) 

This solution does not work for me because (ID, collection_time) is not unique, so the JOIN result can contain multiple rows for each ID.
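To make the duplication concrete, here is a minimal sketch in BigQuery standard SQL with made-up inline data. The payload column and all values are hypothetical and only serve the illustration. ID 1 has two rows sharing the latest collection_time, and both survive the join:

```sql
-- Hypothetical sample data: ID 1 has two rows with the same latest collection_time.
WITH mytable AS (
  SELECT 1 AS ID, 10 AS collection_time, 'a' AS payload UNION ALL
  SELECT 1, 20, 'b' UNION ALL
  SELECT 1, 20, 'c' UNION ALL
  SELECT 2, 5, 'd'
)
SELECT tab1.*
FROM mytable AS tab1
INNER JOIN (
  -- Latest collection_time per ID
  SELECT ID, MAX(collection_time) AS second_time
  FROM mytable
  GROUP BY ID
) AS tab2
ON tab1.ID = tab2.ID AND tab1.collection_time = tab2.second_time;
```

This returns three rows rather than two: both rows for ID 1 with collection_time 20, plus the single row for ID 2.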

Is there a workaround for the resourcesExceeded error, or a different query that would work in my case?

Ser*_*ron 8

A short and scalable version:

select array_agg(t order by collection_time desc limit 1)[offset(0)].*
from mytable t
group by t.id;
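As a worked illustration of how this behaves on tied timestamps, the following sketch runs the same query over made-up inline data (the payload column and all values are hypothetical). ARRAY_AGG with ORDER BY ... LIMIT 1 keeps a single element per group, and [OFFSET(0)].* expands that one row back into columns:

```sql
-- Hypothetical sample data: id 1 has two rows tied on the latest collection_time.
WITH mytable AS (
  SELECT 1 AS id, 10 AS collection_time, 'a' AS payload UNION ALL
  SELECT 1, 20, 'b' UNION ALL
  SELECT 1, 20, 'c' UNION ALL
  SELECT 2, 5, 'd'
)
SELECT
  -- Keep only the newest row of each group, then expand it into columns.
  ARRAY_AGG(t ORDER BY collection_time DESC LIMIT 1)[OFFSET(0)].*
FROM mytable AS t
GROUP BY t.id;
```

The result has exactly one row per id. For id 1, which of the two tied rows is kept is not determined by the query; adding a second sort key inside the ORDER BY would make the choice deterministic.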


Rub*_*oot 8

I see that no one has mentioned using a window function with QUALIFY yet:

SELECT *, MAX(collection_time) OVER (PARTITION BY id) AS max_timestamp
FROM my_table
QUALIFY collection_time = max_timestamp

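The window function adds a max_timestamp column that can be referenced in the QUALIFY clause for filtering. Note that if several rows share the maximum collection_time for an id, all of them are returned.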
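If exactly one row per id is needed even when timestamps tie, one possible variant is to filter on ROW_NUMBER() instead of MAX(). This is a sketch over made-up inline data (the payload column and all values are hypothetical), not a tested recommendation for very large tables; it still sorts within each partition, so whether it avoids the Resources exceeded error on your data is something to verify:

```sql
-- Hypothetical sample data: id 1 has two rows tied on the latest collection_time.
WITH my_table AS (
  SELECT 1 AS id, 10 AS collection_time, 'a' AS payload UNION ALL
  SELECT 1, 20, 'b' UNION ALL
  SELECT 1, 20, 'c' UNION ALL
  SELECT 2, 5, 'd'
)
SELECT *
FROM my_table
-- WHERE TRUE is a no-op, included defensively on the assumption that
-- QUALIFY may need to be accompanied by WHERE, GROUP BY, or HAVING.
WHERE TRUE
-- Number rows per id from newest to oldest and keep only the first.
QUALIFY ROW_NUMBER() OVER (PARTITION BY id ORDER BY collection_time DESC) = 1;
```

This returns one row per id. As with any ROW_NUMBER() over tied sort keys, which of the tied rows gets number 1 is arbitrary unless a tie-breaking column is added to the ORDER BY.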


小智 7

SELECT
  agg.table.*
FROM (
  SELECT
    id,
    ARRAY_AGG(STRUCT(table)
    ORDER BY
      collection_time DESC)[SAFE_OFFSET(0)] agg
  FROM
    `dataset.table` table
  GROUP BY
    id)

This will do the job for you. It is also scalable in the sense that if the schema keeps changing, you won't have to change the query.


Mik*_*ant 5

Quick and dirty option: combine your two queries into one. First get all records with the latest collection_time (using your second query), then dedup them using your first query:

SELECT * FROM (
  SELECT *, ROW_NUMBER() OVER (PARTITION BY tab1.ID) AS rn 
  FROM (
    SELECT tab1.* 
    FROM mytable AS tab1
    INNER JOIN (
      SELECT ID, MAX(collection_time) AS second_time 
      FROM mytable GROUP BY ID
    ) AS tab2
    ON tab1.ID=tab2.ID AND tab1.collection_time=tab2.second_time
  )
)
WHERE rn = 1  

And with Standard SQL (proposed by S.Mohsen sh):

WITH myTable AS (
  SELECT 1 AS ID, 1 AS collection_time
),
tab1 AS (
  SELECT ID,
  MAX(collection_time) AS second_time 
  FROM myTable GROUP BY ID
),
tab2 AS (
  SELECT * FROM myTable
),
joint AS (
  SELECT tab2.* 
  FROM tab2 INNER JOIN tab1
  ON tab2.ID=tab1.ID AND tab2.collection_time=tab1.second_time 
)
SELECT * EXCEPT(rn) 
FROM (
  SELECT *, ROW_NUMBER() OVER (PARTITION BY ID) AS rn 
  FROM joint
)
WHERE rn=1