S.M*_* sh 11 sql google-bigquery
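I have a large table with a field ID and another field collection_time. I want to select the latest record for each ID. Unfortunately, the combination (ID, collection_time) is not unique in my data. I want just one of the records with the maximum collection_time. I have tried two solutions, but neither worked for me:
First: using the query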
SELECT * FROM
  (SELECT *, ROW_NUMBER() OVER (PARTITION BY ID ORDER BY collection_time DESC) AS rn
   FROM mytable) WHERE rn=1
This results in a Resources exceeded error, which I guess is caused by the ORDER BY in the query.
Second: using a join between the table and the latest time:
(SELECT tab1.*
FROM mytable AS tab1
INNER JOIN EACH
(SELECT ID, MAX(collection_time) AS second_time
FROM mytable GROUP EACH BY ID) AS tab2
ON tab1.ID=tab2.ID AND tab1.collection_time=tab2.second_time)
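This solution does not work for me because (ID, collection_time) pairs are not unique, so the JOIN result contains multiple rows for each ID.
I am wondering whether there is a workaround for the resourcesExceeded error, or a different query that would work in my case?
Short and scalable version: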
select array_agg(t order by collection_time desc limit 1)[offset(0)].*
from mytable t
group by t.id;
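To see how this behaves when timestamps tie, here is a minimal sketch in BigQuery Standard SQL. The inline sample rows and the `payload` column are made up for illustration and are not part of the question's data:

```sql
-- Hypothetical sample data: id 1 has two rows sharing the latest collection_time.
WITH mytable AS (
  SELECT 1 AS id, 10 AS collection_time, 'a' AS payload UNION ALL
  SELECT 1, 10, 'b' UNION ALL
  SELECT 1, 5, 'c' UNION ALL
  SELECT 2, 7, 'd'
)
-- Aggregate each id's rows into an array ordered newest first, keep only
-- the first element, then expand that struct back into columns.
SELECT ARRAY_AGG(t ORDER BY collection_time DESC LIMIT 1)[OFFSET(0)].*
FROM mytable t
GROUP BY t.id;
```

This should return exactly one row per id. For id 1, which of the two tied rows comes back is not determined by the query.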
I see no one has mentioned window functions with QUALIFY yet:
SELECT *, MAX(collection_time) OVER (PARTITION BY id) AS max_timestamp
FROM my_table
QUALIFY collection_time = max_timestamp
The window function adds a column max_timestamp that can be referenced in the QUALIFY clause for filtering.
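Note that when several rows share the maximum collection_time for an id, the equality filter keeps all of them. A minimal sketch of a variant that keeps exactly one row per id, assuming it does not matter which of the tied rows survives, filters on ROW_NUMBER() instead. The `WHERE TRUE` is only a no-op placeholder: as far as I recall, BigQuery at one point required QUALIFY to be accompanied by a WHERE, GROUP BY, or HAVING clause.

```sql
SELECT *
FROM my_table
WHERE TRUE  -- no-op filter, kept in case QUALIFY needs an accompanying WHERE
-- Number the rows of each id from newest to oldest and keep only the first.
QUALIFY ROW_NUMBER() OVER (PARTITION BY id ORDER BY collection_time DESC) = 1
```

Like the first query in the question, this sorts within each partition, so on very large partitions it may run into the same resource limits.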
Anonymous 7
SELECT
  agg.table.*
FROM (
  SELECT
    id,
    ARRAY_AGG(STRUCT(table)
      ORDER BY
        collection_time DESC)[SAFE_OFFSET(0)] agg
  FROM
    `dataset.table` table
  GROUP BY
    id)
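This will do the job for you, and it scales in the sense that you do not have to change the query when the schema keeps changing.
Quick and dirty option: combine your two queries into one. First get all records with the latest collection_time (using your second query), then dedup them using your first query: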
SELECT * FROM (
  SELECT *, ROW_NUMBER() OVER (PARTITION BY tab1.ID) AS rn
  FROM (
    SELECT tab1.*
    FROM mytable AS tab1
    INNER JOIN (
      SELECT ID, MAX(collection_time) AS second_time
      FROM mytable GROUP BY ID
    ) AS tab2
    ON tab1.ID=tab2.ID AND tab1.collection_time=tab2.second_time
  )
)
WHERE rn = 1
And with Standard SQL (proposed by S.Mohsen sh):
WITH myTable AS (
  SELECT 1 AS ID, 1 AS collection_time
),
tab1 AS (
  SELECT ID,
    MAX(collection_time) AS second_time
  FROM myTable GROUP BY ID
),
tab2 AS (
  SELECT * FROM myTable
),
joint AS (
  SELECT tab2.*
  FROM tab2 INNER JOIN tab1
  ON tab2.ID=tab1.ID AND tab2.collection_time=tab1.second_time
)
SELECT * EXCEPT(rn)
FROM (
  SELECT *, ROW_NUMBER() OVER (PARTITION BY ID) AS rn
  FROM joint
)
WHERE rn=1