DISTINCT and GROUP BY in BigQuery

F.D*_*F.D 4 sql reddit google-bigquery

Starting from "Select first row in each GROUP BY group?", I am trying to do something very similar in Google BigQuery.

Dataset: fh-bigquery:reddit_comments.2018_01

Goal: for each link_id (Reddit submission), select the first comment according to created_utc

SELECT body,link_id 
FROM [fh-bigquery:reddit_comments.2018_01] 
where subreddit_id == "t5_2zkvo"  
group by  link_id ,body, created_utc  
order by link_id ,body,  created_utc desc 

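Currently it does not work, because it still does not give me unique/distinct parent_id(s).

Thank you!


Edit: I was wrong to say that parent_id identifies the submission; it is actually link_id.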

Mik*_*ant 7

Below is for BigQuery Standard SQL:

#standardSQL
SELECT 
  ARRAY_AGG(body ORDER BY created_utc LIMIT 1)[OFFSET(0)] body, 
  link_id
FROM `fh-bigquery.reddit_comments.2018_01`
WHERE subreddit_id = 't5_2zkvo'
GROUP BY link_id
-- ORDER BY link_id

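For anyone unfamiliar with the ARRAY_AGG(body ORDER BY created_utc LIMIT 1)[OFFSET(0)] idiom: per link_id it orders the bodies by created_utc, keeps only the first one, and then unwraps the resulting one-element array. A rough plain-Python equivalent, as a sketch only (the rows are invented and first_body_per_link is a hypothetical helper, not part of any BigQuery API):

```python
def first_body_per_link(comments):
    """For each link_id, return the body of the comment with the smallest created_utc.

    `comments` is an iterable of (link_id, body, created_utc) tuples.
    On ties, the first row encountered wins (an arbitrary choice for this sketch).
    """
    earliest = {}  # link_id -> (created_utc, body)
    for link_id, body, created_utc in comments:
        if link_id not in earliest or created_utc < earliest[link_id][0]:
            earliest[link_id] = (created_utc, body)
    return {link_id: body for link_id, (_, body) in earliest.items()}


# Invented sample rows, for illustration only.
sample_comments = [
    ("t3_a", "second on a", 20),
    ("t3_a", "first on a", 10),
    ("t3_b", "only on b", 5),
]

if __name__ == "__main__":
    print(first_body_per_link(sample_comments))
```
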

Tim*_*sen 6

We can use ROW_NUMBER() here:

SELECT body, parent_id, created_utc
FROM
(
    SELECT *, ROW_NUMBER() OVER (PARTITION BY parent_id ORDER BY created_utc) rn
    FROM [fh-bigquery:reddit_comments.2018_01]
    WHERE subreddit_id = 't5_2zkvo'
) t
WHERE rn = 1
ORDER BY parent_id ,body, created_utc DESC;

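Since the question's edit says the grouping key is really link_id, here is a small runnable sketch of the same ROW_NUMBER() pattern partitioned by link_id instead. To keep it self-contained it runs against an in-memory SQLite table rather than BigQuery; that substitution, the table name comments, and the sample rows are all assumptions made for illustration (window functions need SQLite 3.25 or newer).

```python
import sqlite3

# Invented rows standing in for the reddit comments table.
sample_rows = [
    ("t3_a", "second on a", 20),
    ("t3_a", "first on a", 10),
    ("t3_b", "only on b", 5),
    ("t3_c", "late on c", 99),
    ("t3_c", "early on c", 42),
]

# Number the comments within each link_id by created_utc, then keep row 1.
FIRST_COMMENT_SQL = """
SELECT link_id, body, created_utc
FROM (
    SELECT link_id, body, created_utc,
           ROW_NUMBER() OVER (PARTITION BY link_id ORDER BY created_utc) AS rn
    FROM comments
) AS t
WHERE rn = 1
ORDER BY link_id
"""


def first_comment_per_link(rows):
    """Load rows into a throwaway SQLite table and return the earliest comment per link_id."""
    conn = sqlite3.connect(":memory:")
    try:
        conn.execute(
            "CREATE TABLE comments (link_id TEXT, body TEXT, created_utc INTEGER)"
        )
        conn.executemany("INSERT INTO comments VALUES (?, ?, ?)", rows)
        return conn.execute(FIRST_COMMENT_SQL).fetchall()
    finally:
        conn.close()


if __name__ == "__main__":
    for row in first_comment_per_link(sample_rows):
        print(row)
```

The inner query and the rn = 1 filter are the part that carries over; only the table reference and quoting would differ in BigQuery.
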
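Note that you could continue with your current approach, but you would have to phrase the query as a join between the table and a subquery that finds the earliest created_utc for each parent_id: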

SELECT t1.*
FROM [fh-bigquery:reddit_comments.2018_01] t1
INNER JOIN
(
    SELECT parent_id, MIN(created_utc) AS first_created_utc
    FROM [fh-bigquery:reddit_comments.2018_01]
    GROUP BY parent_id
) t2
    ON t1.parent_id = t2.parent_id AND t1.created_utc = t2.first_created_utc;
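
One caveat with the MIN-plus-join form: if two comments in the same group share the earliest created_utc, the join matches both of them, so a group can come back with more than one row (ROW_NUMBER() with rn = 1 always returns exactly one). The sketch below shows this on invented data, keyed on link_id as in the question's edit and run against in-memory SQLite purely so it can be tried locally; the table name comments and the rows are assumptions.

```python
import sqlite3

# Invented rows: two comments on t3_a share the earliest timestamp.
tied_rows = [
    ("t3_a", "tie 1", 10),
    ("t3_a", "tie 2", 10),
    ("t3_a", "later", 30),
    ("t3_b", "only", 7),
]

# Join each row to its group's minimum created_utc.
EARLIEST_JOIN_SQL = """
SELECT t1.link_id, t1.body, t1.created_utc
FROM comments AS t1
INNER JOIN (
    SELECT link_id, MIN(created_utc) AS first_created_utc
    FROM comments
    GROUP BY link_id
) AS t2
  ON t1.link_id = t2.link_id AND t1.created_utc = t2.first_created_utc
ORDER BY t1.link_id, t1.body
"""


def earliest_via_join(rows):
    """Return every row whose created_utc equals the minimum for its link_id."""
    conn = sqlite3.connect(":memory:")
    try:
        conn.execute(
            "CREATE TABLE comments (link_id TEXT, body TEXT, created_utc INTEGER)"
        )
        conn.executemany("INSERT INTO comments VALUES (?, ?, ?)", rows)
        return conn.execute(EARLIEST_JOIN_SQL).fetchall()
    finally:
        conn.close()


if __name__ == "__main__":
    # t3_a appears twice because of the tie on created_utc.
    for row in earliest_via_join(tied_rows):
        print(row)
```

If exactly one row per group is required, the ROW_NUMBER() form (optionally with a tie-breaker column in its ORDER BY) is the safer choice.
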