Fel*_*ffa 1 sql reddit google-bigquery
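I'm looking at the reddit dataset and at an older question about finding bigrams with BigQuery, but the answer to that question doesn't handle URLs, quotes, and so on. Is there a better way to do this? Can it also be generalized to trigrams instead of bigrams?
This will do it: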
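SELECT word, nextword, nextword2, COUNT(*) c
FROM (
  -- Pair every word with the two words that follow it in the same comment
  SELECT pos, id, word,
    LEAD(word) OVER(PARTITION BY id ORDER BY pos) nextword,
    LEAD(word, 2) OVER(PARTITION BY id ORDER BY pos) nextword2
  FROM (
    SELECT id, word, pos FROM FLATTEN(
      -- Restore apostrophes and record each word's position in its comment
      (SELECT id, REGEXP_REPLACE(word, 'QUOTE', "'") word, POSITION(word) pos FROM
        -- Lowercase, protect apostrophes, collapse URLs, split at word boundaries
        (SELECT id, SPLIT(REGEXP_REPLACE(REGEXP_REPLACE(REGEXP_REPLACE(LOWER(body), "'", 'QUOTE'), r'http.?://[^ ]*', r'URL'), r'\b', ' '), ' ') word
        FROM [fh-bigquery:reddit_comments.2016_01]
        WHERE score>200
        -- Drop tokens with no letters or digits (whitespace, punctuation)
        HAVING REGEXP_MATCH(word, '[a-zA-Z0-9]')
        )
      ), word)
  ))
-- Keep only complete trigrams (a non-null second follower implies a non-null first)
WHERE nextword2 IS NOT null
GROUP EACH BY 1, 2, 3
ORDER BY c DESC
LIMIT 100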
(Note that I'm filtering for comments with score > 200 to get faster results. You can remove that filter to process the whole month.)
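If you want to sanity-check the cleanup steps on a few strings before running anything on BigQuery, here is a minimal local sketch in Python that mirrors what the query does: lowercase, protect apostrophes with a QUOTE placeholder, collapse URLs into a URL token, split at word boundaries, drop tokens with no letters or digits, then slide a window of three over the result. The function names (tokenize, trigrams, count_trigrams) are made up for illustration, and Python's re module is not the same engine as BigQuery's RE2, so edge cases may differ.

```python
import re
from collections import Counter


def tokenize(body):
    """Apply the same cleanup steps as the query to one comment body."""
    text = body.lower()
    # Protect apostrophes so the \b split does not break "don't" apart.
    text = text.replace("'", "QUOTE")
    # Collapse each URL (up to the next space) into a single URL token.
    text = re.sub(r"http.?://[^ ]*", "URL", text)
    # Put a space at every word boundary, then split on single spaces.
    text = re.sub(r"\b", " ", text)
    # Keep only tokens containing at least one letter or digit.
    words = [w for w in text.split(" ") if re.search(r"[a-zA-Z0-9]", w)]
    # Restore the apostrophes.
    return [w.replace("QUOTE", "'") for w in words]


def trigrams(tokens):
    """Return every run of three consecutive tokens (like LEAD 1 and LEAD 2)."""
    return list(zip(tokens, tokens[1:], tokens[2:]))


def count_trigrams(bodies):
    """Count trigrams per comment, never spanning two comments."""
    counts = Counter()
    for body in bodies:
        counts.update(trigrams(tokenize(body)))
    return counts
```

For example, tokenize("Don't click http://example.com/a?b=1, it's spam!") gives ["don't", "click", "URL", "it's", "spam"]. The comma right after the URL is swallowed into the URL token because the pattern runs up to the next space, which is the same thing the regular expression in the query does.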