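I have a large set of comments from reddit. The comment strings are split into words, stripped of punctuation, and counted to show the most frequently used words on a particular subreddit: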
SELECT word, COUNT(*) as num_words
FROM(FLATTEN((
SELECT SPLIT(LOWER(REGEXP_REPLACE(body, r'[\.\",*:()\[\]/|\n]', ' ')), ' ') word
FROM [fh-bigquery:reddit_comments.2017_08]
WHERE subreddit="The_Donald"
), word))
GROUP EACH BY word
HAVING num_words >= 1000
ORDER BY num_words DESC
I have a list of stop words that I want to remove. How can I add that to the code? Thanks :)
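For reference, one possible direction is to filter the flattened words in the outer query before grouping. This is only a sketch in the same legacy SQL dialect as the query above; it assumes the stop-word list is short enough to inline as literals, and the three words shown are placeholders, not the actual list.

```sql
SELECT word, COUNT(*) AS num_words
FROM (FLATTEN((
  SELECT SPLIT(LOWER(REGEXP_REPLACE(body, r'[\.\",*:()\[\]/|\n]', ' ')), ' ') word
  FROM [fh-bigquery:reddit_comments.2017_08]
  WHERE subreddit = "The_Donald"
), word))
-- Placeholder stop words (assumption); replace with the real list.
WHERE word NOT IN ('the', 'a', 'and')
GROUP EACH BY word
HAVING num_words >= 1000
ORDER BY num_words DESC
```

If the list is long, it may be easier to keep it in its own table and filter with `WHERE word NOT IN (SELECT word FROM [my_dataset.stopwords])`, where `my_dataset.stopwords` is a hypothetical table name.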