I'm building a table of the number of overlapping commenters between subreddits on Reddit, using the following self-join:
SELECT t1.subreddit, t2.subreddit, COUNT(*) as NumOverlaps
FROM [fh-bigquery:reddit_comments.2015_05] t1
JOIN [fh-bigquery:reddit_comments.2015_05] t2
ON t1.author=t2.author
WHERE t1.subreddit<t2.subreddit
GROUP BY t1.subreddit, t2.subreddit;
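Typical queries I run against this dataset in BigQuery finish quickly (under 1 minute), but this one has been running for more than an hour and still has not finished. The data has 54,504,410 rows and 22 columns.
Am I missing an obvious speedup that would make this query run fast? Thanks!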
Try the following:
SELECT t1.subreddit, t2.subreddit, SUM(t1.cnt*t2.cnt) as NumOverlaps
FROM (SELECT subreddit, author, COUNT(1) as cnt
  FROM [fh-bigquery:reddit_comments.2015_05]
  GROUP BY subreddit, author HAVING cnt > 1) t1
JOIN (SELECT subreddit, author, COUNT(1) as cnt
  FROM [fh-bigquery:reddit_comments.2015_05]
  GROUP BY subreddit, author HAVING cnt > 1) t2
ON t1.author=t2.author
WHERE t1.subreddit<t2.subreddit
GROUP BY t1.subreddit, t2.subreddit
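It does two things.
First, it pre-aggregates the data to one row per (subreddit, author) pair, which avoids redundant joins.
Second, it eliminates "potential outliers": authors who have only one post in a subreddit. Of course, the second item depends on your use case, but most likely it should be fine, and it addresses the performance issue. If it is still slower than you expect, increase the threshold to 2 or more.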
Follow-up: ... 22,545,850,104 ... doesn't seem right... Should it be SUM(t1.cnt+t2.cnt)?
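Of course it is incorrect, but it is exactly the result you would get if you were able to run the query in question!
I was hoping you would catch this!
So I'm glad that fixing the "performance" issue let you see the logic issue in the original query!
So yes, 22,545,850,104 is clearly not the right number.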
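To make the inflation concrete, here is a minimal sketch on toy data. It uses Python's built-in sqlite3 as a stand-in for BigQuery, and the `comments` table and its rows are made up for illustration. It suggests that the raw self-join counts comment pairs, so an author with 3 comments in one subreddit and 2 in another contributes 3 × 2 = 6 rather than 1, and that `SUM(t1.cnt*t2.cnt)` over the pre-aggregated data reproduces that same number.

```python
import sqlite3

# Hypothetical toy data: alice comments 3 times in "aww" and 2 times in "pics";
# bob comments once in each subreddit.
rows = [("aww", "alice")] * 3 + [("pics", "alice")] * 2 + [("aww", "bob"), ("pics", "bob")]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE comments (subreddit TEXT, author TEXT)")
conn.executemany("INSERT INTO comments VALUES (?, ?)", rows)

# Same shape as the original query: a raw self-join on author.
raw_join = conn.execute("""
    SELECT COUNT(*)
    FROM comments t1 JOIN comments t2 ON t1.author = t2.author
    WHERE t1.subreddit < t2.subreddit
""").fetchone()[0]

# Pre-aggregated form that multiplies the per-author comment counts.
pre_agg = conn.execute("""
    SELECT SUM(t1.cnt * t2.cnt)
    FROM (SELECT subreddit, author, COUNT(1) AS cnt
          FROM comments GROUP BY subreddit, author) t1
    JOIN (SELECT subreddit, author, COUNT(1) AS cnt
          FROM comments GROUP BY subreddit, author) t2
    ON t1.author = t2.author
    WHERE t1.subreddit < t2.subreddit
""").fetchone()[0]

# What was actually wanted: the number of distinct overlapping authors.
distinct_authors = conn.execute("""
    SELECT COUNT(DISTINCT t1.author)
    FROM comments t1 JOIN comments t2 ON t1.author = t2.author
    WHERE t1.subreddit < t2.subreddit
""").fetchone()[0]

print(raw_join, pre_agg, distinct_authors)  # 7 7 2
```

The first two numbers agree (3 × 2 for alice plus 1 × 1 for bob), while only two authors actually overlap.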
So, instead of
SUM(t1.cnt*t2.cnt) as NumOverlaps
you should use a simple
SUM(1) as NumOverlaps
This gives you a result equivalent to using
EXACT_COUNT_DISTINCT(t1.author) as NumOverlaps
in your original query.
So, now try the following:
SELECT t1.subreddit, t2.subreddit, SUM(1) as NumOverlaps
FROM (SELECT subreddit, author, COUNT(1) as cnt
  FROM [fh-bigquery:reddit_comments.2015_05]
  GROUP BY subreddit, author HAVING cnt > 1) t1
JOIN (SELECT subreddit, author, COUNT(1) as cnt
  FROM [fh-bigquery:reddit_comments.2015_05]
  GROUP BY subreddit, author HAVING cnt > 1) t2
ON t1.author=t2.author
WHERE t1.subreddit<t2.subreddit
GROUP BY t1.subreddit, t2.subreddit
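As a rough check of the claimed equivalence, the sketch below compares `SUM(1)` over pre-aggregated data with a distinct-author count over the raw self-join, per subreddit pair. It again uses sqlite3 as a stand-in for BigQuery with a made-up `comments` table, writes `COUNT(DISTINCT ...)` where legacy BigQuery SQL would use `EXACT_COUNT_DISTINCT`, and leaves out the `HAVING cnt > 1` filter so the two results are directly comparable (with the filter, single-comment authors would be dropped from the pre-aggregated side).

```python
import sqlite3

# Hypothetical toy data: alice is active in two subreddits, bob in three.
rows = (
    [("aww", "alice")] * 3 + [("pics", "alice")] * 2
    + [("aww", "bob"), ("pics", "bob"), ("news", "bob")]
)

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE comments (subreddit TEXT, author TEXT)")
conn.executemany("INSERT INTO comments VALUES (?, ?)", rows)

# Distinct overlapping authors per pair, computed on the raw self-join.
distinct_raw = conn.execute("""
    SELECT t1.subreddit, t2.subreddit, COUNT(DISTINCT t1.author)
    FROM comments t1 JOIN comments t2 ON t1.author = t2.author
    WHERE t1.subreddit < t2.subreddit
    GROUP BY t1.subreddit, t2.subreddit
    ORDER BY t1.subreddit, t2.subreddit
""").fetchall()

# SUM(1) on data pre-aggregated to one row per (subreddit, author).
sum_one = conn.execute("""
    SELECT t1.subreddit, t2.subreddit, SUM(1)
    FROM (SELECT subreddit, author FROM comments GROUP BY subreddit, author) t1
    JOIN (SELECT subreddit, author FROM comments GROUP BY subreddit, author) t2
    ON t1.author = t2.author
    WHERE t1.subreddit < t2.subreddit
    GROUP BY t1.subreddit, t2.subreddit
    ORDER BY t1.subreddit, t2.subreddit
""").fetchall()

print(sum_one)  # [('aww', 'news', 1), ('aww', 'pics', 2), ('news', 'pics', 1)]
```

After pre-aggregation each author appears at most once per subreddit, so every joined row for a pair corresponds to exactly one shared author, which is why counting rows is enough.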