Self-join in BigQuery runs very slowly; am I following best practices?

Question

I am building a table of the number of overlapping commenters between subreddits with the following self-join:

SELECT t1.subreddit, t2.subreddit, COUNT(*) as NumOverlaps
FROM [fh-bigquery:reddit_comments.2015_05] t1
JOIN [fh-bigquery:reddit_comments.2015_05] t2
ON t1.author=t2.author
WHERE t1.subreddit<t2.subreddit
GROUP BY t1.subreddit, t2.subreddit;

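Typical queries I run against this dataset in BigQuery finish quickly (< 1 minute), but this query has been running for more than an hour and still has not finished. The data has 54,504,410 rows and 22 columns.

Am I missing an obvious speed-up that I should implement to make this query run quickly? Thanks!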

Answer 1

Mik*_*ant 5

Try the following:

SELECT t1.subreddit, t2.subreddit, SUM(t1.cnt*t2.cnt) as NumOverlaps
FROM (SELECT subreddit, author, COUNT(1) as cnt
      FROM [fh-bigquery:reddit_comments.2015_05]
      GROUP BY subreddit, author HAVING cnt > 1) t1
JOIN (SELECT subreddit, author, COUNT(1) as cnt
      FROM [fh-bigquery:reddit_comments.2015_05]
      GROUP BY subreddit, author HAVING cnt > 1) t2
ON t1.author=t2.author
WHERE t1.subreddit<t2.subreddit
GROUP BY t1.subreddit, t2.subreddit

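It does two things.
First, it pre-aggregates the data to avoid a redundant join.
Second, it eliminates "potential outliers": authors who have only one post in a subreddit. Of course, the second item depends on your use case, but most likely it should be fine and so should resolve the performance issue. If it is still slower than you expect, increase the threshold to 2 or more.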
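The effect of both steps on the size of the join can be sketched on a toy table. This is a minimal illustration, not the BigQuery query itself: it uses Python's built-in `sqlite3` with standard SQL instead of BigQuery legacy SQL, and the `comments` table, its rows, and the helper `aggregated_join_rows` are all hypothetical.

```python
import sqlite3

# Toy stand-in for the comments table (hypothetical data, not the real dataset).
rows = [
    ("alice", "aww"), ("alice", "aww"), ("alice", "aww"),
    ("alice", "pics"), ("alice", "pics"),
    ("bob", "aww"), ("bob", "pics"),
]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE comments (author TEXT, subreddit TEXT)")
conn.executemany("INSERT INTO comments VALUES (?, ?)", rows)

# Number of joined rows when the raw comments are self-joined on author.
raw_join_rows = conn.execute("""
    SELECT COUNT(*)
    FROM comments t1 JOIN comments t2 ON t1.author = t2.author
    WHERE t1.subreddit < t2.subreddit
""").fetchone()[0]


def aggregated_join_rows(threshold):
    """Joined rows after collapsing to one row per (subreddit, author)
    and keeping only groups with more than `threshold` comments."""
    return conn.execute("""
        WITH per_author AS (
            SELECT subreddit, author, COUNT(*) AS cnt
            FROM comments
            GROUP BY subreddit, author
            HAVING COUNT(*) > ?
        )
        SELECT COUNT(*)
        FROM per_author t1 JOIN per_author t2 ON t1.author = t2.author
        WHERE t1.subreddit < t2.subreddit
    """, (threshold,)).fetchone()[0]


agg_join_rows = aggregated_join_rows(0)       # pre-aggregation only
filtered_join_rows = aggregated_join_rows(1)  # plus the cnt > 1 filter

print(raw_join_rows, agg_join_rows, filtered_join_rows)
```

On this toy data the raw self-join produces 7 rows, the pre-aggregated join 2, and the filtered join 1. An author with many comments in two subreddits contributes the product of the two comment counts to the raw join, but a single row to the pre-aggregated one.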

Follow-up: ... 22,545,850,104 ... doesn't seem right... Should it be SUM(t1.cnt+t2.cnt)?

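Of course it is incorrect, but it is exactly what you would have gotten if you had been able to run the query in question!
I was hoping you would catch this!
So I am glad that fixing the "performance" issue let you see the logic issue in your original query!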
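The logic issue can be checked on a toy table: a raw self-join on author counts pairs of comments, which is the same number that `SUM(t1.cnt*t2.cnt)` yields on pre-aggregated data. The sketch below assumes Python's built-in `sqlite3` and standard SQL rather than BigQuery legacy SQL, uses a hypothetical `comments` table with made-up rows, and leaves out the `HAVING cnt > 1` filter so the two results match exactly.

```python
import sqlite3

# Hypothetical toy data: (author, subreddit), one tuple per comment.
rows = [
    ("alice", "aww"), ("alice", "aww"), ("alice", "aww"),
    ("alice", "pics"), ("alice", "pics"),
    ("bob", "aww"), ("bob", "pics"),
    ("carol", "aww"), ("carol", "aww"), ("carol", "funny"),
]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE comments (author TEXT, subreddit TEXT)")
conn.executemany("INSERT INTO comments VALUES (?, ?)", rows)

# Shape of the original query: COUNT(*) over the raw self-join.
raw_counts = conn.execute("""
    SELECT t1.subreddit, t2.subreddit, COUNT(*) AS NumOverlaps
    FROM comments t1 JOIN comments t2 ON t1.author = t2.author
    WHERE t1.subreddit < t2.subreddit
    GROUP BY t1.subreddit, t2.subreddit
    ORDER BY t1.subreddit, t2.subreddit
""").fetchall()

# Pre-aggregated version: SUM(cnt * cnt) reproduces the same numbers.
product_counts = conn.execute("""
    WITH per_author AS (
        SELECT subreddit, author, COUNT(*) AS cnt
        FROM comments
        GROUP BY subreddit, author
    )
    SELECT t1.subreddit, t2.subreddit, SUM(t1.cnt * t2.cnt) AS NumOverlaps
    FROM per_author t1 JOIN per_author t2 ON t1.author = t2.author
    WHERE t1.subreddit < t2.subreddit
    GROUP BY t1.subreddit, t2.subreddit
    ORDER BY t1.subreddit, t2.subreddit
""").fetchall()

print(raw_counts)
print(product_counts)
```

Here only two authors comment in both `aww` and `pics`, yet both queries report 7 for that pair, because alice's 3 x 2 comment pairs are all counted.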

So yes, 22,545,850,104 is obviously an incorrect number.
So, instead of

    SUM(t1.cnt*t2.cnt) as NumOverlaps

you should use the simple

    SUM(1) as NumOverlaps

This will give you a result equivalent to using

    EXACT_COUNT_DISTINCT(t1.author) as NumOverlaps

in your original query.
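A small check of this equivalence, as a sketch under stated assumptions: Python's built-in `sqlite3` stands in for BigQuery, `COUNT(DISTINCT ...)` stands in for `EXACT_COUNT_DISTINCT`, and the `comments` table and its rows are hypothetical. The `HAVING cnt > 1` filter is left out so that the two results match exactly; with the filter, authors with a single comment in a subreddit would be dropped from the pre-aggregated count.

```python
import sqlite3

# Hypothetical toy data: (author, subreddit), one tuple per comment.
rows = [
    ("alice", "aww"), ("alice", "aww"), ("alice", "aww"),
    ("alice", "pics"), ("alice", "pics"),
    ("bob", "aww"), ("bob", "pics"),
    ("carol", "aww"), ("carol", "aww"), ("carol", "funny"),
]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE comments (author TEXT, subreddit TEXT)")
conn.executemany("INSERT INTO comments VALUES (?, ?)", rows)

# Distinct authors per subreddit pair, computed on the raw self-join.
distinct_authors = conn.execute("""
    SELECT t1.subreddit, t2.subreddit, COUNT(DISTINCT t1.author) AS NumOverlaps
    FROM comments t1 JOIN comments t2 ON t1.author = t2.author
    WHERE t1.subreddit < t2.subreddit
    GROUP BY t1.subreddit, t2.subreddit
    ORDER BY t1.subreddit, t2.subreddit
""").fetchall()

# SUM(1) over one row per (subreddit, author): each shared author counts once.
sum_one = conn.execute("""
    WITH per_author AS (
        SELECT subreddit, author, COUNT(*) AS cnt
        FROM comments
        GROUP BY subreddit, author
    )
    SELECT t1.subreddit, t2.subreddit, SUM(1) AS NumOverlaps
    FROM per_author t1 JOIN per_author t2 ON t1.author = t2.author
    WHERE t1.subreddit < t2.subreddit
    GROUP BY t1.subreddit, t2.subreddit
    ORDER BY t1.subreddit, t2.subreddit
""").fetchall()

print(distinct_authors)
print(sum_one)
```

Because the pre-aggregated subquery holds exactly one row per (subreddit, author), each author shared by two subreddits contributes a single joined row, so summing 1 counts authors rather than comment pairs.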

So, now try the following:

SELECT t1.subreddit, t2.subreddit, SUM(1) as NumOverlaps
FROM (SELECT subreddit, author, COUNT(1) as cnt
      FROM [fh-bigquery:reddit_comments.2015_05]
      GROUP BY subreddit, author HAVING cnt > 1) t1
JOIN (SELECT subreddit, author, COUNT(1) as cnt
      FROM [fh-bigquery:reddit_comments.2015_05]
      GROUP BY subreddit, author HAVING cnt > 1) t2
ON t1.author=t2.author
WHERE t1.subreddit<t2.subreddit
GROUP BY t1.subreddit, t2.subreddit

Archived: 8 years, 10 months ago
Views: 4875
Last activity: 8 years, 10 months ago