Self-join in BigQuery runs very slowly; am I following best practices?

Tre*_* M. 4 google-bigquery

I am creating a table of the number of overlapping commenters between Reddit subreddits with the following self-join:

SELECT t1.subreddit, t2.subreddit, COUNT(*) as NumOverlaps
FROM [fh-bigquery:reddit_comments.2015_05] t1
JOIN [fh-bigquery:reddit_comments.2015_05] t2
ON t1.author=t2.author
WHERE t1.subreddit<t2.subreddit
GROUP BY t1.subreddit, t2.subreddit;

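Typical queries I run against this dataset in BigQuery finish quickly (< 1 minute), but this query has been running for over an hour and still has not finished. The data has 54,504,410 rows and 22 columns.

Am I missing an obvious speedup that I should implement to make this query run fast? Thanks!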

Mik*_*ant 5

Try the following:

SELECT t1.subreddit, t2.subreddit, SUM(t1.cnt*t2.cnt) as NumOverlaps
FROM (SELECT subreddit, author, COUNT(1) as cnt
      FROM [fh-bigquery:reddit_comments.2015_05]
      GROUP BY subreddit, author HAVING cnt > 1) t1
JOIN (SELECT subreddit, author, COUNT(1) as cnt
      FROM [fh-bigquery:reddit_comments.2015_05]
      GROUP BY subreddit, author HAVING cnt > 1) t2
ON t1.author=t2.author
WHERE t1.subreddit<t2.subreddit
GROUP BY t1.subreddit, t2.subreddit

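It does two things.
First, it pre-aggregates the data to avoid redundant joins.
Second, it eliminates "potential outliers": authors who have only one post in a subreddit. Of course, the second item depends on your use case. But most likely it should be fine, and so it solves the performance problem. If it is still slower than you expect, increase the threshold to 2 or more.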
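To make the pre-aggregation step concrete, here is a minimal sketch on toy data. It uses Python's built-in sqlite3 as a stand-in for BigQuery, with a hypothetical `comments` table and made-up rows, and it leaves out the `HAVING cnt > 1` filter so that the two query shapes can be compared directly. Under those assumptions, the pre-aggregated join with `SUM(t1.cnt*t2.cnt)` should return the same number as `COUNT(*)` over the raw self-join, while joining far fewer rows.

```python
import sqlite3

# Hypothetical toy data: one (author, subreddit) row per comment.
rows = [
    ("alice", "aww"), ("alice", "aww"),
    ("alice", "pics"), ("alice", "pics"), ("alice", "pics"),
    ("bob", "aww"), ("bob", "pics"),
]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE comments (author TEXT, subreddit TEXT)")
conn.executemany("INSERT INTO comments VALUES (?, ?)", rows)

# Shape of the original query: join every comment to every other comment
# by the same author, then count the joined rows.
raw_join = conn.execute("""
    SELECT t1.subreddit, t2.subreddit, COUNT(*)
    FROM comments t1
    JOIN comments t2 ON t1.author = t2.author
    WHERE t1.subreddit < t2.subreddit
    GROUP BY t1.subreddit, t2.subreddit
""").fetchall()

# Shape of the suggested query (without the HAVING filter): collapse to one
# row per (subreddit, author) first, then join the much smaller inputs.
pre_aggregated = conn.execute("""
    SELECT t1.subreddit, t2.subreddit, SUM(t1.cnt * t2.cnt)
    FROM (SELECT subreddit, author, COUNT(1) AS cnt
          FROM comments GROUP BY subreddit, author) t1
    JOIN (SELECT subreddit, author, COUNT(1) AS cnt
          FROM comments GROUP BY subreddit, author) t2
    ON t1.author = t2.author
    WHERE t1.subreddit < t2.subreddit
    GROUP BY t1.subreddit, t2.subreddit
""").fetchall()

print(raw_join)
print(pre_aggregated)
```

In this toy case alice contributes 2 x 3 joined rows and bob contributes 1 x 1, so both queries report 7 for the (aww, pics) pair. The raw join has to produce all 7 rows, whereas the pre-aggregated join produces only 2.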


Follow-up: ... 22,545,850,104 ... doesn't seem right... Should it be SUM(t1.cnt+t2.cnt)?


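Of course it is incorrect, but it is exactly what you would have gotten if you had been able to run the query in question!
I was hoping you would catch this!
So I am glad that fixing the "performance" issue let you see the logic issue in your original query!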


So yes, 22,545,850,104 is obviously an incorrect number.
So, instead of

    SUM(t1.cnt*t2.cnt) as NumOverlaps

you should use a simple

    SUM(1) as NumOverlaps

This will give you a result equivalent to using

    EXACT_COUNT_DISTINCT(t1.author) as NumOverlaps

in your original query.


So, now try the following:

SELECT t1.subreddit, t2.subreddit, SUM(1) as NumOverlaps
FROM (SELECT subreddit, author, COUNT(1) as cnt
      FROM [fh-bigquery:reddit_comments.2015_05]
      GROUP BY subreddit, author HAVING cnt > 1) t1
JOIN (SELECT subreddit, author, COUNT(1) as cnt
      FROM [fh-bigquery:reddit_comments.2015_05]
      GROUP BY subreddit, author HAVING cnt > 1) t2
ON t1.author=t2.author
WHERE t1.subreddit<t2.subreddit
GROUP BY t1.subreddit, t2.subreddit
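As a rough check of the claim that `SUM(1)` over the pre-aggregated inputs matches a distinct-author count in the original join, here is a small sketch on toy data. It again uses Python's sqlite3 rather than BigQuery, with a hypothetical `comments` table and made-up rows. Standard `COUNT(DISTINCT ...)` stands in for legacy `EXACT_COUNT_DISTINCT`, and `HAVING COUNT(1) > 1` stands in for `HAVING cnt > 1`. The last query adds the filter to show how it changes the count.

```python
import sqlite3

# Hypothetical toy data: one (author, subreddit) row per comment.
rows = [
    ("alice", "aww"), ("alice", "aww"),
    ("alice", "pics"), ("alice", "pics"), ("alice", "pics"),
    ("bob", "aww"), ("bob", "pics"),
]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE comments (author TEXT, subreddit TEXT)")
conn.executemany("INSERT INTO comments VALUES (?, ?)", rows)

# Raw self-join, counting distinct authors instead of joined rows.
distinct_authors = conn.execute("""
    SELECT t1.subreddit, t2.subreddit, COUNT(DISTINCT t1.author)
    FROM comments t1
    JOIN comments t2 ON t1.author = t2.author
    WHERE t1.subreddit < t2.subreddit
    GROUP BY t1.subreddit, t2.subreddit
""").fetchall()


def overlaps(having_clause=""):
    """Pre-aggregate per (subreddit, author), then count one per shared author."""
    sub = f"""SELECT subreddit, author, COUNT(1) AS cnt
              FROM comments GROUP BY subreddit, author {having_clause}"""
    return conn.execute(f"""
        SELECT t1.subreddit, t2.subreddit, SUM(1)
        FROM ({sub}) t1
        JOIN ({sub}) t2 ON t1.author = t2.author
        WHERE t1.subreddit < t2.subreddit
        GROUP BY t1.subreddit, t2.subreddit
    """).fetchall()


sum_one = overlaps()                            # no outlier filter
sum_one_filtered = overlaps("HAVING COUNT(1) > 1")  # drops single-comment authors

print(distinct_authors, sum_one, sum_one_filtered)
```

On this data both the distinct-author count and the unfiltered `SUM(1)` report 2 shared authors for (aww, pics). With the filter applied the count drops to 1, because bob has only one comment in each subreddit. So the equivalence holds exactly only without the outlier filter, and whether that difference matters depends on the use case.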