Efficient GROUP BY a CASE expression in Amazon Redshift/PostgreSQL

Question

Sim*_*Sim 5 sql postgresql group-by paraccel amazon-redshift

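In analytics processing there is often a need to collapse "unimportant" groups of data into a single row in the result table. One way to do this is to GROUP BY a CASE expression, where the CASE expression returns a single value (e.g., NULL) for the unimportant groups so that they coalesce into one row. This question is about efficient ways to perform this grouping in Amazon Redshift, which is based on ParAccel and is close to PostgreSQL 8.0 in terms of functionality.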

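As an example, consider a GROUP BY on type and url in a table where each row is a single URL visit. The goal is to aggregate so that one row is emitted for every (type, url) pair whose visit count reaches a certain threshold, and one (type, NULL) row is emitted per type for all (type, url) pairs whose visit count is below that threshold. The remaining columns in the result table would hold SUM/COUNT aggregates based on this grouping.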

For example, the following data

+------+----------------------+-----------------------+
| type | url                  | < 50+ other columns > |
+------+----------------------+-----------------------+
|  A   | http://popular.com   |                       |
|  A   | http://popular.com   |                       |
|  A   | < 9997 more times >  |                       |
|  A   | http://popular.com   |                       |
|  A   | http://small-one.com |                       |
|  B   | http://tiny.com      |                       |
|  B   | http://tiny-too.com  |                       |
+------+----------------------+-----------------------+


should produce the following result table with a threshold of 10,000

+------+----------------------+-------------+--------------------------+
| type | url                  | visit_count | < SUM/COUNT aggregates > |
+------+----------------------+-------------+--------------------------+
|  A   | http://popular.com   |       10000 |                          |
|  A   |                      |           1 |                          |
|  B   |                      |           2 |                          |
+------+----------------------+-------------+--------------------------+


Summary:

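Amazon Redshift has certain subquery correlation limitations that one needs to work around carefully. Gordon Linoff's answer below (the accepted answer) shows how to GROUP BY a CASE expression using double aggregation, repeating the expression in both the result column and the outer GROUP BY clause.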

with temp_counts as (SELECT type, url, COUNT(*) as cnt FROM t GROUP BY type, url)
select type, (case when cnt >= 10000 then url end) as url, sum(cnt) as cnt
from temp_counts
group by type, (case when cnt >= 10000 then url end)

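To make the double aggregation concrete, here is a minimal sketch that rebuilds the sample data from the question and runs the same query shape. It assumes an in-memory SQLite database driven from Python purely for illustration (the question targets Redshift, whose planner behaves differently); the `THRESHOLD` name, the bound `:threshold` parameter, and the added ORDER BY are assumptions that are not part of the original query.

```python
import sqlite3

# Toy reproduction of the question's sample data in an in-memory SQLite
# database (an assumption for illustration; the question targets Redshift).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (type TEXT, url TEXT)")
rows = (
    [("A", "http://popular.com")] * 10000
    + [("A", "http://small-one.com")]
    + [("B", "http://tiny.com"), ("B", "http://tiny-too.com")]
)
conn.executemany("INSERT INTO t VALUES (?, ?)", rows)

THRESHOLD = 10000  # pairs with at least this many visits keep their own row

# Inner aggregation counts visits per (type, url); the outer aggregation
# groups by the CASE expression, which maps every small pair to NULL.
double_aggregation = """
WITH temp_counts AS (
    SELECT type, url, COUNT(*) AS cnt FROM t GROUP BY type, url
)
SELECT type,
       (CASE WHEN cnt >= :threshold THEN url END) AS url,
       SUM(cnt) AS visit_count
FROM temp_counts
GROUP BY type, (CASE WHEN cnt >= :threshold THEN url END)
ORDER BY type, visit_count DESC
"""

result = conn.execute(double_aggregation, {"threshold": THRESHOLD}).fetchall()
for row in result:
    print(row)
```

On this toy data the query yields the three rows of the expected result table: one row for http://popular.com with 10000 visits, and one NULL-url row each for type A (1 visit) and type B (2 visits).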

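Further testing showed that the double aggregation can be "unrolled" into a UNION ALL of independent queries, one for each branch of the CASE expression. In this particular case, on a sample data set of roughly 200M rows, that approach was consistently about 30% faster. However, that result is schema- and data-specific.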

with temp_counts as (SELECT type, url, COUNT(*) as cnt FROM t GROUP BY type, url)
select * from temp_counts WHERE cnt >= 10000
UNION ALL
SELECT type, NULL as url, SUM(cnt) as cnt from temp_counts 
WHERE cnt < 10000 
GROUP BY type


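This suggests two general patterns for implementing and optimizing arbitrary disjoint grouping and summarization in Amazon Redshift. If performance matters to you, benchmark both.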
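Before benchmarking, it is worth confirming that the two patterns return the same rows. The following is a rough sketch of such a check, again assuming an in-memory SQLite database and the toy data from the question; the `PATTERNS` dictionary and the `run` helper are hypothetical names introduced here. The timings it prints say nothing about Redshift and are only there to show the shape of a comparison harness.

```python
import sqlite3
import time

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (type TEXT, url TEXT)")
conn.executemany(
    "INSERT INTO t VALUES (?, ?)",
    [("A", "http://popular.com")] * 10000
    + [("A", "http://small-one.com"),
       ("B", "http://tiny.com"),
       ("B", "http://tiny-too.com")],
)

# Shared first aggregation: visit counts per (type, url).
COUNTS = (
    "WITH temp_counts AS "
    "(SELECT type, url, COUNT(*) AS cnt FROM t GROUP BY type, url) "
)

PATTERNS = {
    "double aggregation": COUNTS + """
        SELECT type, (CASE WHEN cnt >= 10000 THEN url END) AS url, SUM(cnt) AS cnt
        FROM temp_counts
        GROUP BY type, (CASE WHEN cnt >= 10000 THEN url END)""",
    "union all": COUNTS + """
        SELECT * FROM temp_counts WHERE cnt >= 10000
        UNION ALL
        SELECT type, NULL AS url, SUM(cnt) AS cnt FROM temp_counts
        WHERE cnt < 10000
        GROUP BY type""",
}

def run(sql):
    """Return (sorted rows, elapsed seconds) for one query."""
    start = time.perf_counter()
    rows = conn.execute(sql).fetchall()
    elapsed = time.perf_counter() - start
    # Sort in Python so the comparison ignores row order; NULL urls sort first.
    return sorted(rows, key=lambda r: (r[0], r[1] or "", r[2])), elapsed

results = {name: run(sql) for name, sql in PATTERNS.items()}
for name, (rows, elapsed) in results.items():
    print(f"{name}: {len(rows)} rows in {elapsed:.4f}s")

same_rows = results["double aggregation"][0] == results["union all"][0]
print("identical results:", same_rows)
```

On the real cluster, the same idea applies: run both statements against the production-sized table, compare the result sets once, and then time several repetitions of each.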

Answer 1

Gor*_*off 3

You can do this with two aggregations:

select type, (case when cnt > XXX then url end) as url, sum(cnt) as visit_cnt
from (select type, url, count(*) as cnt
      from t
      group by type, url
     ) t
group by type, (case when cnt > XXX then url end)
order by type, sum(cnt) desc;

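One detail to watch when filling in XXX: this query compares with a strict `>`, while the summary in the question uses `>=`. A pair whose count equals the threshold exactly (as http://popular.com does in the sample data) is collapsed under `>` and kept under `>=`. The sketch below illustrates the difference on hypothetical miniature data with a threshold of 3, assuming SQLite via Python; the `collapse` helper and the `counts` alias are names made up for this example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (type TEXT, url TEXT)")
# Hypothetical miniature data: one URL with exactly 3 visits, one with 1.
conn.executemany(
    "INSERT INTO t VALUES (?, ?)",
    [("A", "http://popular.com")] * 3 + [("A", "http://small-one.com")],
)

def collapse(comparison, threshold):
    """Run the two-aggregation query with the given comparison operator."""
    if comparison not in (">", ">="):
        raise ValueError("comparison must be '>' or '>='")
    sql = f"""
        SELECT type, (CASE WHEN cnt {comparison} :x THEN url END) AS url,
               SUM(cnt) AS visit_cnt
        FROM (SELECT type, url, COUNT(*) AS cnt FROM t GROUP BY type, url) counts
        GROUP BY type, (CASE WHEN cnt {comparison} :x THEN url END)
        ORDER BY type, SUM(cnt) DESC
    """
    return conn.execute(sql, {"x": threshold}).fetchall()

strict = collapse(">", 3)      # the URL with exactly 3 visits is folded in
inclusive = collapse(">=", 3)  # the URL with exactly 3 visits keeps its row
print(strict)
print(inclusive)
```

With the strict comparison everything lands in a single (A, NULL) row with 4 visits; with the inclusive one, http://popular.com keeps its own row with 3 visits and the NULL row carries the remaining 1.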

Archived:	12 years ago
Views:	11522
Last activity:	12 years ago