Redshift: count distinct customers over a window partition

Question

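Redshift does not support DISTINCT aggregates in its window functions. The AWS documentation for COUNT states this, and DISTINCT is not supported for any of the window functions.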

My use case: count customers over varying time intervals and traffic channels

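I want monthly and year-to-date unique customer counts for the current year, split by traffic channel as well as totalled across all channels. Because a customer can visit more than once, I need to count only distinct customers, so the Redshift window aggregates won't help.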

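I can count distinct customers using count(distinct customer_id)...group by, but that gives me only one of the four results I need.
I don't want to get into the habit of running a full query for each desired count and stacking them together with a pile of union all. I hope that is not the only solution.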

This is what I would write in postgres (or Oracle):

select order_month
       , traffic_channel
       , count(distinct customer_id) over(partition by order_month, traffic_channel) as customers_by_channel_and_month
       , count(distinct customer_id) over(partition by traffic_channel) as ytd_customers_by_channel
       , count(distinct customer_id) over(partition by order_month) as monthly_customers_all_channels
       , count(distinct customer_id) over() as ytd_total_customers

from orders_traffic_channels
/* otc is a table of dated transactions of customers, channels, and month of order */

where to_char(order_month, 'YYYY') = '2017'

How can I solve this in Redshift?

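The result needs to work on a Redshift cluster. This is also a simplified problem: the actual desired result includes product category and customer type, which multiplies the number of partitions needed. A stack of union all rollups is therefore not a good solution.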

Answer 1

Mer*_*lin 10

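A blog post from 2016 calls out this problem and provides a rudimentary workaround, so thank you, Mark D. Adams. Strangely, I could not find anything else on the whole web, so I am sharing my (tested) solution.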

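The key insight is that dense_rank(), ordered by the item in question, gives the same rank to identical items, so the highest rank is also the count of unique items. It becomes a horrible mess if you try to swap in the following for each partition I want: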

dense_rank() over(partition by order_month, traffic_channel order by customer_id)

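Since you need the highest rank, you have to wrap everything in a subquery and select the max value of each ranking. It is important to match the partitions in the outer query to the corresponding partitions in the subquery.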

/* multigrain windowed distinct count, additional grains are one dense_rank and one max over() */
select distinct
       order_month
       , traffic_channel
       , max(tc_mth_rnk) over(partition by order_month, traffic_channel) customers_by_channel_and_month
       , max(tc_rnk) over(partition by traffic_channel)  ytd_customers_by_channel
       , max(mth_rnk) over(partition by order_month)  monthly_customers_all_channels
       , max(cust_rnk) over()  ytd_total_customers

from (
       select order_month
              , traffic_channel
              , dense_rank() over(partition by order_month, traffic_channel order by customer_id)  tc_mth_rnk
              , dense_rank() over(partition by traffic_channel order by customer_id)  tc_rnk
              , dense_rank() over(partition by order_month order by customer_id)  mth_rnk
              , dense_rank() over(order by customer_id)  cust_rnk

       from orders_traffic_channels

       where to_char(order_month, 'YYYY') = '2017'
     )

order by order_month, traffic_channel
;
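As a sanity check of the pattern above, here is a minimal sketch that runs the same query shape against a tiny in-memory table and compares it with distinct counts computed directly. It assumes SQLite (3.25 or later, for window functions) as a stand-in for Redshift, invents a handful of sample rows, stores order_month as text, and drops the to_char year filter, which SQLite does not have. It is not a Redshift test.

```python
import sqlite3

# Hypothetical sample rows: (order_month, traffic_channel, customer_id).
rows = [
    ("2017-01", "email", 1),
    ("2017-01", "email", 1),   # repeat visit, must be counted once
    ("2017-01", "email", 2),
    ("2017-01", "search", 2),
    ("2017-02", "email", 1),
    ("2017-02", "search", 3),
    ("2017-02", "search", 4),
]

conn = sqlite3.connect(":memory:")
conn.execute(
    "create table orders_traffic_channels "
    "(order_month text, traffic_channel text, customer_id integer)"
)
conn.executemany("insert into orders_traffic_channels values (?, ?, ?)", rows)

# Same structure as the answer: one dense_rank per grain in the subquery,
# one max() over the matching partition in the outer query.
windowed = conn.execute("""
    select distinct
           order_month
         , traffic_channel
         , max(tc_mth_rnk) over (partition by order_month, traffic_channel)
         , max(tc_rnk)     over (partition by traffic_channel)
         , max(mth_rnk)    over (partition by order_month)
         , max(cust_rnk)   over ()
    from (
        select order_month
             , traffic_channel
             , dense_rank() over (partition by order_month, traffic_channel
                                  order by customer_id) as tc_mth_rnk
             , dense_rank() over (partition by traffic_channel
                                  order by customer_id) as tc_rnk
             , dense_rank() over (partition by order_month
                                  order by customer_id) as mth_rnk
             , dense_rank() over (order by customer_id) as cust_rnk
        from orders_traffic_channels
    ) ranked
    order by order_month, traffic_channel
""").fetchall()


def distinct_customers(month=None, channel=None):
    """Reference count: distinct customer_ids matching the given grain."""
    return len({
        c for m, ch, c in rows
        if (month is None or m == month) and (channel is None or ch == channel)
    })


expected = [
    (m, ch,
     distinct_customers(m, ch),
     distinct_customers(channel=ch),
     distinct_customers(month=m),
     distinct_customers())
    for m, ch in sorted({(m, ch) for m, ch, _ in rows})
]

print(windowed == expected)
for row in windowed:
    print(row)
```

Each additional grain (product category, customer type, and so on) would add one dense_rank() line and one max() line rather than another full query.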

Notes

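The partitions of max() and dense_rank() must match.
dense_rank() will rank null values (all at the same rank, the maximum). If you do not want to count nulls, you need a case when customer_id is not null then dense_rank() ...etc..., or you can subtract one from the max() if you know there are nulls.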
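The NULL note can be illustrated with a small sketch, again assuming SQLite (3.25 or later) as a stand-in for Redshift and using invented rows. It uses a variation on the "subtract one" idea: instead of subtracting unconditionally, it subtracts a per-partition flag that is 1 only when the partition actually contains a NULL, so partitions without NULLs are left alone. That variation is an assumption of this sketch, not part of the answer.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "create table orders_traffic_channels "
    "(traffic_channel text, customer_id integer)"
)
# Hypothetical rows: the email channel has one NULL customer_id.
conn.executemany(
    "insert into orders_traffic_channels values (?, ?)",
    [("email", 1), ("email", 1), ("email", 2), ("email", None), ("search", 3)],
)

null_check = conn.execute("""
    select distinct
           traffic_channel
         , max(rnk) over (partition by traffic_channel) as counting_null
         , max(rnk) over (partition by traffic_channel)
           - max(case when customer_id is null then 1 else 0 end)
               over (partition by traffic_channel) as ignoring_null
    from (
        select traffic_channel
             , customer_id
             , dense_rank() over (partition by traffic_channel
                                  order by customer_id) as rnk
        from orders_traffic_channels
    ) ranked
    order by traffic_channel
""").fetchall()

# count(distinct ...) ignores NULLs, so it is the reference for ignoring_null.
reference = conn.execute("""
    select traffic_channel, count(distinct customer_id)
    from orders_traffic_channels
    group by traffic_channel
    order by traffic_channel
""").fetchall()

print(null_check)
print(reference)
```

Because the NULLs share a single rank, they inflate the maximum by exactly one wherever they sort, which is why subtracting the flag works regardless of NULL ordering.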

Answer 2

alb*_*lin 5

While Redshift doesn't support DISTINCT aggregates in its window functions, it does have a listaggdistinct function. So you can do this:

regexp_count(
   listaggdistinct(customer_id, ',') over (partition by field2), 
   ','
) + 1

Of course, if , occurs naturally in your customer_id strings, you will have to find a safe delimiter.

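I suspect this will run into problems when counting 10MM records from a table. (2 upvotes)
If the result set is larger than the maximum VARCHAR size (64K – 1, or 65535), LISTAGG returns the following error: Invalid operation: Result size exceeds LISTAGG limit. [See the documentation](https://docs.amazonaws.cn/en_us/redshift/latest/dg/r_WF_LISTAGG.html) (2 upvotes)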
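To put the two caveats above into numbers, here is a rough back-of-the-envelope sketch in plain Python. It does not call Redshift. The helper names are hypothetical, the 65535 figure is the one quoted in the comment, and treating the limit as a simple byte count of the joined string is an assumption.

```python
LISTAGG_LIMIT_BYTES = 65535  # limit quoted in the comment above


def listagg_distinct_size(ids, delimiter=","):
    """Estimated byte size of the distinct, non-null ids joined by delimiter."""
    distinct_ids = {str(i) for i in ids if i is not None}
    if not distinct_ids:
        return 0
    payload = sum(len(s.encode("utf-8")) for s in distinct_ids)
    separators = len(delimiter.encode("utf-8")) * (len(distinct_ids) - 1)
    return payload + separators


def count_by_delimiter(ids, delimiter=","):
    """Mimics the 'number of delimiters + 1' arithmetic from the answer."""
    joined = delimiter.join(sorted({str(i) for i in ids}))
    return joined.count(delimiter) + 1


# With 8-digit ids, one partition holds only a few thousand distinct customers.
fits = listagg_distinct_size(range(10_000_000, 10_000_000 + 7281))
too_big = listagg_distinct_size(range(10_000_000, 10_000_000 + 7282))
print(fits, fits <= LISTAGG_LIMIT_BYTES)
print(too_big, too_big <= LISTAGG_LIMIT_BYTES)

# A delimiter that also appears inside an id inflates the count.
print(count_by_delimiter(["a,b", "c"]))                 # comma is unsafe here
print(count_by_delimiter(["a,b", "c"], delimiter="|"))  # a safe delimiter
```

Under these assumptions, a partition with more than roughly seven thousand distinct 8-digit ids would already exceed the limit, which supports the doubt raised about tables with millions of records.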
