Cumulative distinct count with Spark SQL

Tho*_*aux 5 sql apache-spark apache-spark-sql

Using Spark 1.6.2.

Here is the data:

day | visitorID
-------------
1   | A
1   | B
2   | A
2   | C
3   | A
4   | A

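I want to count, for each day, how many distinct visitors there have been on that day and all previous days combined (a cumulative count; I don't know the exact term, sorry).

This should give: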

day | visitors
--------------
 1  | 2 (A+B)
 2  | 3 (A+B+C)
 3  | 3 
 4  | 3
  • I tried a self-join, but it is really too slow
  • I am sure a window function is what I'm looking for, but I couldn't find the right one :/

Gor*_*off 4

You should be able to do the following:

select day, max(visitors) as visitors
from (select day,
             count(distinct visitorId) over (order by day) as visitors
      from t
     ) d
group by day;
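
To make the target result concrete, here is a minimal Python sketch of what the query above is meant to compute: a running set of visitors, sized once per day. The `rows` list and the `cumulative_distinct` helper are just illustrative names, not anything from Spark. Also worth noting: as far as I know, Spark SQL (like several other engines) rejects DISTINCT inside a window aggregate, so the query may fail as written; treat that as something to check on your version.

```python
from collections import defaultdict

# Sample data from the question: (day, visitorID)
rows = [(1, "A"), (1, "B"), (2, "A"), (2, "C"), (3, "A"), (4, "A")]

def cumulative_distinct(rows):
    """Return [(day, distinct visitors seen up to and including that day)]."""
    by_day = defaultdict(set)
    for day, visitor in rows:
        by_day[day].add(visitor)

    seen = set()          # every visitor seen so far
    result = []
    for day in sorted(by_day):
        seen |= by_day[day]            # add this day's visitors
        result.append((day, len(seen)))
    return result

print(cumulative_distinct(rows))
```

This is only a reference for the semantics on small data; it is not meant to replace the SQL.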

Actually, I think a better approach is to record each visitor only on the first day he/she appears:

select startday, sum(count(*)) over (order by startday) as visitors
from (select visitorId, min(day) as startday
      from t
      group by visitorId
     ) t
group by startday
order by startday;
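
One thing to watch with the first-day approach: a day on which no new visitor appears (days 3 and 4 in the sample) has no `startday` row, so it would be missing from the output. A possible fix, sketched below, is to left join the list of distinct days against the per-day count of new visitors and take the running sum over that. The SQL is run through Python's `sqlite3` purely as a stand-in so the example is self-contained (it assumes an SQLite build with window functions, 3.25 or later); I have not checked it against Spark 1.6.2, and the aliases `d`, `s`, `f` and `new_visitors` are my own.

```python
import sqlite3

# In-memory table with the sample data from the question
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (day INTEGER, visitorId TEXT)")
conn.executemany(
    "INSERT INTO t VALUES (?, ?)",
    [(1, "A"), (1, "B"), (2, "A"), (2, "C"), (3, "A"), (4, "A")],
)

QUERY = """
SELECT d.day,
       SUM(COALESCE(f.new_visitors, 0)) OVER (ORDER BY d.day) AS visitors
FROM (SELECT DISTINCT day FROM t) AS d          -- every day present in the data
LEFT JOIN (
    SELECT startday, COUNT(*) AS new_visitors   -- first-time visitors per day
    FROM (SELECT visitorId, MIN(day) AS startday
          FROM t
          GROUP BY visitorId) AS s
    GROUP BY startday
) AS f ON f.startday = d.day
ORDER BY d.day
"""

result = conn.execute(QUERY).fetchall()
print(result)
```

On the sample data this yields one row per day, including days 3 and 4, matching the table in the question. Days that do not occur in `t` at all would still be absent; covering those would need a separate calendar table.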