Cumulative distinct count

use*_*744 7 sql presto

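I am writing a query to get the cumulative distinct count of uids for each day.

Example: suppose 2 uids (100, 200) appear on 2016-11-01, and they appear again the next day, 2016-11-02, together with a new uid 300 (100, 200, 300). At that point I want the cumulative count to be 3, not 5, because uids 100 and 200 already appeared on the previous day.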

Input table:

date                uid
2016-11-01          100
2016-11-01          200
2016-11-01          300
2016-11-01          400
2016-11-02          100
2016-11-02          200
2016-11-03          300
2016-11-03          400
2016-11-03          500
2016-11-03          600
2016-11-04          700

Expected query result:

date            daily_cumulative_count
2016-11-01              4   
2016-11-02              4
2016-11-03              6
2016-11-04              7

So far I can get a running sum of the daily distinct counts, but it counts again the uids that already appeared on earlier days.

SELECT 
  date, 
  SUM(count) OVER (
    ORDER BY date ASC 
    ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW
  )
FROM (
  SELECT 
    date, 
    COUNT(DISTINCT uid) AS count
  FROM sample_table
  GROUP BY 1
)
ORDER BY date DESC;

Any help would be greatly appreciated.

cak*_*aww 15

WITH firstseen AS (
  SELECT uid, MIN(date) date
  FROM sample_table
  GROUP BY 1
)
SELECT DISTINCT date, COUNT(uid) OVER (ORDER BY date) daily_cumulative_count 
FROM firstseen
ORDER BY 1

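SELECT DISTINCT is used because the pair (date, COUNT(uid)) would otherwise be repeated once per uid.

Explanation: for each date dt, it counts the uids first seen between the earliest date and dt, because we specify ORDER BY date and the window frame then defaults to RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW.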
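One thing worth checking against the sample data: a date on which no uid appears for the first time (2016-11-02 here) has no row in firstseen, so it drops out of the result. The sketch below runs the query above on the question's rows and then a variant that keeps every date. SQLite (via Python's sqlite3, assuming a build with window function support, 3.25 or later) is only a stand-in for Presto here, and the names first_date, new_per_day and days are mine, not from the question.

```python
import sqlite3

# Sample rows from the question.
rows = [
    ("2016-11-01", 100), ("2016-11-01", 200), ("2016-11-01", 300),
    ("2016-11-01", 400), ("2016-11-02", 100), ("2016-11-02", 200),
    ("2016-11-03", 300), ("2016-11-03", 400), ("2016-11-03", 500),
    ("2016-11-03", 600), ("2016-11-04", 700),
]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sample_table (date TEXT, uid INTEGER)")
conn.executemany("INSERT INTO sample_table VALUES (?, ?)", rows)

# The query from the answer, with GROUP BY / ORDER BY spelled out by name.
answer_sql = """
WITH firstseen AS (
  SELECT uid, MIN(date) AS date
  FROM sample_table
  GROUP BY uid
)
SELECT DISTINCT date, COUNT(uid) OVER (ORDER BY date) AS daily_cumulative_count
FROM firstseen
ORDER BY date
"""

# Variant (an assumption about what is wanted): keep every date present in
# the table by left-joining the per-day count of new uids onto all dates.
all_days_sql = """
WITH firstseen AS (
  SELECT uid, MIN(date) AS first_date
  FROM sample_table
  GROUP BY uid
),
new_per_day AS (
  SELECT first_date, COUNT(*) AS new_uids
  FROM firstseen
  GROUP BY first_date
),
days AS (
  SELECT DISTINCT date FROM sample_table
)
SELECT d.date,
       SUM(COALESCE(n.new_uids, 0)) OVER (ORDER BY d.date) AS daily_cumulative_count
FROM days d
LEFT JOIN new_per_day n ON n.first_date = d.date
ORDER BY d.date
"""

as_written = conn.execute(answer_sql).fetchall()
with_all_days = conn.execute(all_days_sql).fetchall()

print(as_written)      # 2016-11-02 is absent
print(with_all_days)   # one row per date in the table
```

If days without new uids do not need to be reported, the shorter query is enough. The variant still only covers dates that occur somewhere in the table; filling calendar gaps would need a separate list of dates.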


Vam*_*ala 8

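You can use exists to check whether the uid already appears on any earlier date. Then take a running sum of the first appearances and find the maximum for each date, which gives you the daily cumulative distinct count.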

select dt, max(col) as daily_cumulative_count
from (select t1.*, 
      sum(case when not exists (select 1 from t t0 where t0.dt < t1.dt and t0.uid = t1.uid) then 1 else 0 end) over(order by dt) col
      from t t1) x 
group by dt
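To make the steps easier to follow, here is a staged sketch of the same idea: flag each row that is a uid's first appearance, take the running sum of the flags, then keep the maximum per date. It assumes a table t with columns dt and uid, and uses SQLite through Python's sqlite3 (3.25 or later for window functions) purely as a stand-in for Presto. The alias t0 and the column is_first are illustrative names.

```python
import sqlite3

# Sample rows from the question, loaded into a table named t (dt, uid).
rows = [
    ("2016-11-01", 100), ("2016-11-01", 200), ("2016-11-01", 300),
    ("2016-11-01", 400), ("2016-11-02", 100), ("2016-11-02", 200),
    ("2016-11-03", 300), ("2016-11-03", 400), ("2016-11-03", 500),
    ("2016-11-03", 600), ("2016-11-04", 700),
]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (dt TEXT, uid INTEGER)")
conn.executemany("INSERT INTO t VALUES (?, ?)", rows)

staged_sql = """
SELECT dt, MAX(col) AS daily_cumulative_count
FROM (
  -- step 2: running sum of first appearances, ordered by date
  SELECT dt, SUM(is_first) OVER (ORDER BY dt) AS col
  FROM (
    -- step 1: 1 if no earlier row has the same uid, else 0
    SELECT t1.dt,
           CASE WHEN NOT EXISTS (
                  SELECT 1 FROM t t0
                  WHERE t0.dt < t1.dt AND t0.uid = t1.uid
                ) THEN 1 ELSE 0 END AS is_first
    FROM t t1
  ) flagged
) x
GROUP BY dt   -- step 3: one row per date
ORDER BY dt
"""

result = conn.execute(staged_sql).fetchall()
print(result)
```

Unlike a query built only on each uid's first date, this keeps dates on which nothing new appears. It does assume each (dt, uid) pair occurs at most once: if a uid is duplicated on its first date, every copy is flagged as a first appearance and counted.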


小智 6

The simplest way:

SELECT *, count(*) over (order by fst_date ) cum_uids
  FROM (
SELECT uid, min(date) fst_date FROM t GROUP BY uid
 ) t

Or something along those lines.