Count of unique values in a rolling date range in R

Isa*_*aac 5 sql r time-series correlated-subquery data.table

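This question already has an SQL answer, and I was able to implement that solution in R using sqldf. However, I have been unable to find a way to implement it using data.table.

The problem is to count the distinct values of one column within a rolling date range, given data like the following (quoting directly from the linked question):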

Date   | email 
-------+----------------
1/1/12 | test@test.com
1/1/12 | test1@test.com
1/1/12 | test2@test.com
1/2/12 | test1@test.com
1/2/12 | test2@test.com
1/3/12 | test@test.com
1/4/12 | test@test.com
1/5/12 | test@test.com
1/5/12 | test@test.com
1/6/12 | test@test.com
1/6/12 | test@test.com
1/6/12 | test1@test.com

If we use a date period of 3 days, the result set would look like this:

date   | count(distinct email)
-------+------
1/1/12 | 3
1/2/12 | 3
1/3/12 | 3
1/4/12 | 3
1/5/12 | 2
1/6/12 | 2

Here is code to create the same data in R with data.table:

library(data.table)

date <- as.Date(c('2012-01-01','2012-01-01','2012-01-01',
                  '2012-01-02','2012-01-02','2012-01-03',
                  '2012-01-04','2012-01-05','2012-01-05',
                  '2012-01-06','2012-01-06','2012-01-06'))
email <- c('test@test.com', 'test1@test.com','test2@test.com',
           'test1@test.com', 'test2@test.com','test@test.com',
           'test@test.com','test@test.com','test@test.com',
           'test@test.com','test@test.com','test1@test.com')
dt <- data.table(date, email)
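
Any help with this would be much appreciated. Thanks!

Edit 1:

This is a toy problem that I want to apply to a much larger dataset, so using a Cartesian product is problematic. Instead, I would like something equivalent to a correlated subquery in SQL; for example, the solution to the question I originally linked is: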

SELECT day
     ,(SELECT count(DISTINCT email)
       FROM   tbl
       WHERE  day BETWEEN t.day - 2 AND t.day -- period of 3 days
      ) AS dist_emails
FROM   tbl t
WHERE  day BETWEEN '2012-01-01' AND '2012-01-06'  
GROUP  BY 1
ORDER  BY 1;
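For reference, the semantics of that correlated subquery can be sketched directly in R: for each distinct day, filter the table to the 3-day window ending on that day and count the distinct emails. This is only a baseline for checking results on the toy data (it rescans the table once per day), and the names `days` and `dist_emails` are chosen here for illustration. On this data the window ending 2012-01-05 contains only test@test.com, so the sketch gives 1 for that day rather than the 2 shown in the table quoted from the linked question.

```r
library(data.table)

# Toy data from the question
dt <- data.table(
  date = as.Date(c('2012-01-01','2012-01-01','2012-01-01',
                   '2012-01-02','2012-01-02','2012-01-03',
                   '2012-01-04','2012-01-05','2012-01-05',
                   '2012-01-06','2012-01-06','2012-01-06')),
  email = c('test@test.com', 'test1@test.com','test2@test.com',
            'test1@test.com', 'test2@test.com','test@test.com',
            'test@test.com','test@test.com','test@test.com',
            'test@test.com','test@test.com','test1@test.com')
)

days <- sort(unique(dt$date))

# One "subquery" per day: distinct emails seen in [d - 2, d]
dist_emails <- vapply(seq_along(days), function(i) {
  d <- days[i]
  uniqueN(dt[date >= d - 2 & date <= d, email])
}, integer(1))

baseline <- data.table(day = days, dist_emails = dist_emails)
print(baseline)
```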

Edit 2: Following @MichaelChirico's solution, and at @jangorecki's request, here are some timings:

# The data
> dim(temp)
[1] 2627785       4
> head(temp)
         date category1 category2 itemId
1: 2013-11-08         0         2   1713
2: 2013-11-08         0         2  90485
3: 2013-11-08         0         2  74249
4: 2013-11-08         0         2   2592
5: 2013-11-08         0         2   2592
6: 2013-11-08         0         2    765
> uniqueN(temp$itemId)
[1] 13510
> uniqueN(temp$date)
[1] 127

# Timing for data.table
> system.time(dtTime <- temp[,
+   .(count = temp[.(seq.Date(.BY$date - 6L, .BY$date, "day"), 
+   .BY$category1, .BY$category2 ), uniqueN(itemId), nomatch = 0L]), 
+  by = c("date","category1","category2")])
   user  system elapsed 
  6.913   0.130   6.940 
> 
# Time for sqldf
> system.time(sqlDfTime <- 
+ sqldf(c("create index ldx on temp(date, category1, category2)",
+ "SELECT date, category1, category2,
+ (SELECT count(DISTINCT itemId)
+   FROM   temp
+   WHERE category1 = t.category1 AND category2 = t.category2 AND
+   date BETWEEN t.date - 6 AND t.date 
+   ) AS numItems
+ FROM temp t
+ GROUP BY date, category1, category2
+ ORDER BY 1;"))
   user  system elapsed 
 87.225   0.098  87.295 

The output is equivalent, but using data.table instead of sqldf gives roughly a 12.5x speedup (6.9 s versus 87.3 s elapsed). Quite substantial!

Mic*_*ico 7

This works, taking advantage of the new non-equi join feature of data.table.

dt[dt[ , .(date3=date, date2 = date - 2, email)], 
   on = .(date >= date2, date<=date3), 
   allow.cartesian = TRUE
   ][ , .(count = uniqueN(email)), 
      by = .(date = date + 2)]
#          date count
# 1: 2012-01-01     3
# 2: 2012-01-02     3
# 3: 2012-01-03     3
# 4: 2012-01-04     3
# 5: 2012-01-05     1
# 6: 2012-01-06     2

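To be honest, I'm a bit miffed at exactly how this works, but the idea is to join dt to itself on date, matching any date that falls between 2 days ago and today. I'm not sure why we have to clean up by setting date = date + 2 afterwards.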
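One way to see where the shift comes from is a small sketch, assuming a data.table version that supports non-equi joins and the `x.`/`i.` column prefixes in j. The names `win`, `win_start` and `win_end` are made up here for illustration. In the joined result, the column named date carries the value of the lower bound from the i side (the window start), not the matched row's own date, which would explain why adding 2 recovers the window's end date. Naming the columns explicitly in j avoids the correction:

```r
library(data.table)

# Toy data from the question
dt <- data.table(
  date = as.Date(c('2012-01-01','2012-01-01','2012-01-01',
                   '2012-01-02','2012-01-02','2012-01-03',
                   '2012-01-04','2012-01-05','2012-01-05',
                   '2012-01-06','2012-01-06','2012-01-06')),
  email = c('test@test.com', 'test1@test.com','test2@test.com',
            'test1@test.com', 'test2@test.com','test@test.com',
            'test@test.com','test@test.com','test@test.com',
            'test@test.com','test@test.com','test1@test.com')
)

# Hypothetical helper table: one window [win_start, win_end] per row of dt
win <- dt[, .(win_start = date - 2, win_end = date)]

# Plain join: the result's `date` column holds win_start values
raw <- dt[win, on = .(date >= win_start, date <= win_end),
          allow.cartesian = TRUE]

# Explicit j: keep the window end from i and the email from x
pairs <- dt[win, on = .(date >= win_start, date <= win_end),
            .(date = i.win_end, email = x.email),
            allow.cartesian = TRUE]

counts <- pairs[, .(count = uniqueN(email)), by = date]
print(counts)
```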


Here is an approach using keys:

setkey(dt, date)

dt[ , .(count = dt[.(seq.Date(.BY$date - 2L, .BY$date, "day")),
                   uniqueN(email), nomatch = 0L]), by = date]
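As a variation on the keyed approach, here is a minimal sketch that wraps the same per-group lookup in a function with an adjustable window. The function name `rolling_distinct` and its `width` argument are invented here, and it uses an ad hoc `on = "date"` join instead of `setkey`, on the assumption that you would rather not reorder dt by reference. It expects columns named date and email.

```r
library(data.table)

# Toy data from the question
dt <- data.table(
  date = as.Date(c('2012-01-01','2012-01-01','2012-01-01',
                   '2012-01-02','2012-01-02','2012-01-03',
                   '2012-01-04','2012-01-05','2012-01-05',
                   '2012-01-06','2012-01-06','2012-01-06')),
  email = c('test@test.com', 'test1@test.com','test2@test.com',
            'test1@test.com', 'test2@test.com','test@test.com',
            'test@test.com','test@test.com','test@test.com',
            'test@test.com','test@test.com','test1@test.com')
)

# Hypothetical helper: distinct emails in the `width`-day window ending on each date
rolling_distinct <- function(tbl, width = 3L) {
  tbl[, .(count = tbl[
            # Look up every day in the window; days with no rows are dropped
            .(date = seq.Date(.BY$date - (width - 1L), .BY$date, by = "day")),
            on = "date", uniqueN(email), nomatch = 0L]),
      by = date]
}

res3 <- rolling_distinct(dt, width = 3L)
print(res3)
```

With `width = 1L` the window collapses to a single day, so the function returns the per-day distinct count.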