sta*_*123 5 r dataframe dplyr data.table
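This is my first post on Stack Overflow, so apologies if it is not detailed enough.
I have a data table with two columns (date and group ID). For each row, I want to count how many times that row's group occurred in the x days before the row's date. For the example below, say the past 30 days.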
library(data.table)

date = c("2014-04-01", "2014-04-12", "2014-04-07", "2014-05-03", "2014-04-14", "2014-05-04", "2014-03-31", "2014-04-18", "2014-04-23", "2014-04-01")
group = c("G","G","F","G","E","E","H","H","H","A")
# both columns are character here; date is converted with as.Date() later
dt = data.table(group, date)
    group       date
 1:     G 2014-04-01
 2:     G 2014-04-12
 3:     F 2014-04-07
 4:     G 2014-05-03
 5:     E 2014-04-14
 6:     E 2014-05-04
 7:     H 2014-03-31
 8:     H 2014-04-18
 9:     H 2014-04-23
10:     A 2014-04-01
So the new column I want would look like this:
    group       date count
 1:     G 2014-04-01     0
 2:     G 2014-04-12     1
 3:     F 2014-04-07     0
 4:     G 2014-05-03     1 (not including first G since it is outside 30 days)
 5:     E 2014-04-14     0
 6:     E 2014-05-04     1
 7:     H 2014-03-31     0
 8:     H 2014-04-18     1
 9:     H 2014-04-23     2
10:     A 2014-04-01     0
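Using dplyr, I was able to compute a non-windowed count, i.e. the number of earlier occurrences of the group as of the current row, but I am struggling to find a way to do the 30-day count. For the non-windowed count, I did the following: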
library(dplyr)

# number of earlier rows in the same group, in row order
dt = data.table(dt %>%
  group_by(group) %>%
  mutate(count = row_number() - 1))
    group       date count
 1:     G 2014-04-01     0
 2:     G 2014-04-12     1
 3:     F 2014-04-07     0
 4:     G 2014-05-03     2
 5:     E 2014-04-14     0
 6:     E 2014-05-04     1
 7:     H 2014-03-31     0
 8:     H 2014-04-18     1
 9:     H 2014-04-23     2
10:     A 2014-04-01     0
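This is a small sample; the full data set has a few million rows, so I need something efficient. Any tips or suggestions would be greatly appreciated. Thanks in advance!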
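A data.table option. It anchors the 30-day window at each group's first date rather than at each row's own date, which reproduces the desired output for the sample data: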
dt[, date := as.Date(date)][, count := cumsum(date <= first(date) + 30) - 1, group]
which gives
> dt
    group       date count
 1:     G 2014-04-01     0
 2:     G 2014-04-12     1
 3:     F 2014-04-07     0
 4:     G 2014-05-03     1
 5:     E 2014-04-14     0
 6:     E 2014-05-04     1
 7:     H 2014-03-31     0
 8:     H 2014-04-18     1
 9:     H 2014-04-23     2
10:     A 2014-04-01     0
A dplyr option following the same idea
dt %>%
  mutate(date = as.Date(date)) %>%
  group_by(group) %>%
  mutate(count = cumsum(date <= first(date) + 30) - 1) %>%
  ungroup()
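Both options above measure the 30 days from each group's first date. If the window should instead slide with each row's own date, as the question describes, a non-equi self-join in data.table is one possible approach. The following is only a minimal sketch under some assumptions not stated in the question: the window is taken as `date - 30 <= earlier date < date` (inclusive at exactly 30 days, and excluding same-day rows), and the names `window_days`, `lower` and `counts` are made up for illustration.

```r
library(data.table)

# Sample data from the question, with date stored as Date
dt <- data.table(
  group = c("G", "G", "F", "G", "E", "E", "H", "H", "H", "A"),
  date  = as.Date(c("2014-04-01", "2014-04-12", "2014-04-07", "2014-05-03",
                    "2014-04-14", "2014-05-04", "2014-03-31", "2014-04-18",
                    "2014-04-23", "2014-04-01"))
)

window_days <- 30  # hypothetical name for the "x days" in the question

# Lower bound of each row's own window (assumed inclusive)
dt[, lower := date - window_days]

# Non-equi self-join: for every row of the inner dt (i), match rows of the
# outer dt (x) in the same group with i.lower <= x.date < i.date.
# by = .EACHI evaluates .N once per row of i, in the order of i.
counts <- dt[dt, on = .(group, date >= lower, date < date), .N, by = .EACHI]$N

dt[, count := counts]
dt[, lower := NULL]  # drop the helper column

print(dt)
```

On the sample data this should reproduce the desired `count` column, including the fourth row, where the first G falls outside the window. Whether it is fast enough for a few million rows has not been benchmarked here, and the boundary handling (strict versus inclusive, same-day rows) would need adjusting to the actual requirement.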