I have a dataset like the following:
library(data.table)
dt1 <- data.table(urn = c(rep("a", 5), rep("b", 4)),
                  amount = c(10, 12, 23, 15, 19, 42, 11, 5, 10),
                  date = as.Date(c("2016-01-01", "2017-01-02", "2017-02-04",
                                   "2017-04-19", "2018-02-11", "2016-02-14",
                                   "2017-05-06", "2017-05-12", "2017-12-12")))
dt1
# urn amount date
# 1: a 10 2016-01-01
# 2: a 12 2017-01-02
# 3: a 23 2017-02-04
# 4: a 15 2017-04-19
# 5: a 19 2018-02-11
# 6: b 42 2016-02-14
# 7: b 11 2017-05-06
# 8: b 5 2017-05-12
# 9: b 10 2017-12-12
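I am trying to determine a cumulative value over the past 12 months within each group. I know I can use shift with data.table to look backwards or forwards. The big challenge, which I can't get my head around, is how to know how many records to sum, since that number changes depending on how many records each urn has.
The type of result I'm looking for is: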
dt1
# urn amount date summed12m
# 1: a 10 2016-01-01 10
# 2: a 12 2017-01-02 12
# 3: a 23 2017-02-04 35
# 4: a 15 2017-04-19 50
# 5: a 19 2018-02-11 34
# 6: b 42 2016-02-14 42
# 7: b 11 2017-05-06 11
# 8: b 5 2017-05-12 16
# 9: b 10 2017-12-12 26
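Ideally I'd like a data.table solution because of my data volume, but I am open to other options if they may be more efficient on a table with about 12M records.
As an alternative to foverlaps(), this can also be solved by aggregating in a non-equi join: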
library(lubridate)
dt1[, summed12m := dt1[.(urn, date, date %m-% months(12)),
                       on = .(urn = V1, date <= V2, date >= V3),
                       sum(amount), by = .EACHI]$V1][]
#    urn amount       date summed12m
# 1:   a     10 2016-01-01        10
# 2:   a     12 2017-01-02        12
# 3:   a     23 2017-02-04        35
# 4:   a     15 2017-04-19        50
# 5:   a     19 2018-02-11        34
# 6:   b     42 2016-02-14        42
# 7:   b     11 2017-05-06        11
# 8:   b      5 2017-05-12        16
# 9:   b     10 2017-12-12        26
lubridate is used for the date arithmetic to avoid surprises when one of the dates is February 29.
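To illustrate the February 29 point, here is a minimal sketch comparing plain period subtraction with lubridate's %m-% operator. The variable names (leap_day, naive_shift, safe_shift) are made up for this illustration and do not appear in the answer.

```r
library(lubridate)

leap_day <- as.Date("2016-02-29")

# Subtracting a one-year period lands on 2015-02-29, which does not exist,
# so lubridate returns NA.
naive_shift <- leap_day - years(1)

# %m-% rolls back to the last valid day of the target month instead.
safe_shift <- leap_day %m-% months(12)

print(naive_shift)  # NA
print(safe_shift)   # "2015-02-28"
```

With %m-%, a row dated February 29 still gets a valid window start rather than an NA that would silently drop it from the join.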
The key part is the non-equi join
dt1[.(urn, date, date %m-% months(12)),
    on = .(urn = V1, date <= V2, date >= V3),
    sum(amount), by = .EACHI]
#    urn       date       date V1
# 1:   a 2016-01-01 2015-01-01 10
# 2:   a 2017-01-02 2016-01-02 12
# 3:   a 2017-02-04 2016-02-04 35
# 4:   a 2017-04-19 2016-04-19 50
# 5:   a 2018-02-11 2017-02-11 34
# 6:   b 2016-02-14 2015-02-14 42
# 7:   b 2017-05-06 2016-05-06 11
# 8:   b 2017-05-12 2016-05-12 16
# 9:   b 2017-12-12 2016-12-12 26
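where the last column is picked to create the new summed12m column in dt1.
The OP has asked where V1, V2, and V3 come from.
The expression .(urn, date, date %m-% months(12)) creates a new data.table on the fly (.() is a data.table shortcut for list()). Since no column names are specified, data.table creates the default column names V1, V2, etc.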
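The default naming can be seen in isolation with a small sketch. It uses as.data.table() on a plain list as a rough stand-in for what happens to the list passed as i; the objects unnamed and named are hypothetical and exist only for this illustration.

```r
library(data.table)

# A list without element names: data.table falls back to V1, V2, ...
unnamed <- as.data.table(list(c("a", "b"),
                              as.Date(c("2016-01-01", "2017-01-02"))))
print(names(unnamed))  # "V1" "V2"

# The same list with element names: the names are carried over as column names.
named <- as.data.table(list(urn = c("a", "b"),
                            end = as.Date(c("2016-01-01", "2017-01-02"))))
print(names(named))    # "urn" "end"
```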
Less sloppily, the expression can be rewritten with explicitly named columns:
dt1[.(urn = urn, end = date, start = date %m-% months(12)),
    on = .(urn, date <= end, date >= start),
    sum(amount), by = .EACHI]
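As a sanity check on what the join computes, the following sketch recomputes the 12-month sums with a slow row-by-row loop and compares them with the named non-equi join. It assumes the window for each row is the closed interval from the date minus 12 months up to the date itself, within the same urn, as the join conditions state. The names brute and joined are made up for this check, and the loop is meant for small data only, not for the 12M-row table.

```r
library(data.table)
library(lubridate)

dt1 <- data.table(urn = c(rep("a", 5), rep("b", 4)),
                  amount = c(10, 12, 23, 15, 19, 42, 11, 5, 10),
                  date = as.Date(c("2016-01-01", "2017-01-02", "2017-02-04",
                                   "2017-04-19", "2018-02-11", "2016-02-14",
                                   "2017-05-06", "2017-05-12", "2017-12-12")))

# Brute force: for each row k, sum the amounts of the same urn whose date
# falls in [date_k - 12 months, date_k].
brute <- vapply(seq_len(nrow(dt1)), function(k) {
  end <- dt1$date[k]
  start <- end %m-% months(12)
  in_window <- dt1$urn == dt1$urn[k] & dt1$date >= start & dt1$date <= end
  sum(dt1$amount[in_window])
}, numeric(1))

# The non-equi join with explicitly named columns, as in the answer.
joined <- dt1[.(urn = urn, end = date, start = date %m-% months(12)),
              on = .(urn, date <= end, date >= start),
              sum(amount), by = .EACHI]$V1

print(brute)
print(all.equal(brute, joined))
```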
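This is screaming for foverlaps. This is my first time using foverlaps, so I'm pretty sure some of the experts here could make better use of the function. It goes like this: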
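# foverlaps() needs an interval on both sides, so give each row a zero-width one
dt1[, date2 := date]
# One look-back window per row: from the same day one year earlier up to date
rng <- dt1[, .(urn, enddate = date,
               startdate = as.Date(paste(year(date) - 1, month(date), mday(date), sep = "-")))]
setkey(rng, urn, startdate, enddate)
# Match each row to every window that contains it, then sum per window
foverlaps(dt1, rng, by.x = c("urn", "date", "date2"), type = "within")[,
  sum(amount), by = .(urn, enddate)]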
# urn enddate V1
# 1: a 2016-01-01 10
# 2: a 2017-01-02 12
# 3: a 2017-02-04 35
# 4: a 2017-04-19 50
# 5: a 2018-02-11 34
# 6: b 2016-02-14 42
# 7: b 2017-05-06 11
# 8: b 2017-05-12 16
# 9: b 2017-12-12 26
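The foverlaps() call returns a separate aggregated table rather than a new column. One possible way to attach it to dt1 as summed12m is an update join, sketched below. This is an assumption about how the result might be used and is not part of the answer. It also assumes each (urn, date) pair occurs only once, because the sums are grouped by urn and enddate. The name sums is made up for this sketch.

```r
library(data.table)

dt1 <- data.table(urn = c(rep("a", 5), rep("b", 4)),
                  amount = c(10, 12, 23, 15, 19, 42, 11, 5, 10),
                  date = as.Date(c("2016-01-01", "2017-01-02", "2017-02-04",
                                   "2017-04-19", "2018-02-11", "2016-02-14",
                                   "2017-05-06", "2017-05-12", "2017-12-12")))

# Same steps as in the answer: zero-width intervals plus one window per row.
dt1[, date2 := date]
rng <- dt1[, .(urn, enddate = date,
               startdate = as.Date(paste(year(date) - 1, month(date), mday(date), sep = "-")))]
setkey(rng, urn, startdate, enddate)

# Aggregate per window and give the sum a descriptive name.
sums <- foverlaps(dt1, rng, by.x = c("urn", "date", "date2"), type = "within")[
  , .(summed12m = sum(amount)), by = .(urn, enddate)]

# Update join: copy each window's sum onto the row whose date ends that window.
dt1[sums, on = .(urn, date = enddate), summed12m := i.summed12m]
dt1[, date2 := NULL]  # drop the helper column again
print(dt1)
```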