arg*_*t91 3 r date time-series rolling-computation
I have the following data.frame:
grp nr yr
1: A 1.0 2009
2: A 2.0 2009
3: A 1.5 2009
4: A 1.0 2010
5: B 3.0 2009
6: B 2.0 2010
7: B NA 2011
8: C 3.0 2014
9: C 3.0 2019
10: C 3.0 2020
11: C 4.0 2021
Desired output:
grp nr yr nr_roll_period_3
1 A 1.0 2009 NA
2 A 2.0 2009 NA
3 A 1.5 2009 NA
4 A 1.0 2010 NA
5 B 3.0 2009 NA
6 B 2.0 2010 NA
7 B NA 2011 NA
8 C 3.0 2014 NA
9 C 3.0 2019 NA
10 C 3.0 2020 NA
11 C 4.0 2021 3.333333
Logic:
Currently I have this function:
calculate_rolling_window <-
  function(dt, date_col, calc_col, id, k) {
    require(data.table)
    return(setDT(dt)[
      , paste(calc_col, "roll_period", k, sep = "_") :=
        sapply(get(date_col), function(x) mean(get(calc_col)[between(get(date_col), x - k + 1, x)])),
      by = mget(id)])
  }
It works for the regular case, where the date column contains no duplicates. With duplicates, however, it fails:
grp nr yr nr_roll_period_3
1: A 1.0 2009 1.500000
2: A 2.0 2009 1.500000
3: A 1.5 2009 1.500000
4: A 1.0 2010 1.375000
5: B 3.0 2009 NA
6: B 2.0 2010 NA
7: B NA 2011 NA
8: C 3.0 2014 NA
9: C 3.0 2019 NA
10: C 3.0 2020 NA
11: C 4.0 2021 3.333333
Any ideas on how to handle this? The solution does not have to be data.table-specific.
This can be solved by grouping in a non-equi join to aggregate over a rolling window of length k, filtering for k consecutive years, and an update join:
library(data.table)
k <- 3L
# group by join parameters of a non-equi join
mDT <- setDT(DT)[.(grp = grp, upper = yr, lower = yr - k),
                 on = .(grp, yr <= upper, yr > lower),
                 .(uniqueN(x.yr), mean(nr)), by = .EACHI]
# update join with filtered intermediate result
DT[mDT[V1 == k], on = .(grp, yr), paste0("nr_roll_period_", k) := V2]
DT
which returns OP's expected result:
grp nr yr nr_roll_period_3
1: A 1.0 2009 NA
2: A 2.0 2009 NA
3: A 1.5 2009 NA
4: A 1.0 2010 NA
5: B 3.0 2009 NA
6: B 2.0 2010 NA
7: B NA 2011 NA
8: C 3.0 2014 NA
9: C 3.0 2019 NA
10: C 3.0 2020 NA
11: C 4.0 2021 3.333333
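The intermediate result mDT contains the rolling mean V2 over k periods and the count of unique/distinct years V1 within each period. It is created by a non-equi join of DT with a data.table containing the upper and lower limits, which is created on-the-fly by .(grp = grp, upper = yr, lower = yr - k).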
mDT
grp yr yr V1 V2
1: A 2009 2006 1 1.500000
2: A 2009 2006 1 1.500000
3: A 2009 2006 1 1.500000
4: A 2010 2007 2 1.375000
5: B 2009 2006 1 3.000000
6: B 2010 2007 2 2.500000
7: B 2011 2008 3 NA
8: C 2014 2011 1 3.000000
9: C 2019 2016 1 3.000000
10: C 2020 2017 2 3.000000
11: C 2021 2018 3 3.333333
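To make the on-the-fly limits easier to follow, here is a minimal sketch that builds the limits table explicitly before joining. The names bounds, win, n_years and roll_mean are illustrative assumptions and do not appear in the answer; the join itself should be equivalent to the one that creates mDT.

```r
library(data.table)

k <- 3L
DT <- data.table(
  grp = c("A", "A", "A", "A", "B", "B", "B", "C", "C", "C", "C"),
  nr  = c(1, 2, 1.5, 1, 3, 2, NA, 3, 3, 3, 4),
  yr  = c(2009L, 2009L, 2009L, 2010L, 2009L, 2010L, 2011L,
          2014L, 2019L, 2020L, 2021L)
)

# Explicit limits table (hypothetical name): one row per row of DT,
# describing the half-open window (yr - k, yr] within the same group.
bounds <- DT[, .(grp, upper = yr, lower = yr - k)]

# Non-equi join: for each row of bounds, aggregate all rows of DT in the
# same group whose year falls inside the window. by = .EACHI evaluates j
# once per row of bounds, so win is row-aligned with DT.
win <- DT[bounds,
          on = .(grp, yr <= upper, yr > lower),
          .(n_years = uniqueN(x.yr), roll_mean = mean(nr)),
          by = .EACHI]
```

Naming the aggregates in j replaces the default V1 and V2 column names, which may make the later filter step easier to read.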
mDT is then filtered for rows that contain exactly k distinct years:
mDT[V1 == k]
grp yr yr V1 V2
1: B 2011 2008 3 NA
2: C 2021 2018 3 3.333333
Finally, the filtered result is joined with DT to append the new column to DT.
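Since the question does not require data.table, the same three steps (window, distinct-year check, assignment) can be cross-checked with a plain base R loop. This is only a sketch: rolling_mean_full_window is a hypothetical helper name, and it assumes integer-valued years, so that k distinct years inside a window of width k are necessarily consecutive.

```r
df <- data.frame(
  grp = c("A", "A", "A", "A", "B", "B", "B", "C", "C", "C", "C"),
  nr  = c(1, 2, 1.5, 1, 3, 2, NA, 3, 3, 3, 4),
  yr  = c(2009, 2009, 2009, 2010, 2009, 2010, 2011, 2014, 2019, 2020, 2021)
)

# Hypothetical helper: rolling mean over the last k years per group,
# kept only where the window covers exactly k distinct years.
rolling_mean_full_window <- function(df, k) {
  out <- rep(NA_real_, nrow(df))
  for (i in seq_len(nrow(df))) {
    # rows of the same group whose year lies in (yr[i] - k, yr[i]]
    in_window <- df$grp == df$grp[i] &
      df$yr > df$yr[i] - k &
      df$yr <= df$yr[i]
    # incomplete windows stay NA
    if (length(unique(df$yr[in_window])) == k) {
      out[i] <- mean(df$nr[in_window])
    }
  }
  out
}

df$nr_roll_period_3 <- rolling_mean_full_window(df, k = 3)
```

The loop is quadratic in the number of rows, so it is meant for checking the logic on small inputs rather than as a replacement for the join.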
Note that mean() returns NA by default if there is an NA in the input data.
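A short illustration of that default, using the nr values of group B; the variable names are illustrative only:

```r
x <- c(3, 2, NA)                    # nr values of group B

m_default <- mean(x)                # NA: the missing value propagates
m_na_rm   <- mean(x, na.rm = TRUE)  # 2.5: mean of the non-missing values
```

If NA values should be ignored instead, passing na.rm = TRUE to mean() inside the join would be one option. Be aware that the 2011 row of group B would then receive 2.5 rather than the NA shown in the expected output.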
library(data.table)
DT <- fread(text = "rn grp nr yr
1: A 1.0 2009
2: A 2.0 2009
3: A 1.5 2009
4: A 1.0 2010
5: B 3.0 2009
6: B 2.0 2010
7: B NA 2011
8: C 3.0 2014
9: C 3.0 2019
10: C 3.0 2020
11: C 4.0 2021", drop = 1L)