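Is there a faster way to subset a data.table within groups, keeping the 3 or more most recent dates that occur within 2 years?
I have a data set with a grouping column and a date column, and I need to keep the 3 or more most recent dates that occur within 2 years.
I wrote a function that takes the data.table of one subject and uses a for loop to compute date[i] - date. It breaks out of the loop as soon as at least 3 dates fall within the 2-year window, then subsets the rows. The function is applied to the data sorted by group in ascending order and by date in descending order. It works, but it takes quite a long time to run.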
# Loading packages --------------------------------------------------------
library(data.table)
library(lubridate)
# Generating test data ----------------------------------------------------
test.dt <- data.table(group = c(rep(1, times = 17),
                                rep(2, times = 30),
                                rep(3, times = 7),
                                rep(4, times = 9),
                                rep(5, times = 8)),
                      date = c(sample(seq(dmy('28/8/2007'), dmy('3/10/2017'),
                                          by = 'day'), 17),
                               sample(seq(dmy('7/5/2007'), dmy('19/4/2016'),
                                          by = 'day'), 30),
                               sample(seq(dmy('28/12/2011'), dmy('3/10/2013'),
                                          by = 'day'), 7),
                               sample(seq(dmy('21/12/2007'), dmy('11/11/2010'),
                                          by = 'day'), 9),
                               sample(seq(dmy('27/8/2007'), dmy('5/2/2012'),
                                          by = 'day'), 8)))
# Filtering function ------------------------------------------------------
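filter.fun <- function(dt) {
  for (i in seq_len(length(dt$date) - 2)) {
    # Flag rows dated at most 2 years before date[i] (date[i] itself included)
    temp <- dt[date[i] - date <= dyears(2) &
                 date[i] - date >= dyears(0),
               less_than_2years := "Yes"]
    if (temp[less_than_2years == "Yes", .N] >= 3) {
      # At least 3 dates fall inside the window: keep them and stop
      filtered_temp <- temp[less_than_2years == "Yes"
                            ][, less_than_2years := NULL]
      return(filtered_temp)
    } else {
      # Too few dates: clear the flag and try the next anchor date
      temp[, less_than_2years := NULL]
    }
  }
}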
# Sorting data.table ------------------------------------------------------
setorder(test.dt, group, -date)
# Applying function to data.table by grouping -----------------------------
test.dt <- test.dt[
  , {
    SD_copy <- copy(.SD)
    filter.fun(SD_copy)},
  by = group]
Applying the function to the full data set (about 139,000 rows) takes roughly 76.268 seconds. Is there a more efficient or faster way to do this?
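It seems I had misunderstood the OP's requirements.
Now I understand that the OP wants to find, for each group, the most recent three or more dates that occur within a 2-year period.
This can be solved by aggregating in a non-equi join, which counts for every row the dates of the same group in the 2 years up to and including it (requirements (1) and (3)), then filtering for counts of at least three (requirement (4)) and keeping only the most recent such window per group (requirement (2)). Finally, the indices of the affected rows of test.dt are retrieved: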
setorder(test.dt, group, -date)
idx <- test.dt[.(group = group, upper = date, lower = date - years(2)),
               on = .(group, date <= upper, date >= lower), .N, by = .EACHI][
                 N >= 3, seq(.I[1L], length.out = N[1L]), by = group]$V1
test.dt[idx]
    group       date idx     age_yr
 1:     1 2017-03-08   1 0.00000000
 2:     1 2016-10-27   2 0.36164384
 3:     1 2016-09-19   3 0.46575342
 4:     1 2015-05-27   4 1.78356164
 5:     2 2016-04-17   1 0.00000000
 6:     2 2016-03-24   2 0.06575342
 7:     2 2015-09-16   3 0.58630137
 8:     2 2015-02-09   4 1.18630137
 9:     2 2014-09-19   5 1.57808219
10:     2 2014-08-24   6 1.64931507
11:     2 2014-06-01   7 1.87945205
12:     2 2014-05-09   8 1.94246575
13:     2 2014-04-21   9 1.99178082
14:     3 2013-07-02   1 0.00000000
15:     3 2013-04-13   2 0.21917808
16:     3 2013-03-18   3 0.29041096
17:     3 2012-10-31   4 0.66849315
18:     3 2012-10-30   5 0.67123288
19:     3 2012-10-03   6 0.74520548
20:     3 2012-06-01   7 1.08493151
21:     4 2010-08-06   1 0.00000000
22:     4 2009-11-17   2 0.71780822
23:     4 2009-06-19   3 1.13150685
24:     4 2009-04-15   4 1.30958904
25:     4 2009-02-20   5 1.45753425
26:     4 2008-11-18   6 1.71506849
27:     4 2008-10-24   7 1.78356164
28:     5 2011-07-13   1 0.00000000
29:     5 2011-01-19   2 0.47945205
30:     5 2010-07-18   3 0.98630137
31:     5 2009-10-10   4 1.75616438
    group       date idx     age_yr
Note that I used set.seed(1L) when creating test.dt, the same as in IceCreamToucan's answer, so that the two results can be compared.
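To make the counting step of the grouped non-equi join easier to follow, here is a minimal sketch on a made-up table. The names toy.dt, bounds and counts and all values are hypothetical, and the window is simplified to 730 days instead of the `date - years(2)` used in the answer, so that only data.table is needed.

```r
library(data.table)

# Hypothetical toy data: a single group with four dates
toy.dt <- data.table(
  group = 1L,
  date  = as.Date(c("2017-01-01", "2014-06-01", "2014-01-01", "2013-01-01"))
)
setorder(toy.dt, group, -date)

# One window per row: [date - 730 days, date]
# (730 days is a simplifying assumption standing in for 2 years)
bounds <- toy.dt[, .(group, upper = date, lower = date - 730L)]

# For every window, count the rows of toy.dt in the same group that fall
# inside it; by = .EACHI returns one count per row of bounds
counts <- toy.dt[bounds,
                 on = .(group, date <= upper, date >= lower),
                 .N, by = .EACHI]
counts$N
```

In this sketch only the window ending on 2014-06-01 contains three dates, so it is the first row (in descending date order) that would pass an `N >= 3` filter.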
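If I understand correctly, the OP wants to keep, for each group, either the 3 most recent dates (regardless of how old they are) or all dates that occurred within the last 2 years counted from the most recent date (even if there are more than 3).
The approach below uses the data.table special symbol .I, which holds the row numbers (or indices) of the original data.table x while grouping.
So, the indices of the three most recent dates in each group can be determined by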
setorder(test.dt, group, -date)
test.dt[, .I[1:3], keyby = group]
    group V1
 1:     1  1
 2:     1  2
 3:     1  3
 4:     2 18
 5:     2 19
 6:     2 20
 7:     3 48
 8:     3 49
 9:     3 50
10:     4 55
11:     4 56
12:     4 57
13:     5 64
14:     5 65
15:     5 66
16:     6 72
17:     6 73
18:     6 74
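As a small illustration of how .I behaves under grouping, the following sketch uses a made-up two-group table (toy.dt and first_two are hypothetical names). It picks the first two rows of each group and shows that the returned numbers refer to rows of the whole table, not to positions within a group.

```r
library(data.table)

# Hypothetical toy data, already sorted by group and descending date
toy.dt <- data.table(
  group = c(1, 1, 1, 2, 2),
  date  = as.Date(c("2017-01-01", "2016-01-01", "2015-01-01",
                    "2016-06-01", "2016-05-01"))
)

# .I holds row numbers of toy.dt itself, so group 2 yields 4 and 5
first_two <- toy.dt[, .I[1:2], by = group]$V1
first_two

# The indices can be used directly to subset the original table
toy.dt[first_two]
```

For a group with fewer rows than requested, an index expression such as `.I[1:3]` returns NA for the missing positions; `head(.I, 3L)` is one way to avoid that.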
The indices of the dates that occurred within the last two years counted from the most recent date can be determined by:
test.dt[, .I[max(date) <= date %m+% years(2)], keyby = group]
Here, lubridate's date arithmetic is used to avoid leap-year issues.
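The leap-year issue can be seen with a single date. The sketch below (leap_day, plain and rolled are hypothetical names) compares plain period addition with `%m+%` for 29 February.

```r
library(lubridate)

leap_day <- as.Date("2016-02-29")

# Plain period addition gives NA because 2018-02-29 does not exist
plain <- leap_day + years(2)

# %m+% rolls back to the last valid day of the target month instead
rolled <- leap_day %m+% years(2)

plain
rolled
```

With `%m+%` the comparison `max(date) <= date %m+% years(2)` never has to deal with an NA produced by a non-existent date.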
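Both sets of indices can be combined with the set operation union(), which removes duplicate indices. The combined set of indices is then used to subset the original data.table: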
setorder(test.dt, group, -date)
test.dt[test.dt[, union(.I[1:3], .I[max(date) <= date %m+% years(2)]), keyby = group]$V1]
    group       date idx     age_yr
 1:     1 2017-04-18   1 0.00000000
 2:     1 2017-02-22   2 0.15068493
 3:     1 2016-09-15   3 0.58904110
 4:     1 2016-08-26   4 0.64383562
 5:     1 2016-07-26   5 0.72876712
 6:     1 2015-08-14   6 1.67945205
 7:     2 2016-03-26   1 0.00000000
 8:     2 2015-12-08   2 0.29863014
 9:     2 2015-11-21   3 0.34520548
10:     2 2015-05-23   4 0.84383562
11:     2 2015-04-22   5 0.92876712
12:     2 2014-06-08   6 1.80000000
13:     3 2013-07-02   1 0.00000000
14:     3 2013-05-23   2 0.10958904
15:     3 2012-10-24   3 0.68767123
16:     3 2012-10-06   4 0.73698630
17:     3 2012-06-16   5 1.04383562
18:     3 2012-03-15   6 1.29863014
19:     3 2012-01-26   7 1.43287671
20:     4 2010-07-20   1 0.00000000
21:     4 2010-02-21   2 0.40821918
22:     4 2009-11-19   3 0.66575342
23:     4 2009-08-04   4 0.95890411
24:     4 2009-01-26   5 1.47945205
25:     4 2009-01-17   6 1.50410959
26:     4 2008-07-26   7 1.98356164
27:     5 2011-04-10   1 0.00000000
28:     5 2011-04-04   2 0.01643836
29:     5 2011-04-01   3 0.02465753
30:     5 2011-03-05   4 0.09863014
31:     5 2010-12-28   5 0.28219178
32:     5 2009-08-23   6 1.63013699
33:     5 2009-08-07   7 1.67397260
34:     6 2021-02-21   1 0.00000000
35:     6 2018-12-03   2 2.22191781
36:     6 2014-09-11   3 6.45205479
    group       date idx     age_yr
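Note that idx and age_yr have been added to verify the result.
I have added a sixth group of dates, which represents the use case where 3 dates are picked regardless of their age.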
set.seed(123L) # required for reproducible data
test.dt <- data.table(
  group = c(
    rep(1, times = 17),
    rep(2, times = 30),
    rep(3, times = 7),
    rep(4, times = 9),
    rep(5, times = 8),
    rep(6, times = 5)
  ),
  date = c(
    sample(seq(dmy('28/8/2007'), dmy('3/10/2017'), by = 'day'), 17),
    sample(seq(dmy('7/5/2007'), dmy('19/4/2016'), by = 'day'), 30),
    sample(seq(dmy('28/12/2011'), dmy('3/10/2013'), by = 'day'), 7),
    sample(seq(dmy('21/12/2007'), dmy('11/11/2010'), by = 'day'), 9),
    sample(seq(dmy('27/8/2007'), dmy('5/2/2012'), by = 'day'), 8),
    sample(seq(dmy('27/8/2001'), dmy('5/2/2029'), by = 'day'), 5)
  )
)
# add data to verify result
test.dt[order(-date), idx := rowid(group)]
test.dt[, age_yr := as.integer(max(date) - date)/365, by = group]
test.dt