Rolling by group in data.table R

Hak*_*kki 1 grouping r data.table

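I am trying to roll my function over a data.table by group and am running into problems. I'm not sure whether I should change my function or whether my call is wrong. Here is a simple example:

Data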

library(data.table)
library(zoo)  # provides rollapply(), used below

test <- data.table(return=c(0.1, 0.1, 0.1, 0.1, 0.1, 0.2, 0.2, 0.2, 0.2, 0.2),
                   sec=c("A", "A", "A", "A", "A", "B", "B", "B", "B", "B"))

My function

zoo_fun <- function(dt, N) {
  (rollapply(dt$return + 1, N, FUN=prod, fill=NA, align='right') - 1)
}

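Running it (I want to create a new column, momentum, which is simply the product of the latest 3 observations, each with one added; hence the grouping by = sec):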

test[, momentum3 := zoo_fun(test, 3), by=sec]

    Warning messages:
    1: In `[.data.table`(test, , `:=`(momentum3, zoo_fun(test, 3)), by = sec) :
      RHS 1 is length 10 (greater than the size (5) of group 1). The last 5 element(s) will be discarded.
    2: In `[.data.table`(test, , `:=`(momentum3, zoo_fun(test, 3)), by = sec) :
      RHS 1 is length 10 (greater than the size (5) of group 2). The last 5 element(s) will be discarded.

I get warnings, and the result is not what I expected:

> test
    return sec momentum3
 1:    0.1   A        NA
 2:    0.1   A        NA
 3:    0.1   A     0.331
 4:    0.1   A     0.331
 5:    0.1   A     0.331
 6:    0.2   B        NA
 7:    0.2   B        NA
 8:    0.2   B     0.331
 9:    0.2   B     0.331
10:    0.2   B     0.331

I expected sec B to be filled with 0.728 ((1.2*1.2*1.2) - 1), with two NAs at the start. What am I doing wrong? Do rolling functions not work with grouping?

Uwe*_*Uwe 5

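This answer suggests using Reduce() and shift() for rolling-window problems with data.table. This benchmark shows that this may be considerably faster than zoo::rollapply().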

test[, momentum := Reduce(`*`, shift(return + 1.0, 0:2, type="lag")) - 1, by = sec][]
#    return sec momentum
# 1:    0.1   A       NA
# 2:    0.1   A       NA
# 3:    0.1   A    0.331
# 4:    0.1   A    0.331
# 5:    0.1   A    0.331
# 6:    0.2   B       NA
# 7:    0.2   B       NA
# 8:    0.2   B    0.728
# 9:    0.2   B    0.728
#10:    0.2   B    0.728
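To make the mechanics of the one-liner above easier to follow, here is a minimal sketch on a plain vector. The vector `x` and the names `lags` and `rolling_prod` are illustrative and not part of the original answer.

```r
library(data.table)

# Illustrative input: returns with 1 already added.
x <- c(1.1, 1.11, 1.12, 1.13, 1.14)

# shift() with a vector of lags returns a list holding one lagged copy
# of x per requested lag: lag 0 (x itself), lag 1, and lag 2.
lags <- shift(x, 0:2, type = "lag")

# Reduce() multiplies the list elements position by position. The NAs
# introduced by lagging propagate, so the first two results are NA.
rolling_prod <- Reduce(`*`, lags)

rolling_prod - 1
```

Each position of the result is the product of the current value and its two predecessors, which is a right-aligned window of width 3.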

Benchmark (10 rows, OP's data set)

microbenchmark::microbenchmark(
  zoo = test[, momentum := zoo_fun(return, 3), by = sec][],
  red  = test[, momentum := Reduce(`*`, shift(return + 1.0, 0:2, type="lag")) - 1, by = sec][],
  times = 100L
)
#Unit: microseconds
# expr      min       lq      mean   median        uq      max neval cld
#  zoo 2318.209 2389.131 2445.1707 2421.541 2466.1930 3108.382   100   b
#  red  562.465  625.413  663.4893  646.880  673.4715 1094.771   100  a 
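The benchmark above calls `zoo_fun(return, 3)`, that is, it passes the grouped column, whereas the question's `zoo_fun()` expects a whole data.table and so returns 10 values for every 5-row group (hence the warnings). The variant used in the benchmark is not shown in this thread. The sketch below is an assumption about what it looks like, with a vector argument `x`:

```r
library(data.table)
library(zoo)

# Assumed vector-based variant of the question's zoo_fun(): it receives
# the column values of one group instead of the whole data.table.
zoo_fun <- function(x, N) {
  rollapply(x + 1, N, FUN = prod, fill = NA, align = "right") - 1
}

test <- data.table(return = rep(c(0.1, 0.2), each = 5L),
                   sec    = rep(c("A", "B"), each = 5L))

# Inside j, `return` refers only to the rows of the current group,
# so the result has the same length as the group.
test[, momentum3 := zoo_fun(return, 3), by = sec]
test
```

With this variant, sec B yields 0.728 after two leading NAs, which is what the question expected.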

Benchmark (100k rows)

To verify the benchmark results from the small data set, a larger data set is constructed:

n_rows <- 1e4
test0 <- data.table(return = rep(as.vector(outer(1:5/100, 1:2/10, "+")), n_rows),
                   sec = rep(rep(c("A", "B"), each = 5L), n_rows))

test0
#        return sec
#     1:   0.11   A
#     2:   0.12   A
#     3:   0.13   A
#     4:   0.14   A
#     5:   0.15   A
#    ---           
# 99996:   0.21   B
# 99997:   0.22   B
# 99998:   0.23   B
# 99999:   0.24   B
#100000:   0.25   B

As test is modified in place, each benchmark run starts with a fresh copy of test0.

microbenchmark::microbenchmark(
  copy = test <- copy(test0),
  zoo  = {
    test <- copy(test0)
    test[, momentum := zoo_fun(return, 3), by = sec][]
  },
  red  = {
    test <- copy(test0)
    test[, momentum := Reduce(`*`, shift(return + 1.0, 0:2, type="lag")) - 1, by = sec][]
  },
  times = 10L
)

#Unit: microseconds
# expr         min          lq         mean      median          uq         max neval cld
# copy     282.619     294.512     325.3261     298.424     350.272     414.983    10  a 
#  zoo 1129601.974 1144346.463 1188484.0653 1162598.499 1194430.395 1337727.279    10   b
#  red    3354.554    3439.095    6135.8794    5002.008    7695.948   11443.595    10  a 

For 100k rows, the Reduce()/shift() approach is more than 200 times faster than zoo::rollapply() (median of about 5 ms versus about 1.16 s).


Apparently, there are different interpretations of the expected result.

To investigate this, a modified data set is used:

test <- data.table(return=c(0.1, 0.11, 0.12, 0.13, 0.14, 0.21, 0.22, 0.23, 0.24, 0.25),
                   sec=c("A", "A", "A", "A", "A", "B", "B", "B", "B", "B"))
test
#    return sec
# 1:   0.10   A
# 2:   0.11   A
# 3:   0.12   A
# 4:   0.13   A
# 5:   0.14   A
# 6:   0.21   B
# 7:   0.22   B
# 8:   0.23   B
# 9:   0.24   B
#10:   0.25   B

Note that the return values vary within each group, unlike in the OP's data set, where the return values are constant within each sec group.

With this, the accepted answer (rollapply()) returns:

test[, momentum := zoo_fun(return, 3), by = sec][]
#    return sec momentum
# 1:   0.10   A       NA
# 2:   0.11   A       NA
# 3:   0.12   A 0.367520
# 4:   0.13   A 0.404816
# 5:   0.14   A 0.442784
# 6:   0.21   B       NA
# 7:   0.22   B       NA
# 8:   0.23   B 0.815726
# 9:   0.24   B 0.860744
#10:   0.25   B 0.906500

Henrik's answer returns:

test[test[ , tail(.I, 3), by = sec]$V1, res := prod(return + 1) - 1, by = sec][]
#    return sec      res
# 1:   0.10   A       NA
# 2:   0.11   A       NA
# 3:   0.12   A 0.442784
# 4:   0.13   A 0.442784
# 5:   0.14   A 0.442784
# 6:   0.21   B       NA
# 7:   0.22   B       NA
# 8:   0.23   B 0.906500
# 9:   0.24   B 0.906500
#10:   0.25   B 0.906500

The Reduce()/shift() solution returns:

test[, momentum := Reduce(`*`, shift(return + 1.0, 0:2, type="lag")) - 1, by = sec][]
#    return sec momentum
# 1:   0.10   A       NA
# 2:   0.11   A       NA
# 3:   0.12   A 0.367520
# 4:   0.13   A 0.404816
# 5:   0.14   A 0.442784
# 6:   0.21   B       NA
# 7:   0.22   B       NA
# 8:   0.23   B 0.815726
# 9:   0.24   B 0.860744
#10:   0.25   B 0.906500
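The three outputs above can also be compared programmatically. The following sketch recomputes all three on the modified data set; the column names `roll`, `red`, and `res` are chosen here for illustration only.

```r
library(data.table)
library(zoo)

test <- data.table(return = c(0.1, 0.11, 0.12, 0.13, 0.14,
                              0.21, 0.22, 0.23, 0.24, 0.25),
                   sec    = rep(c("A", "B"), each = 5L))

# Rolling window of width 3 via zoo::rollapply()
test[, roll := rollapply(return + 1, 3, FUN = prod, fill = NA, align = "right") - 1,
     by = sec]

# Rolling window of width 3 via Reduce()/shift()
test[, red := Reduce(`*`, shift(return + 1, 0:2, type = "lag")) - 1, by = sec]

# One product over the last 3 rows of each group, repeated on those rows
test[test[, tail(.I, 3), by = sec]$V1, res := prod(return + 1) - 1, by = sec]

# The two rolling approaches agree up to floating-point tolerance ...
same_rolling <- isTRUE(all.equal(test$roll, test$red))
# ... while the tail()-based result differs once values vary within a group.
same_tail <- isTRUE(all.equal(test$roll, test$res))
c(same_rolling = same_rolling, same_tail = same_tail)
```

On the OP's original data, where return is constant within each group, all three columns coincide, which is why the difference in interpretation was not visible there.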