R 使用 data.table 更快地通过整个组创建滞后的方法

Sau*_*abh 3 r data.table

我有一个简单的 data.table 如下 -

  ID = c(rep("A", 1000), rep("B", 1000), rep("C", 1000), rep("D", 1000))
  val = c("a", "a", "a", "b", "b", "c", "c","d","d","d","d","e","e","f","f","g","g","g","g","g")

  dt = data.table(ID, val)
Run Code Online (Sandbox Code Playgroud)

val我想向此 data.table 添加一个新列,该列将具有by group的滞后ID

这是预期的输出

> head(dt, 20)
     ID val val_lag
 1:  A   a    <NA>
 2:  A   a    <NA>
 3:  A   a    <NA>
 4:  A   b       a
 5:  A   b       a
 6:  A   c       b
 7:  A   c       b
 8:  A   d       c
 9:  A   d       c
10:  A   d       c
11:  A   d       c
12:  A   e       d
13:  A   e       d
14:  A   f       e
15:  A   f       e
16:  A   g       f
17:  A   g       f
18:  A   g       f
19:  A   g       f
20:  A   g       f
Run Code Online (Sandbox Code Playgroud)

我当前使用的解决方案是 -

dt[, val_lag := with(rle(val), rep(c(NA, head(values, -1)), lengths)), by = ID]
Run Code Online (Sandbox Code Playgroud)

然而,这个解决方案在实际数据集上速度非常慢,该数据集非常大并且有数百万行。有没有更快的方法来解决这个问题?

以下是本文讨论的所有方法的性能结果 -

  microbenchmark::microbenchmark(rles = dt[, val_lag1 := with(rle(val), rep(c(NA, head(values, -1)), lengths)), by = ID],
                                 chinsoon = dt[, val_lag := shift(val)[nafill(replace(seq.int(.N), rowid(rleid(val)) > 1L, NA_integer_), "locf")], by = ID],
                                 TiC = dt[, val_lag3 := c(NA,rle(val)$values)[cumsum(c(0,head(val,-1)!=tail(val,-1)))+1], by = ID],
                                 times = 1000
  )

Unit: milliseconds
     expr      min       lq     mean   median       uq      max neval cld
     rles 1.549548 1.781014 2.750187 2.096805 2.743668 46.65326  1000  a 
 chinsoon 1.766827 2.060233 3.059109 2.379477 3.077080 67.16040  1000  a 
      TiC 1.986808 2.226933 3.472451 2.624236 3.397165 60.67802  1000   b
Run Code Online (Sandbox Code Playgroud)

谢谢!

chi*_*n12 5

这是另一种选择:

dt[, val_lag := shift(val)[nafill(replace(seq.int(.N), rowid(rleid(val)) > 1L, NA_integer_), "locf")]]
Run Code Online (Sandbox Code Playgroud)

计时代码:

library(data.table)
set.seed(0L)
nr <- 1e6
ng <- 1e5
dt = data.table(ID=sample(ng, nr, TRUE), val=as.character(sample(nr, nr, TRUE)))
setorder(dt, ID, val)

microbenchmark::microbenchmark(times = 3L,
    opt = dt[, val_lag := shift(val)[nafill(replace(seq.int(.N), rowid(rleid(val)) > 1L, NA_integer_), "locf")]],
    rle = dt[, val_lag := with(rle(val), rep(c(NA, head(values, -1)), lengths)), by = ID]
)
    
Run Code Online (Sandbox Code Playgroud)

时间安排:

Unit: milliseconds
 expr       min        lq      mean    median        uq       max neval
  opt  133.8857  159.8922  265.2029  185.8987  330.8614  475.8242     3
  rle 3097.6005 3123.5422 3193.2654 3149.4839 3241.0978 3332.7117     3
Run Code Online (Sandbox Code Playgroud)

编辑:添加了正在发生的事情的示例:

index         |    1    2    3    4    5    6    7    8    9   10
value         |    a    a    a    b    b    c    c    c    d    d

shifted (s)   |   NA    a    a    a    b    b    c    c    c    d
rowid+rleid   |    1    2    3    1    2    1    2    3    1    2
replace       |    1   NA   NA    4   NA    6   NA   NA    9   NA <In ?nafill, Only double and integer data types are currently supported. Hence, nafill the indices before accessing>
nafill        |    1    1    1    4    4    6    6    6    9    9
using s above | s[1] s[1] s[1] s[4] s[4] s[6] s[6] s[6] s[9] s[9]
Run Code Online (Sandbox Code Playgroud)