Reference problem in data.table after copy

J.P*_*ier 6 r data.table

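I have a complicated problem involving a data.table nested inside another data.table. I was able to reproduce the behavior in the reproducible example below.

Sorry, it is still quite long and takes some time to understand fully, but it is the shortest example I could produce that demonstrates my problem.

Suppose I create the following data.table named data_1, which contains a single column whose elements are of type data.table: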

library(data.table)

set.seed(20200602L)

data_1 <- data.table(
  foo = replicate(5L, {
    data.table(
      bar = lapply(sample(3L, 5L, replace=TRUE), rpois, 1)
    )
  }, simplify=FALSE)
)

data_1[]
##              foo
##  1: <data.table>
##  2: <data.table>
##  3: <data.table>
##  4: <data.table>
##  5: <data.table>

The contents of the foo column can be explored as follows:

data_1[, foo]
##  [[1]]
##       bar
##  1: 4,0,1
##  2:   0,2
##  3: 1,3,2
##  4:   1,1
##  5:     0
##  
##  [[2]]
##     bar
##  1:   2
##  2: 0,3
##  3:   0
##  4: 2,3
##  5: 0,0
##  
##  [[3]]
##       bar
##  1: 0,1,1
##  2: 1,2,1
##  3:   2,1
##  4:     1
##  5:     1
##  
##  [[4]]
##       bar
##  1:     1
##  2:   3,3
##  3:     0
##  4:   2,2
##  5: 0,0,0
##  
##  [[5]]
##     bar
##  1: 0,0
##  2: 0,0
##  3: 0,1
##  4: 2,1
##  5:   0

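Then I want to create a function fun() that adds a column baz to each element of the column foo. This column baz mirrors the list column bar, with each element reversed, as shown below: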

fun <- function(data) {

  data[, .(lapply(foo, function(x) {
    x[, baz:=lapply(bar, function(y) {
      rev(y)
    })]
  }))]

}

Before applying that function to data_1, I copy it into data_2, because I need to keep the original intact.

data_2 <- copy(data_1)

invisible(fun(data_1))

data_1[, foo]
##  [[1]]
##       bar   baz
##  1: 4,0,1 1,0,4
##  2:   0,2   2,0
##  3: 1,3,2 2,3,1
##  4:   1,1   1,1
##  5:     0     0
##  
##  [[2]]
##     bar baz
##  1:   2   2
##  2: 0,3 3,0
##  3:   0   0
##  4: 2,3 3,2
##  5: 0,0 0,0
##  
##  [[3]]
##       bar   baz
##  1: 0,1,1 1,1,0
##  2: 1,2,1 1,2,1
##  3:   2,1   1,2
##  4:     1     1
##  5:     1     1
##  
##  [[4]]
##       bar   baz
##  1:     1     1
##  2:   3,3   3,3
##  3:     0     0
##  4:   2,2   2,2
##  5: 0,0,0 0,0,0
##  
##  [[5]]
##     bar baz
##  1: 0,0 0,0
##  2: 0,0 0,0
##  3: 0,1 1,0
##  4: 2,1 1,2
##  5:   0   0

One can double-check that data_2 is still intact:

data_2[, foo]
##  [[1]]
##       bar
##  1: 4,0,1
##  2:   0,2
##  3: 1,3,2
##  4:   1,1
##  5:     0
##  
##  [[2]]
##     bar
##  1:   2
##  2: 0,3
##  3:   0
##  4: 2,3
##  5: 0,0
##  
##  [[3]]
##       bar
##  1: 0,1,1
##  2: 1,2,1
##  3:   2,1
##  4:     1
##  5:     1
##  
##  [[4]]
##       bar
##  1:     1
##  2:   3,3
##  3:     0
##  4:   2,2
##  5: 0,0,0
##  
##  [[5]]
##     bar
##  1: 0,0
##  2: 0,0
##  3: 0,1
##  4: 2,1
##  5:   0
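
The reason for calling copy() before fun() is that := modifies a data.table by reference, so a plain assignment with <- does not protect the original. A minimal sketch of that difference, using hypothetical tables dt_a, dt_b and dt_c that are not part of the example above:

```r
library(data.table)

dt_a <- data.table(x = 1:3)
dt_b <- dt_a        # plain assignment: another name for the same object
dt_c <- copy(dt_a)  # explicit copy: an independent data.table

# := adds the column in place, without copying dt_a
dt_a[, y := x * 2L]

names(dt_b)  # dt_b refers to the same object as dt_a
names(dt_c)  # dt_c was copied before the update
```

Here dt_b should pick up the new column y while dt_c keeps only x, which is the behavior the copy() call on data_1 relies on at the top level.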

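Up to this point, everything looks fine. But let's say I change my mind and want to apply the function fun() to data_2 as well. I would have expected it to work the same way as it did for data_1. Unfortunately, it does not: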

invisible(fun(data_2))
##  Warning messages:
##  1: In `[.data.table`(x, , `:=`(baz, lapply(bar, function(y) { :
##    Invalid .internal.selfref detected and fixed by taking a (shallow) copy of the data.table so that := can add this new column by reference. At an earlier point, this data.table has been copied by R (or was created manually using structure() or similar). Avoid names<- and attr<- which in R currently (and oddly) may copy the whole data.table. Use set* syntax instead to avoid copying: ?set, ?setnames and ?setattr. If this message doesn't help, please report your use case to the data.table issue tracker so the root cause can be fixed or this message improved.
##  2: In `[.data.table`(x, , `:=`(baz, lapply(bar, function(y) { :
##    Invalid .internal.selfref detected and fixed by taking a (shallow) copy of the data.table so that := can add this new column by reference. At an earlier point, this data.table has been copied by R (or was created manually using structure() or similar). Avoid names<- and attr<- which in R currently (and oddly) may copy the whole data.table. Use set* syntax instead to avoid copying: ?set, ?setnames and ?setattr. If this message doesn't help, please report your use case to the data.table issue tracker so the root cause can be fixed or this message improved.
##  3: In `[.data.table`(x, , `:=`(baz, lapply(bar, function(y) { :
##    Invalid .internal.selfref detected and fixed by taking a (shallow) copy of the data.table so that := can add this new column by reference. At an earlier point, this data.table has been copied by R (or was created manually using structure() or similar). Avoid names<- and attr<- which in R currently (and oddly) may copy the whole data.table. Use set* syntax instead to avoid copying: ?set, ?setnames and ?setattr. If this message doesn't help, please report your use case to the data.table issue tracker so the root cause can be fixed or this message improved.
##  4: In `[.data.table`(x, , `:=`(baz, lapply(bar, function(y) { :
##    Invalid .internal.selfref detected and fixed by taking a (shallow) copy of the data.table so that := can add this new column by reference. At an earlier point, this data.table has been copied by R (or was created manually using structure() or similar). Avoid names<- and attr<- which in R currently (and oddly) may copy the whole data.table. Use set* syntax instead to avoid copying: ?set, ?setnames and ?setattr. If this message doesn't help, please report your use case to the data.table issue tracker so the root cause can be fixed or this message improved.
##  5: In `[.data.table`(x, , `:=`(baz, lapply(bar, function(y) { :
##    Invalid .internal.selfref detected and fixed by taking a (shallow) copy of the data.table so that := can add this new column by reference. At an earlier point, this data.table has been copied by R (or was created manually using structure() or similar). Avoid names<- and attr<- which in R currently (and oddly) may copy the whole data.table. Use set* syntax instead to avoid copying: ?set, ?setnames and ?setattr. If this message doesn't help, please report your use case to the data.table issue tracker so the root cause can be fixed or this message improved.

data_2[, foo]
##  [[1]]
##       bar
##  1: 4,0,1
##  2:   0,2
##  3: 1,3,2
##  4:   1,1
##  5:     0
##  
##  [[2]]
##     bar
##  1:   2
##  2: 0,3
##  3:   0
##  4: 2,3
##  5: 0,0
##  
##  [[3]]
##       bar
##  1: 0,1,1
##  2: 1,2,1
##  3:   2,1
##  4:     1
##  5:     1
##  
##  [[4]]
##       bar
##  1:     1
##  2:   3,3
##  3:     0
##  4:   2,2
##  5: 0,0,0
##  
##  [[5]]
##     bar
##  1: 0,0
##  2: 0,0
##  3: 0,1
##  4: 2,1
##  5:   0

Can someone explain why this happens, and perhaps point me to a way to solve the problem?


For reference

sessionInfo()
##  R version 4.0.0 (2020-04-24)
##  Platform: x86_64-pc-linux-gnu (64-bit)
##  Running under: SUSE Linux Enterprise Server 12 SP5
##  
##  Matrix products: default
##  BLAS:   /apps/R-4.0.0/lib/libRblas.so
##  LAPACK: /apps/R-4.0.0/lib/libRlapack.so
##  
##  locale:
##   [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C               LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8     LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
##   [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                  LC_ADDRESS=C               LC_TELEPHONE=C             LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       
##  
##  attached base packages:
##  [1] stats     graphics  grDevices utils     datasets  methods   base     
##  
##  other attached packages:
##  [1] data.table_1.12.8
##  
##  loaded via a namespace (and not attached):
##  [1] compiler_4.0.0 tools_4.0.0 

Mic*_*ico 5

.internal.selfref is not updated by copy for the constituent data.tables:

all.equal(
  lapply(data_1$foo, attr, '.internal.selfref'), 
  lapply(data_2$foo, attr, '.internal.selfref')
)
# [1] TRUE

It needs to be updated; you can work around the problem by running alloc.col on the copied data.tables:

data_2 = copy(data_1)
# also possible to do lapply(foo, copy), but this should be slower
data_2[ , foo := lapply(foo, alloc.col)]

invisible(fun(data_1))

invisible(fun(data_2))
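
The alternative mentioned in the code comment above, copying each nested table with copy() instead of alloc.col, can be sketched as follows. This is only a sketch: it rebuilds data_1 and fun() from the question so that it runs on its own, it assumes that copy() applied to each nested data.table returns a table with a valid self-reference, and as noted it may be slower than alloc.col.

```r
library(data.table)

set.seed(20200602L)

# Rebuild data_1 as in the question: a list column of data.tables
data_1 <- data.table(
  foo = replicate(5L, {
    data.table(
      bar = lapply(sample(3L, 5L, replace = TRUE), rpois, 1)
    )
  }, simplify = FALSE)
)

# fun() from the question: adds baz (each bar element reversed) by reference
fun <- function(data) {
  data[, .(lapply(foo, function(x) {
    x[, baz := lapply(bar, rev)]
  }))]
}

data_2 <- copy(data_1)
# Copy every nested table individually so each one gets its own self-reference
data_2[, foo := lapply(foo, copy)]

invisible(fun(data_2))

# Check which nested tables gained the baz column
sapply(data_2$foo, function(x) "baz" %in% names(x))
sapply(data_1$foo, function(x) "baz" %in% names(x))
```

If the workaround behaves as described, the first check should report baz in every nested table of data_2, and the second should show data_1 untouched.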