Invalid .internal.selfref in data.table

Mic*_*ele 14 r data.table

I need to assign a "second" id that groups some of the values of the original id. Here is my sample data:

dt<-structure(list(id = c("aaaa", "aaaa", "aaas", "aaas", "bbbb", "bbbb"),
                   period = c("start", "end", "start", "end", "start", "end"),
                   date = structure(c(15401L, 15401L, 15581L, 15762L, 15430L, 15747L), class = c("IDate", "Date"))),
              class = c("data.table", "data.frame"),
              .Names = c("id", "period", "date"),
              sorted = "id")
> dt
     id period       date
1: aaaa  start 2012-03-02
2: aaaa    end 2012-03-02
3: aaas  start 2012-08-29
4: aaas    end 2013-02-26
5: bbbb  start 2012-03-31
6: bbbb    end 2013-02-11

The id values need to be grouped under id2 according to this list (ids in the same list element get the same value):

> groups
[[1]]
[1] "aaaa" "aaas"

[[2]]
[1] "bbbb"

The code below seems to work, but it gives the following warning:

    > dt[, id2 := which(vapply(groups, function(x,y) any(x==y), .BY[[1]], FUN.VALUE=T)), by=id]
    Warning message:
    In `[.data.table`(dt, , `:=`(id2, which(vapply(groups, function(x,  :
      Invalid .internal.selfref detected and fixed by taking a copy of the whole table,
      so that := can add this new column by reference. At an earlier point, this data.table has
      been copied by R (or been created manually using structure() or similar). Avoid key<-,
      names<- and attr<- which in R currently (and oddly) may copy the whole data.table. Use
      set* syntax instead to avoid copying: setkey(), setnames() and setattr(). Also,
      list(DT1,DT2) will copy the entire DT1 and DT2 (R's list() copies named objects),
      use reflist() instead if needed (to be implemented). If this message doesn't help,
      please report to datatable-help so the root cause can be fixed.
    > dt
         id period       date id2
    1: aaaa  start 2012-03-02   1
    2: aaaa    end 2012-03-02   1
    3: aaas  start 2012-08-29   1
    4: aaas    end 2013-02-26   1
    5: bbbb  start 2012-03-31   2
    6: bbbb    end 2013-02-11   2

Can someone briefly explain the nature of this warning and what implications, if any, it has for the final result? Thanks.

EDIT:

The following code shows how dt is actually created and how it is passed to the function that raises the warning:

f.main <- function(){
  f2 <- function(x){
    groups <- list(c("aaaa", "aaas"), "bbbb") # actually generated depending on the similarity between values of x$id
    x <- x[, id2 := which(vapply(groups, function(x,y) any(x==y), .BY[[1]], FUN.VALUE=T)), by=id]
    return(x)
  }
  x <- f1()
  if(!is.null(x[["res"]])){
    x <- f2(x[["res"]])
    return(x)
  } else {
    # something else
  }
}

f1 <- function(){
  dt<-data.table(id = c("aaaa", "aaaa", "aaas", "aaas", "bbbb", "bbbb"),
                 period = c("start", "end", "start", "end", "start", "end"),
                 date = structure(c(15401L, 15401L, 15581L, 15762L, 15430L, 15747L), class = c("IDate", "Date")))
  return(list(res=dt, other_results=""))
}

> f.main()
     id period       date id2
1: aaaa  start 2012-03-02   1
2: aaaa    end 2012-03-02   1
3: aaas  start 2012-08-29   1
4: aaas    end 2013-02-26   1
5: bbbb  start 2012-03-31   2
6: bbbb    end 2013-02-11   2
Warning message:
In `[.data.table`(x, , `:=`(id2, which(vapply(groups, function(x,  :
  Invalid .internal.selfref detected and fixed by taking a copy of the whole table,
so that := can add this new column by reference. At an earlier point, this data.table
has been copied by R (or been created manually using structure() or similar).
Avoid key<-, names<- and attr<- which in R currently (and oddly) may copy the whole
data.table. Use set* syntax instead to avoid copying: setkey(), setnames() and setattr().
Also, list(DT1,DT2) will copy the entire DT1 and DT2 (R's list() copies named objects),
use reflist() instead if needed (to be implemented). If this message doesn't help,
please report to datatable-help so the root cause can be fixed.

Rol*_*and 12

Yes, the problem is the list. Here is a simple example:

DT <- data.table(1:5)
mylist1 <- list(DT,"a")
mylist1[[1]][,id:=.I]
#warning

mylist2 <- list(data.table(1:5),"a")
mylist2[[1]][,id:=.I]
#no warning

You should avoid copying a data.table into a list (to be safe, I would avoid putting a DT in a list at all). Try this:

f1 <- function(){
  mylist <- list(res=data.table(id = c("aaaa", "aaaa", "aaas", "aaas", "bbbb", "bbbb"),
                 period = c("start", "end", "start", "end", "start", "end"),
                 date = structure(c(15401L, 15401L, 15581L, 15762L, 15430L, 15747L), class = c("IDate", "Date"))))
  other_results <- ""
  mylist$other_results <- other_results
  mylist
}
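The warning text also names tables built by hand with structure(), as the first snippet in the question does, so that dt can trigger the warning before any list is involved. Below is a minimal sketch of one way around it, assuming a one-time full copy is acceptable: copy() returns a properly allocated data.table, after which := works by reference. The toy table and its column names are made up for illustration.

```r
library(data.table)

# Hypothetical toy table built by hand, the way pasted dput() output would be:
# it carries the data.table class but no valid internal self-reference.
dt <- structure(list(id = c("a", "b"), v = 1:2),
                row.names = c(NA, -2L),
                class = c("data.table", "data.frame"))

# copy() returns a properly allocated data.table (one full copy, done once).
dt <- copy(dt)

# := can now add the column by reference.
dt[, w := v * 2L]
```

Building the table with data.table() in the first place, as f1() in the question does, avoids this particular cause without any copy.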

  • Of course, but data.tables are usually large, and the goal is to avoid copying. That is one of the main advantages of the package. (2 upvotes)

Aru*_*run 11

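You can make a "shallow copy" when creating the list, so that 1) you do not make a full copy in memory (speed is unaffected) and 2) you do not get the internal selfref warning (thanks to @mnel for this trick).

Create the data: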

set.seed(45)
ss <- function() {
    tt <- sample(1:10, 1e6, replace=TRUE)
}
tt <- replicate(100, ss(), simplify=FALSE)
tt <- as.data.table(tt)

How you should create the list (shallow copy):

system.time( {
    ll <- list(d1 = { # shallow copy here...
        data.table:::settruelength(tt, 0)
        invisible(alloc.col(tt))
    }, "a")
})
   user  system elapsed
      0       0       0
> system.time(tt[, bla := 2])
   user  system elapsed
  0.012   0.000   0.013
> system.time(ll[[1]][, bla :=2 ])
   user  system elapsed
  0.008   0.000   0.010

So you do not compromise on speed, and you do not get the warning followed by a full copy. Hope this helps.


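The*_*ell 6

"Invalid .internal.selfref detected and fixed by taking a copy ..."

There is no need for a copy when assigning id2 in f2(); you can add the column directly by changing: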

# From:

      x <- x[, id2 := which(vapply(groups, function(x,y) any(x==y), .BY[[1]], FUN.VALUE=T)), by=id]

# To something along the lines of:
      x$id2 <- findInterval( match( x$id, unlist(groups)), cumsum(c(0,sapply(groups, length)))+1)

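You can then keep using the data.table 'x' as usual without the warning.

Also, to simply suppress the warning, you can wrap the f2(x[["res"]]) call in suppressWarnings().

Even on a small table the difference is substantial; the direct version runs in roughly half the median time: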

Performance Comparison:
Unit: milliseconds
                       expr      min       lq   median       uq      max neval
                   f.main() 2.896716 2.982045 3.034334 3.137628 7.542367   100
 suppressWarnings(f.main()) 3.005142 3.081811 3.133137 3.210126 5.363575   100
            f.main.direct() 1.279303 1.384521 1.413713 1.486853 5.684363   100
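
Another way to add id2 without the per-group vapply() is an update join against a small lookup table. This is only a sketch and has not been benchmarked against the versions above. It assumes a data.table version that supports the on= argument (1.9.6 or later), and the name lookup is introduced here for illustration.

```r
library(data.table)

dt <- data.table(
  id     = c("aaaa", "aaaa", "aaas", "aaas", "bbbb", "bbbb"),
  period = c("start", "end", "start", "end", "start", "end")
)
groups <- list(c("aaaa", "aaas"), "bbbb")

# One row per id, holding the index of the list element it belongs to.
lookup <- data.table(
  id  = unlist(groups),
  id2 = rep(seq_along(groups), lengths(groups))
)

# Update join: adds id2 to dt by reference, matching rows on id.
dt[lookup, id2 := i.id2, on = "id"]
```

Because the column is added with := on a table created by data.table(), nothing is reassigned with <- and the table is modified in place.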