Mic*_*ele 14 r data.table
I need to assign a "second" id that groups some of the values of the original id. Here is my sample data:
dt<-structure(list(id = c("aaaa", "aaaa", "aaas", "aaas", "bbbb", "bbbb"),
period = c("start", "end", "start", "end", "start", "end"),
date = structure(c(15401L, 15401L, 15581L, 15762L, 15430L, 15747L), class = c("IDate", "Date"))),
class = c("data.table", "data.frame"),
.Names = c("id", "period", "date"),
sorted = "id")
> dt
     id period       date
1: aaaa  start 2012-03-02
2: aaaa    end 2012-03-02
3: aaas  start 2012-08-29
4: aaas    end 2013-02-26
5: bbbb  start 2012-03-31
6: bbbb    end 2013-02-11
The values in column id need to be grouped by id2 (members of the same group get the same id2 value) according to this list:
> groups
[[1]]
[1] "aaaa" "aaas"
[[2]]
[1] "bbbb"
I used the code below, which seems to work but gives the following warning:
> dt[, id2 := which(vapply(groups, function(x,y) any(x==y), .BY[[1]], FUN.VALUE=T)), by=id]
Warning message:
In `[.data.table`(dt, , `:=`(id2, which(vapply(groups, function(x, :
Invalid .internal.selfref detected and fixed by taking a copy of the whole table,
so that := can add this new column by reference. At an earlier point, this data.table has
been copied by R (or been created manually using structure() or similar). Avoid key<-,
names<- and attr<- which in R currently (and oddly) may copy the whole data.table. Use
set* syntax instead to avoid copying: setkey(), setnames() and setattr(). Also,
list (DT1,DT2) will copy the entire DT1 and DT2 (R's list() copies named objects),
use reflist() instead if needed (to be implemented). If this message doesn't help,
please report to datatable-help so the root cause can be fixed.
> dt
     id period       date id2
1: aaaa  start 2012-03-02   1
2: aaaa    end 2012-03-02   1
3: aaas  start 2012-08-29   1
4: aaas    end 2013-02-26   1
5: bbbb  start 2012-03-31   2
6: bbbb    end 2013-02-11   2
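Can someone briefly explain the nature of this warning and what implications it has, if any, for the final result? Thanks.
EDIT:
The following code shows where dt is actually created and how it is passed to the function that raises the warning: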
f.main <- function(){
  f2 <- function(x){
    groups <- list(c("aaaa", "aaas"), "bbbb") # actually generated depending on the similarity between values of x$id
    x <- x[, id2 := which(vapply(groups, function(x,y) any(x==y), .BY[[1]], FUN.VALUE=T)), by=id]
    return(x)
  }
  x <- f1()
  if(!is.null(x[["res"]])){
    x <- f2(x[["res"]])
    return(x)
  } else {
    # something else
  }
}
f1 <- function(){
  dt <- data.table(id = c("aaaa", "aaaa", "aaas", "aaas", "bbbb", "bbbb"),
                   period = c("start", "end", "start", "end", "start", "end"),
                   date = structure(c(15401L, 15401L, 15581L, 15762L, 15430L, 15747L), class = c("IDate", "Date")))
  return(list(res=dt, other_results=""))
}
> f.main()
     id period       date id2
1: aaaa  start 2012-03-02   1
2: aaaa    end 2012-03-02   1
3: aaas  start 2012-08-29   1
4: aaas    end 2013-02-26   1
5: bbbb  start 2012-03-31   2
6: bbbb    end 2013-02-11   2
Warning message:
In `[.data.table`(x, , `:=`(id2, which(vapply(groups, function(x, :
Invalid .internal.selfref detected and fixed by taking a copy of the whole table,
so that := can add this new column by reference. At an earlier point, this data.table
has been copied by R (or been created manually using structure() or similar).
Avoid key<-, names<- and attr<- which in R currently (and oddly) may copy the whole
data.table. Use set* syntax instead to avoid copying: setkey(), setnames() and setattr().
Also, list(DT1,DT2) will copy the entire DT1 and DT2 (R's list() copies named objects),
use reflist() instead if needed (to be implemented). If this message doesn't help,
please report to datatable-help so the root cause can be fixed.
Rol*_*and 12
Yes, the problem is the list. Here is a simple example:
DT <- data.table(1:5)
mylist1 <- list(DT,"a")
mylist1[[1]][,id:=.I]
#warning
mylist2 <- list(data.table(1:5),"a")
mylist2[[1]][,id:=.I]
#no warning
You should avoid copying a data.table into a list (to be safe, I would avoid putting a DT in a list at all). Try this:
f1 <- function(){
  mylist <- list(res=data.table(id = c("aaaa", "aaaa", "aaas", "aaas", "bbbb", "bbbb"),
                                period = c("start", "end", "start", "end", "start", "end"),
                                date = structure(c(15401L, 15401L, 15581L, 15762L, 15430L, 15747L), class = c("IDate", "Date"))))
  other_results <- ""
  mylist$other_results <- other_results
  mylist
}
Aru*_*run 11
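You can take a "shallow copy" when creating the list, so that 1) you do not make a full in-memory copy (speed is not affected) and 2) you do not get the internal self-reference warning (thanks to @mnel for this trick).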
set.seed(45)
ss <- function() {
  tt <- sample(1:10, 1e6, replace=TRUE)
}
tt <- replicate(100, ss(), simplify=FALSE)
tt <- as.data.table(tt)
system.time( {
  ll <- list(d1 = { # shallow copy here...
    data.table:::settruelength(tt, 0)
    invisible(alloc.col(tt))
  }, "a")
})
user system elapsed
0 0 0
> system.time(tt[, bla := 2])
user system elapsed
0.012 0.000 0.013
> system.time(ll[[1]][, bla :=2 ])
user system elapsed
0.008 0.000 0.010
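So you do not compromise on speed, and you avoid both the warning and the full copy that follows it. Hope this helps.
"Invalid .internal.selfref detected and fixed by taking a copy ..."
There is no need to take a copy when assigning id2 in f2(); you can add the column directly by changing: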
# From:
x <- x[, id2 := which(vapply(groups, function(x,y) any(x==y), .BY[[1]], FUN.VALUE=T)), by=id]
# To something along the lines of:
x$id2 <- findInterval( match( x$id, unlist(groups)), cumsum(c(0,sapply(groups, length)))+1)
You can then keep using the data.table 'x' as usual, without the warning being raised.
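The same id-to-group mapping can also be written as a named lookup vector. This is only a minimal base-R sketch, assuming the `groups` list from the question; `lookup` and `id2` are illustrative names that do not appear in the original code:

```r
groups <- list(c("aaaa", "aaas"), "bbbb")
id <- c("aaaa", "aaaa", "aaas", "aaas", "bbbb", "bbbb")

# One entry per distinct id, named by the id and holding the index of its group.
# lengths() requires R >= 3.2.0.
lookup <- setNames(rep(seq_along(groups), lengths(groups)), unlist(groups))

# Vectorised lookup; an id that is missing from `groups` would yield NA.
id2 <- unname(lookup[id])
```

The resulting vector can be assigned to the table in the same way as the findInterval() version. Whether it is faster on a given table would need to be measured.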
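Alternatively, to simply suppress the warning, you can wrap the f2(x[["res"]]) call in suppressWarnings().
Even on a small table the performance difference is substantial (roughly 2x in the median timings):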
Performance Comparison:
Unit: milliseconds
                       expr      min       lq   median       uq      max neval
                   f.main() 2.896716 2.982045 3.034334 3.137628 7.542367   100
 suppressWarnings(f.main()) 3.005142 3.081811 3.133137 3.210126 5.363575   100
            f.main.direct() 1.279303 1.384521 1.413713 1.486853 5.684363   100