Edit 3:
I created a much shorter example of the memory leak, which I hope makes it easier to reason about what is going on. As the iterations proceed, the Vcells memory use reported by gc() increases steadily, while the memory use reported by tables() stays the same. Somehow, the unlist(.SD) call seems to be responsible. Here it is:
DT = data.table(k = 1:100, g = 1:20, val = rnorm(2e6))
for (i in 1:100){
  tmp = DT[ , unlist(.SD), by = 'k']
…

I'm seeing odd memory usage when using assignment by reference by group in a data.table. Here is a simple example (please excuse the triviality of the example):
library(data.table)
N <- 1e6
dt <- data.table(id=round(rnorm(N)), value=rnorm(N))
gc()
for (i in seq(100)) {
dt[, value := value+1, by="id"]
}
gc()
tables()
This produces the following output:
> gc()
          used (Mb) gc trigger (Mb) max used (Mb)
Ncells  303909 16.3     597831 32.0   407500 21.8
Vcells 2442853 18.7    3260814 24.9  2689450 20.6
> for (i in seq(100)) {
+ dt[, value := value+1, by="id"]
+ }
> gc()
          used (Mb) gc trigger (Mb) max used (Mb)
Ncells  315907 16.9     597831 32.0   407500 21.8
…