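I am using data.tables in R. The data has multiple records per person, and I am trying to find the nth record for each person using data.table's .SD. If I specify N as an integer literal, the new data.table is created almost instantly. But if N is a variable (as it might be inside a function), the code takes about 700 times longer. For large data sets, this is a problem. Is this a known issue, and is there any way to speed it up?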
library(data.table)
library(microbenchmark)
set.seed(102938)
dd <- data.table(id = rep(1:10000, each = 10), seq = seq(1:10))
setkey(dd, id)
N <- 2
microbenchmark(dd[,.SD[2], keyby = id],
dd[,.SD[N], keyby = id],
times = 5)
#> Unit: microseconds
#> expr min lq mean median
#> dd[, .SD[2], keyby = id] 886.269 1584.513 2904.497 1851.356
#> dd[, .SD[N], keyby = id] 770822.875 810131.784 870418.622 903956.708
#> uq max neval
#> 1997.134 8203.214 5
#> 912223.026 954958.718 5
It may be better to subset using the row index (.I) instead of .SD
dd[dd[, .I[N], keyby = id]$V1]
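Since the question mentions N being a variable inside a function, the .I approach can be wrapped as in the sketch below. The helper name `nth_row_per_id` is hypothetical (it is not part of data.table), and the grouping column `id` is hard-coded to match the example data. It also assumes the table has no column named `n`, which would otherwise shadow the argument inside `j`.

```r
library(data.table)

# Hypothetical helper, not part of data.table: n-th row of each id group.
nth_row_per_id <- function(dt, n) {
  # .I holds row numbers of dt; .I[n] picks the n-th row number in each group.
  idx <- dt[, .I[n], keyby = id]$V1
  # Groups with fewer than n rows give NA; drop those before subsetting.
  dt[idx[!is.na(idx)]]
}

dd <- data.table(id = rep(1:10000, each = 10), seq = rep(1:10, times = 10000))
setkey(dd, id)
res <- nth_row_per_id(dd, 2)
```

Unlike `.SD[N]`, which returns an all-NA row for a group shorter than N, this sketch drops such groups.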
- Benchmarks
microbenchmark(dd[,.SD[2], keyby = id],
dd[dd[,.I[N], keyby = id]$V1],
times = 5)
#Unit: milliseconds
# expr min lq mean median uq max neval
# dd[, .SD[2], keyby = id] 1.253097 1.343862 2.796684 1.352426 1.400910 8.633126 5
# dd[dd[, .I[N], keyby = id]$V1] 5.082752 5.383201 5.991076 5.866084 6.488898 7.134443 5
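Using .I with a variable is a big improvement over .SD[N], but there is still a performance hit compared with .SD[2], which would be the search time for finding the variable 'N' in the global environment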
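Another option, sketched here as an assumption rather than something from the original answer, is to splice the value of N into the call before it is evaluated, so that data.table receives `.SD[2]` as `j` instead of `.SD[N]`. This uses only base R's `substitute()` and `eval()`. Whether it recovers the fast path depends on the data.table version, which `verbose = TRUE` can confirm.

```r
library(data.table)

dd <- data.table(id = rep(1:10000, each = 10), seq = rep(1:10, times = 10000))
setkey(dd, id)
N <- 2

# Build the call with the value of N spliced in place of the placeholder n,
# so that j is the literal expression .SD[2].
call_with_literal <- substitute(dd[, .SD[n], keyby = id], list(n = N))
print(call_with_literal)

# Evaluate the constructed call, and compute the .I version for comparison.
res_literal <- eval(call_with_literal)
res_index <- dd[dd[, .I[N], keyby = id]$V1]
```

Both results should contain the same rows, so the choice between them comes down to timing on the data at hand.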
Internally, optimizations play a role in the timings. If we set the option to 0, all optimizations are turned off
options(datatable.optimize = 0L)
microbenchmark(dd[,.SD[2], keyby = id],
dd[dd[,.I[N], keyby = id]$V1],
times = 5)
#Unit: milliseconds
# expr min lq mean median uq max neval
# dd[, .SD[2], keyby = id] 660.612463 701.573252 761.51163 776.780341 785.940196 882.651875 5
#dd[dd[, .I[N], keyby = id]$V1] 3.860492 4.140469 5.05796 4.762518 5.342907 7.183416 5
Now, the .I method is faster
Changing to 1
options(datatable.optimize = 1L)
microbenchmark(dd[,.SD[2], keyby = id],
dd[dd[,.I[N], keyby = id]$V1],
times = 5)
#Unit: milliseconds
# expr min lq mean median uq max neval
# dd[, .SD[2], keyby = id] 4.934761 5.109478 5.496449 5.414477 5.868185 6.155342 5
# dd[dd[, .I[N], keyby = id]$V1] 3.923388 3.966413 4.325268 4.379745 4.494367 4.862426 5
With 2, which adds GForce optimization (also active under the default setting)
options(datatable.optimize = 2L)
microbenchmark(dd[,.SD[2], keyby = id],
dd[dd[,.I[N], keyby = id]$V1],
times = 5)
#Unit: milliseconds
# expr min lq mean median uq max neval
# dd[, .SD[2], keyby = id] 1.113463 1.179071 1.245787 1.205013 1.337216 1.394174 5
# dd[dd[, .I[N], keyby = id]$V1] 4.339619 4.523917 4.774221 4.833648 5.017755 5.156166 5
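The benchmarks above change a global option and leave it changed. A minimal sketch of running the same comparison while restoring the previous setting afterwards is shown below. It uses `system.time()` instead of microbenchmark to keep the example dependency-free, so the timings are rougher.

```r
library(data.table)

dd <- data.table(id = rep(1:10000, each = 10), seq = rep(1:10, times = 10000))
setkey(dd, id)

# options() returns the previous value, which is kept for restoring later.
old <- options(datatable.optimize = 0L)
for (level in 0:2) {
  options(datatable.optimize = level)
  elapsed <- system.time(dd[, .SD[2], keyby = id])[["elapsed"]]
  cat("datatable.optimize =", level, "elapsed:", elapsed, "s\n")
}
# Put the option back to whatever it was before the loop.
options(old)
```

Restoring the option matters because every later data.table call in the session would otherwise run at the last level set in the loop.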
The optimizations happening behind the scenes can be checked with verbose = TRUE
out1 <- dd[,.SD[2], keyby = id, verbose = TRUE]
#Finding groups using forderv ... 0.017s elapsed (0.020s cpu)
#Finding group sizes from the positions (can be avoided to save RAM) ... 0.022s elapsed (0.131s cpu)
#lapply optimization changed j from '.SD[2]' to 'list(seq[2])'
#GForce optimized j to 'list(`g[`(seq, 2))'
#Making each group and running j (GForce TRUE) ... 0.027s elapsed (0.159s cpu)
out2 <- dd[dd[,.I[N], keyby = id, verbose = TRUE]$V1, verbose = TRUE]
#Detected that j uses these columns: <none>
#Finding groups using forderv ... 0.023s elapsed (0.026s cpu)
#Finding group sizes from the positions (can be avoided to save RAM) ... 0.022s elapsed (0.128s cpu)
#lapply optimization is on, j unchanged as '.I[N]'
#GForce is on, left j unchanged
#Old mean optimization is on, left j unchanged.
#Making each group and running j (GForce FALSE) ...
# memcpy contiguous groups took 0.052s for 10000 groups
# eval(j) took 0.065s for 10000 calls #######
#0.068s elapsed (0.388s cpu)
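The verbose log can also be inspected programmatically. The sketch below captures it with base R's `capture.output()` and searches for the "GForce optimized" line seen in the log above. It assumes the verbose messages are written to standard output and that their wording matches the log above, both of which may differ between data.table versions.

```r
library(data.table)

dd <- data.table(id = rep(1:10000, each = 10), seq = rep(1:10, times = 10000))
setkey(dd, id)
N <- 2

# Capture the verbose log of each call; invisible() keeps the result itself
# from being printed into the captured text.
log_literal <- capture.output(
  invisible(dd[, .SD[2], keyby = id, verbose = TRUE])
)
log_variable <- capture.output(
  invisible(dd[, .I[N], keyby = id, verbose = TRUE])
)

# TRUE if the log reports that j was rewritten by GForce.
gforce_literal <- any(grepl("GForce optimized", log_literal, fixed = TRUE))
gforce_variable <- any(grepl("GForce optimized", log_variable, fixed = TRUE))
cat("literal index:", gforce_literal, " variable index:", gforce_variable, "\n")
```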