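In my work I use several tables (customer details, transaction records, etc.). Some of them are very large (millions of rows), so I recently switched to the data.table package (thanks, Matthew). Others, however, are quite small (a few hundred rows and 4 or 5 columns) and are accessed many times. That made me think about the overhead of [.data.table when retrieving data, as opposed to set()ting a value, which is already clearly described in ?set: regardless of the size of the table, one item is set in around 2 microseconds (depending on the CPU).
However, there does not seem to be an equivalent of set for getting a value from a data.table when the exact row and column are known: a kind of loopable [.data.table.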
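To make the idea of a "loopable" getter concrete, here is a minimal sketch. It relies only on the fact that a data.table is a list of column vectors, so base R list extraction (`[[` or `.subset2`) can pull the column and then index the row without going through `[.data.table`. The helper name `get_cell` is hypothetical and not part of data.table, and whether this is actually faster on a given machine should be checked with microbenchmark rather than assumed.

```r
library(data.table)

m  <- matrix(1, nrow = 100000, ncol = 100)
DT <- as.data.table(m)  # same data as in ?set

# Hypothetical helper (not a data.table function):
# i is a row number, j is a column name or column position.
get_cell <- function(dt, i, j) {
  # .subset2() is base R's low-level `[[`; it extracts the column
  # as a plain vector, which is then indexed by row.
  .subset2(dt, j)[i]
}

v1 <- DT[["V1"]][3450L]          # plain list extraction, then vector indexing
v2 <- get_cell(DT, 3450L, "V1")  # column given by name
v3 <- get_cell(DT, 3450L, 1L)    # column given by position
```

Both forms return a length-one vector, so they can be dropped into a loop where `DT[i, V1]` would otherwise be called repeatedly.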
library(data.table)
library(microbenchmark)
m = matrix(1,nrow=100000,ncol=100)
DF = as.data.frame(m)
DT = as.data.table(m) # same data used in ?set
> microbenchmark(DF[3450,1] , DT[3450, V1], times=1000) # much more overhead in DT
Unit: microseconds
         expr     min      lq   median      uq      max neval
  DF[3450, 1]  32.745  36.166  40.5645  43.497  193.533  1000
 DT[3450, V1] 788.791 803.453 813.2270 832.287 5826.982  1000
> microbenchmark(DF$V1[3450], DT[3450, 1, with=F], times=1000) # using atomic vector …
I want to assign a matrix to a multi-column subset of a data.table, but the matrix ends up being treated as a single column vector. For example,
dt1 <- data.table(a1=rnorm(5), a2=rnorm(5), a3=rnorm(5))
m1 <- matrix(rnorm(10), ncol=2)
dt1[,c("a1","a2")] <- m1
Warning messages:
1: In `[<-.data.table`(`*tmp*`, , c("a1", "a2"), value = c(-0.308851784175091, :
2 column matrix RHS of := will be treated as one vector
2: In `[<-.data.table`(`*tmp*`, , c("a1", "a2"), value = c(-0.308851784175091, :
Supplied 10 items to be assigned to 5 items of column 'a1' (5 unused)
3: In `[<-.data.table`(`*tmp*`, , c("a1", "a2"), value = c(-0.308851784175091, :
2 column matrix RHS of := will …
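The warnings above say that the matrix on the right-hand side is being flattened into one vector. A minimal sketch of one way around this, assuming the goal is for column k of m1 to end up in the k-th named column: convert the matrix to a list of columns first (here with as.data.table) and assign with `:=`, which accepts a list with one element per target column. This is one possible approach, not necessarily the only or the recommended one.

```r
library(data.table)

dt1 <- data.table(a1 = rnorm(5), a2 = rnorm(5), a3 = rnorm(5))
m1  <- matrix(rnorm(10), ncol = 2)
a3_before <- dt1$a3  # kept only to show that a3 is left untouched

# as.data.table(m1) turns the 5 x 2 matrix into a list of two columns,
# so := assigns one matrix column to each named target column.
dt1[, c("a1", "a2") := as.data.table(m1)]
```

After this assignment a1 holds the first column of m1, a2 holds the second, and a3 is unchanged.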