Jus*_*tin 10 r data.table
If I have the data.tables DT and neighbors (n in the code below):
set.seed(1)
library(data.table)
DT <- data.table(idx=rep(1:10, each=5), x=rnorm(50), y=letters[1:5], ok=rbinom(50, 1, 0.90))
n <- data.table(y=letters[1:5], y1=letters[c(2:5,1)])
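n is a lookup table. Whenever ok == 0, I want to look up the corresponding y1 in n and use the value of x for that y1 and the given idx. As an example, take row 4 of DT: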
> DT
idx x y ok
1: 1 -0.6264538 a 1
2: 1 0.1836433 b 1
3: 1 -0.8356286 c 1
4: 1 1.5952808 d 0
5: 1 0.3295078 e 1
6: 2 -0.8204684 a 1
The y1 from n for d is e:
> n[y == 'd']
y y1
1: d e
and the idx for row 4 is 1. So I would use:
> DT[idx == 1 & y == 'e', x]
[1] 0.3295078
I would like my output to be a data.table just like DT[ok == 0], with every x value replaced by the x value found for the appropriate y1 from n:
> output
idx x y ok
1: 1 0.3295078 d 0
2: 2 -0.3053884 d 0
3: 3 0.3898432 a 0
4: 5 0.7821363 a 0
5: 7 1.3586800 e 0
6: 8 0.7631757 d 0
I can think of several ways of doing this with base R or with plyr... and maybe it is just late on a Friday... but whatever order of merges this would require in data.table is beyond me!
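Great question. Using the functions from the other answers, and wrapping Blue's answer into a function blue, how about the following? The benchmarks include the time to setkey in all cases.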
red = function() {
    ans = DT[ok==0]
    # Faster than setkey(DT,ok)[J(0)] if the vector scan is just once
    # If lots of lookups to "ok" need to be done, then setkey may be worth it
    # If DT[,ok:=as.integer(ok)] can be done first, then ok==0L slightly faster
    # After extracting ans in the original order of DT, we can now set the key :
    setkey(DT,idx,y)
    setkey(n,y)
    # Now working with the reduced ans ...
    ans[,y1:=n[y,y1,mult="first"]]
    # Add a new column y1 by reference containing the lookup in n
    # mult="first" because we know n's key is unique, for speed (to save looking
    # for groups of matches in n). Future version of data.table won't need this.
    # Also, mult="first" has the advantage of dropping group columns (so we don't
    # need [[2L]]). mult="first"|"last" turns off by-without-by of mult="all".
    ans[,x:=DT[ans[,list(idx,y1)],x,mult="first"]]
    # Changes the contents of ans$x by reference. The ans[,list(idx,y1)] part is
    # how to pick the columns of ans to join to DT's key when they are not the key
    # columns of ans and not the first 1:n columns of ans. There is no need to key
    # ans, especially since that would change ans's order and not strictly answer
    # the question. If idx and y1 were columns 1 and 2 of (unkeyed) ans then we
    # wouldn't need that part, just
    #    ans[,x:=DT[ans,x,mult="first"]]
    # would do (relying on DT having 2 columns in its key). That has the advantage
    # of not copying the idx and y1 columns into a new data.table to pass as the i
    # DT. To save that copy y1 could be moved to column 2 using setcolorder first.
    redans <<- ans
}
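For comparison, here is a minimal sketch of the same two lookups written as update joins with the on= argument. This is not the benchmarked red function above; it assumes a data.table version that supports on= (1.9.6 or later), and the name ans2 is just an illustrative choice. No keys are set, so DT and n keep their original order.

```r
library(data.table)

# Rebuild the example data from the question
set.seed(1)
DT <- data.table(idx = rep(1:10, each = 5), x = rnorm(50),
                 y = letters[1:5], ok = rbinom(50, 1, 0.90))
n <- data.table(y = letters[1:5], y1 = letters[c(2:5, 1)])

# Rows of interest, in the original order of DT
ans2 <- DT[ok == 0]

# Step 1: look up y1 in n by matching on y (update join, by reference)
ans2[n, y1 := i.y1, on = "y"]

# Step 2: replace x with the x of the DT row that has the same idx
# and whose y equals ans2's y1 (names are ans2 columns, values are DT columns)
ans2[DT, x := i.x, on = c(idx = "idx", y1 = "y")]

print(ans2)
```

As with red, the y1 column stays in the result and can be dropped afterwards if it is not wanted.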
crdt(1e5)
origDT = copy(DT)
benchmark(blue={DT=copy(origDT); system.time(blue())},
red={DT=copy(origDT); system.time(red())},
fun={DT=copy(origDT); system.time(fun(DT,n))},
replications=3, order="relative")
  test replications elapsed relative user.self sys.self user.child sys.child
   red            3   1.107    1.000     1.100    0.004          0         0
  blue            3   5.797    5.237     5.660    0.120          0         0
   fun            3   8.255    7.457     8.041    0.184          0         0
crdt(1e6)
[ .. snip .. ]
  test replications elapsed relative user.self sys.self user.child sys.child
   red            3  14.647    1.000    14.613    0.000          0         0
  blue            3  87.589    5.980    87.197    0.124          0         0
   fun            3 197.243   13.466   195.240    0.644          0         0
identical(blueans[,list(idx,x,y,ok,y1)],redans[order(idx,y1)])
# [1] TRUE
The order() is needed inside identical() because red returns its result in the same order as DT[ok==0], whereas blue appears to be ordered by y1 in the case of ties in idx.
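To illustrate why the row order matters for that comparison, here is a small sketch on made-up data (the tables a and b and their values are hypothetical, not taken from the benchmark). It uses all.equal() with check.attributes = FALSE rather than identical(), purely to keep the toy example independent of attributes such as keys.

```r
library(data.table)

# Hypothetical toy tables: same rows, different row order
a <- data.table(idx = c(1L, 1L, 2L),
                y1  = c("e", "b", "a"),
                x   = c(0.3, 0.1, 0.2))
b <- a[order(idx, y1)]   # sorted by idx, ties broken by y1

# Compared as-is, the differing row order counts as a difference
same_as_is <- isTRUE(all.equal(a, b, check.attributes = FALSE))

# Sorting a the same way first makes the two tables comparable
same_ordered <- isTRUE(all.equal(a[order(idx, y1)], b,
                                 check.attributes = FALSE))

print(c(as_is = same_as_is, ordered = same_ordered))
```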
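If y1 is not wanted in the result, it can be removed instantly (regardless of table size) using ans[,y1:=NULL]; that is, this can be included above to produce the exact result asked for in the question, without affecting the timings at all.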
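A minimal sketch of that column removal, on a small made-up table (the values are hypothetical): assigning NULL with := deletes the column by reference, so the table itself is not copied.

```r
library(data.table)

# Hypothetical stand-in for the result table, still carrying the helper column y1
ans <- data.table(idx = 1:3,
                  x   = c(0.1, 0.2, 0.3),
                  y1  = c("a", "b", "c"))

# Drop the helper column by reference
ans[, y1 := NULL]

print(names(ans))
```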