Merging a data.table with itself after a reference lookup

Jus*_*tin 10 r data.table

If I have the data.tables DT and n (the neighbors lookup):

set.seed(1)
library(data.table)
DT <- data.table(idx=rep(1:10, each=5), x=rnorm(50), y=letters[1:5], ok=rbinom(50, 1, 0.90))
n <- data.table(y=letters[1:5], y1=letters[c(2:5,1)])

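n is a lookup table. Whenever ok == 0, I want to look up the corresponding y1 in n and use the value of x for that y1 and the given idx. For example, take row 4 of DT: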

> DT
   idx          x y ok
1:   1 -0.6264538 a  1
2:   1  0.1836433 b  1
3:   1 -0.8356286 c  1
4:   1  1.5952808 d  0
5:   1  0.3295078 e  1
6:   2 -0.8204684 a  1

The y1 from n for d is e:

> n[y == 'd']
   y y1
1: d  e

and the idx for row 4 is 1. So I would use:

> DT[idx == 1 & y == 'e', x]
[1] 0.3295078

I'd like my output to be a data.table just like DT[ok == 0], with every x value replaced by the x value found via the appropriate n[['y1']] lookup:

> output
   idx          x y ok
1:   1  0.3295078 d  0
2:   2 -0.3053884 d  0
3:   3  0.3898432 a  0
4:   5  0.7821363 a  0
5:   7  1.3586800 e  0
6:   8  0.7631757 d  0

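I can think of a few ways to do this with base R or with plyr... and maybe it's just late on a Friday... but whatever sequence of merges this would take in data.table is beyond me!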

Mat*_*wle 8

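Great question. Using the functions from the other answers, and wrapping Blue's answer into a function blue, how about the following? The benchmarks include the time to setkey in all cases.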

red = function() {
    ans = DT[ok==0]
      # Faster than setkey(DT,ok)[J(0)] if the vector scan is just once
      # If lots of lookups to "ok" need to be done, then setkey may be worth it
      # If DT[,ok:=as.integer(ok)] can be done first, then ok==0L slightly faster

    # After extracting ans in the original order of DT, we can now set the key :
    setkey(DT,idx,y)
    setkey(n,y)

    # Now working with the reduced ans ...

    ans[,y1:=n[y,y1,mult="first"]]
    # Add a new column y1 by reference containing the lookup in n
    # mult="first" because we know n's key is unique, for speed (to save looking
    # for groups of matches in n). Future version of data.table won't need this.
    # Also, mult="first" has the advantage of dropping group columns (so we don't
    # need [[2L]]). mult="first"|"last" turns off by-without-by of mult="all".

    ans[,x:=DT[ans[,list(idx,y1)],x,mult="first"]]
    # Changes the contents of ans$x by reference. The ans[,list(idx,y1)] part is
    # how to pick the columns of ans to join to DT's key when they are not the key
    # columns of ans and not the first 1:n columns of ans. There is no need to key
    # ans, especially since that would change ans's order and not strictly answer
    # the question. If idx and y1 were columns 1 and 2 of (unkeyed) ans then we
    # wouldn't need that part, just
    #    ans[,x:=DT[ans,x,mult="first"]]
    # would do (relying on DT having 2 columns in its key). That has the advantage
    # of not copying the idx and y1 columns into a new data.table to pass as the i
    # DT. To save that copy y1 could be moved to column 2 using setcolorder first.

    redans <<- ans
    }
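
The last comment in red mentions that the ans[,list(idx,y1)] copy can be avoided by moving the join columns to the front with setcolorder first. A minimal sketch of that variant on the question's small example, assuming a current data.table release; it is untimed and only meant to show the shape of the call:

```r
library(data.table)

# Small example data from the question
set.seed(1)
DT <- data.table(idx = rep(1:10, each = 5), x = rnorm(50),
                 y = letters[1:5], ok = rbinom(50, 1, 0.90))
n <- data.table(y = letters[1:5], y1 = letters[c(2:5, 1)])

ans <- DT[ok == 0]            # rows to fix, still in DT's original order
setkey(DT, idx, y)
setkey(n, y)

# Look up y1 in n and add it to ans by reference
ans[, y1 := n[y, y1, mult = "first"]]

# Move the join columns to positions 1 and 2 of (unkeyed) ans, so that
# ans itself can be passed as i and is matched against DT's 2-column key
setcolorder(ans, c("idx", "y1", "x", "y", "ok"))

# In j, x refers to DT's column x; the result overwrites ans$x by reference
ans[, x := DT[ans, x, mult = "first"]]
print(ans)
```

Note that the column order of ans changes as a side effect; a second setcolorder call would restore it if the original layout matters.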


crdt(1e5)
origDT = copy(DT)
benchmark(blue={DT=copy(origDT); system.time(blue())},
          red={DT=copy(origDT); system.time(red())},
          fun={DT=copy(origDT); system.time(fun(DT,n))},
          replications=3, order="relative")

test replications elapsed relative user.self sys.self user.child sys.child
 red            3   1.107    1.000     1.100    0.004          0         0
blue            3   5.797    5.237     5.660    0.120          0         0
 fun            3   8.255    7.457     8.041    0.184          0         0

crdt(1e6)
[ .. snip .. ]
test replications elapsed relative user.self sys.self user.child sys.child
 red            3  14.647    1.000    14.613    0.000          0         0
blue            3  87.589    5.980    87.197    0.124          0         0
 fun            3 197.243   13.466   195.240    0.644          0         0

identical(blueans[,list(idx,x,y,ok,y1)],redans[order(idx,y1)])
# [1] TRUE

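The order is needed inside identical because red returns the result in the same order as DT[ok==0], whereas blue appears to be ordered by y1 in the case of ties in idx.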

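If y1 is not wanted in the result, it can be removed instantly (regardless of table size) using ans[,y1:=NULL]; i.e., this can be included above to produce the exact result requested in the question, without affecting the timings.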
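
For reference, the same steps can be run directly on the small example from the question. This is a sketch rather than a benchmark: it assumes a current data.table release and repeats the body of red outside a function, with the y1 column dropped at the end.

```r
library(data.table)

# Small example data from the question
set.seed(1)
DT <- data.table(idx = rep(1:10, each = 5), x = rnorm(50),
                 y = letters[1:5], ok = rbinom(50, 1, 0.90))
n <- data.table(y = letters[1:5], y1 = letters[c(2:5, 1)])

ans <- DT[ok == 0]            # extract first, so DT's original order is kept
setkey(DT, idx, y)
setkey(n, y)

# Step 1: add the looked-up neighbour y1 by reference
ans[, y1 := n[y, y1, mult = "first"]]

# Step 2: replace x with the x found in DT at (idx, y1)
ans[, x := DT[ans[, list(idx, y1)], x, mult = "first"]]

# Step 3: drop the helper column by reference
ans[, y1 := NULL]
print(ans)
```

The result has the columns idx, x, y and ok, in the row order of DT[ok == 0].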

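  • @BlueMagister Great, that's all there is to it. To answer the last part: it's because `ans[,list(idx,y1)]` is evaluated first, and its result is passed as the `i` of the outer `DT[...]`. (2 upvotes)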
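
The evaluation order described in this comment can be checked on a toy case. The tables below are hypothetical and exist only for illustration; the sketch assumes a current data.table release and compares the nested call with an explicit two-step version.

```r
library(data.table)

# Hypothetical toy tables, not taken from the question
DT  <- data.table(idx = c(1L, 1L, 2L), y = c("a", "b", "a"), x = c(10, 20, 30))
ans <- data.table(y = c("a", "b"), idx = c(2L, 1L), y1 = c("a", "b"))
setkey(DT, idx, y)

# Two-step version: build the i table first, then join it to DT's key
i_tbl <- ans[, list(idx, y1)]
split <- DT[i_tbl, x, mult = "first"]

# Nested version: the inner call is evaluated first and its result
# becomes the i of the outer DT[...]
nested <- DT[ans[, list(idx, y1)], x, mult = "first"]

print(identical(nested, split))
```

Both forms pick the idx and y1 columns of ans and match them, by position, against the (idx, y) key of DT.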