Alg*_*man 13 join r data.table
使用data.table,我们可以使用 将一个数据集中的值与另一个数据集中的最近值连接起来roll = "nearest"。一些示例数据:
dt1 <- data.table(x = c(15,101), id1 = c("x", "y"))
dt2 <- data.table(x = c(10,50,100,200), id2 = c("a","b","c","d"))
Run Code Online (Sandbox Code Playgroud)
使用roll = "nearest",我可以将“dt1”中的每个“x”与 dt2 中最接近的“x”连接起来:
dt2[dt1, roll = "nearest", on = "x"]
# x id2 id1
# 1: 15 a x
# 2: 101 c y
Run Code Online (Sandbox Code Playgroud)
例如,x = 15在 'dt1' 中,x'dt2' 中最接近的值是x = 10,我们得到相应的 'id2',即"a"。
但是,如果不是获得一个最近的值,而是想要获得n 个最近的值呢?例如,如果我想要2 个最接近的 x 值,结果将是:
x id2 id1 roll
1: 15 a x nr1
2: 15 b x nr2
3: 101 c y nr1
4: 101 b y nr2
Run Code Online (Sandbox Code Playgroud)
(“nr”代表“最近的”)
我想要一种可以应用于任何“n”的通用方法(例如 2 个最近的点、3 个最近的点等)。
编辑 我想知道是否也可以将此应用于多列连接,其中连接将在前一列上匹配,然后在最后一个连接列上获得最近的连接。例如:
dt1 <- data.table(group=c(1,2), x=(c(15,101)), id1=c("x","y"))
dt2 <- data.table(group=c(1,2,2,3), x=c(10,50,100,200),id2=c("a","b","c","d"))
Run Code Online (Sandbox Code Playgroud)
如果我加入on=c("group","x"),加入将首先匹配“组”,然后在“x”上获得最近的匹配,所以我希望结果是这样的:
x group id2 id1 roll
1: 15 1 a x nr1
2: 101 2 c y nr1
3: 101 2 b y nr2
Run Code Online (Sandbox Code Playgroud)
这是一些非常原始的东西(我们逐行进行):
n <- 2L
sen <- 1L:n
for (i in 1:nrow(dt1)) {
set(dt1, i, j = "nearest", list(which(frank(abs(dt1$x[i] - dt2$x)) %in% sen)))
}
dt1[, .(id1, nearest = unlist(nearest)), by = x
][, id2 := dt2$id2[nearest]
][, roll := paste0("nr", frank(abs(dt2$x[nearest] - x))), by = x][]
# x id1 nearest id2 roll
# 1: 15 x 1 a nr1
# 2: 15 x 2 b nr2
# 3: 101 y 2 b nr2
# 4: 101 y 3 c nr1
Run Code Online (Sandbox Code Playgroud)
干净一点:
dt1[,
{
nrank <- frank(abs(x - dt2$x), ties.method="first")
nearest <- which(nrank %in% sen)
.(x = x, id2 = dt2$id2[nearest], roll = paste0("nr", nrank[nearest]))
},
by = id1] # assumes unique ids.
Run Code Online (Sandbox Code Playgroud)
数据:
dt1 <- data.table(x = c(15, 101), id1 = c("x", "y"))
dt2 <- data.table(x = c(10, 50, 100, 200), id2 = c("a", "b", "c", "d"))
Run Code Online (Sandbox Code Playgroud)
编辑(根据 OP 的建议/编写)加入多个键:
dt1[,
{
g <- group
dt_tmp <- dt2[dt2$group == g]
nrank <- frank(abs(x - dt_tmp$x), ties.method="first")
nearest <- which(nrank %in% sen)
.(x = x, id2 = dt_tmp$id2[nearest], roll = paste0("nr", nrank[nearest]))
},
by = id1]
Run Code Online (Sandbox Code Playgroud)
为更正的顺序进行了编辑。
我不知道这roll=是否会允许 Nearest- n,但这是一个可能的解决方法:
dt1[, id2 := lapply(x, function(z) { r <- head(order(abs(z - dt2$x)), n = 2); dt2[ r, .(id2, nr = order(r)) ]; }) ]
as.data.table(tidyr::unnest(dt1, id2))
# x id1 id2 nr
# 1: 15 x a 1
# 2: 15 x b 2
# 3: 101 y c 2
# 4: 101 y b 1
Run Code Online (Sandbox Code Playgroud)
(我正在使用,tidyr::unnest因为我认为它在这里很合适并且运行良好,并且data.table/#3672仍然打开。)
第二批数据:
dt1 = data.table(x = c(1, 5, 7), id1 = c("x", "y", "z"))
dt2 = data.table(x = c(2, 5, 6, 10), id2 = c(2, 5, 6, 10))
dt1[, id2 := lapply(x, function(z) { r <- head(order(abs(z - dt2$x)), n = 2); dt2[ r, .(id2, nr = order(r)) ]; }) ]
as.data.table(tidyr::unnest(dt1, id2))
# x id1 id2 nr
# 1: 1 x 2 1
# 2: 1 x 5 2
# 3: 5 y 5 1
# 4: 5 y 6 2
# 5: 7 z 6 2
# 6: 7 z 5 1
Run Code Online (Sandbox Code Playgroud)
这是使用没有额外分组键的滚动连接的另一个选项(对我最初天真的交叉连接想法的改进):
#for differentiating rows from both data.tables
dt1[, ID := .I]
dt2[, rn := .I]
#perform rolling join to find closest and
#then retrieve the +-n rows around that index from dt2
n <- 2L
adjacent <- dt2[dt1, on=.(x), roll="nearest", nomatch=0L, by=.EACHI,
c(.(ID=ID, id1=i.id1, val=i.x), dt2[unique(pmin(pmax(0L, seq(x.rn-n, x.rn+n, by=1L)), .N))])][,
(1L) := NULL]
#extract nth nearest
adjacent[order(abs(val-x)), head(.SD, n), keyby=ID]
Run Code Online (Sandbox Code Playgroud)
输出:
ID id1 val x id2 rn
1: 1 x 15 10 a 1
2: 1 x 15 50 b 2
3: 2 y 101 100 c 3
4: 2 y 101 50 b 2
Run Code Online (Sandbox Code Playgroud)
并使用 Henrik 的数据集:
dt1 = data.table(x = c(1, 5, 7), id1 = c("x", "y", "z"))
dt2 = data.table(x = c(2, 5, 6, 10), id2 = c(2, 5, 6, 10))
Run Code Online (Sandbox Code Playgroud)
输出:
ID id1 val x id2 rn
1: 1 x 1 2 2 1
2: 1 x 1 5 5 2
3: 2 y 5 5 5 2
4: 2 y 5 6 6 3
5: 3 z 7 6 6 3
6: 3 z 7 5 5 2
Run Code Online (Sandbox Code Playgroud)
还有 Henrik 的第二个数据集:
dt1 = data.table(x = 3L, id1="x")
dt2 = data.table(x = 1:2, id2=c("a","b"))
Run Code Online (Sandbox Code Playgroud)
输出:
ID id1 val x id2 rn
1: 1 x 3 2 b 2
2: 1 x 3 1 a 1
Run Code Online (Sandbox Code Playgroud)
并且还加入了一个额外的分组键:
dt2[, rn := .I]
#perform rolling join to find closest and
#then retrieve the +-n rows around that index from dt2
n <- 2L
adjacent <- dt2[dt1, on=.(group, x), roll="nearest", by=.EACHI, {
xrn <- unique(pmax(0L, seq(x.rn-n, x.rn+n, by=1L)), .N)
c(.(id1=id1, x1=i.x),
dt2[.(group=i.group, rn=xrn), on=.(group, rn), nomatch=0L])
}][, (1L:2L) := NULL]
#extract nth nearest
adjacent[order(abs(x1-x)), head(.SD, 2L), keyby=id1] #use id1 to identify rows if its unique, otherwise create ID column like prev section
Run Code Online (Sandbox Code Playgroud)
输出:
id1 x1 group x id2 rn
1: x 15 1 10 a 1
2: y 101 2 100 c 3
3: y 101 2 50 b 2
Run Code Online (Sandbox Code Playgroud)
数据:
library(data.table)
dt1 <- data.table(group=c(1,2), x=(c(15,101)), id1=c("x","y"))
dt2 <- data.table(group=c(1,2,2,3), x=c(10,50,100,200), id2=c("a","b","c","d"))
Run Code Online (Sandbox Code Playgroud)