r2e*_*ans 5 r data.table non-equi-join
I am trying to do a non-equi join in data.table and extract the min/max of the joined values from that join.
library(data.table)
set.seed(42)
dtA <- data.table(id=rep(c("A","B"),each=3), start=rep(1:3, times=2), end=rep(2:4, times=2))
dtB <- data.table(id=rep(c("A","B"),times=20), time=sort(runif(40, 1, 4)))
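I'd like to keep the min/max values of time that fall between start and end (and match on id). Nominally this is just a non-equi join, but I can't find a combination of by=.EACHI or mult="..." that gets me what I want. Instead, the min/max usually do not line up with the range I need. Unfortunately, roll= does not support non-equi ranges.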
dtA[dtB, c("Min", "Max") := .(min(time), max(time)),
on = .(id, start <= time, end > time), mult = "first"]
# id start end Min Max
# <char> <int> <int> <num> <num>
# 1: A 1 2 1.011845 3.966675
# 2: A 2 3 1.011845 3.966675
# 3: A 3 4 1.011845 3.966675
# 4: B 1 2 1.011845 3.966675
# 5: B 2 3 1.011845 3.966675
# 6: B 3 4 1.011845 3.966675
dtA[dtB, c("Min", "Max") := .(min(time), max(time)),
on = .(id, start <= time, end > time), by = .EACHI]
# id start end Min Max
# <char> <int> <int> <num> <num>
# 1: A 1 2 1.858419 1.858419
# 2: A 2 3 2.970977 2.970977
# 3: A 3 4 3.934679 3.934679
# 4: B 1 2 1.766286 1.766286
# 5: B 2 3 2.925237 2.925237
# 6: B 3 4 3.966675 3.966675
The second attempt is the closest ("Max" is correct), but what I am hoping to get is:
id start end Min Max
<char> <num> <int> <num> <num>
1: A 1 2 1.011845 1.858419
2: A 2 3 2.170610 2.970977
3: A 3 4 3.115194 3.934679
4: B 1 2 1.022002 1.766286
5: B 2 3 2.164325 2.925237
6: B 3 4 3.055509 3.966675
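The real problem has around 400K rows of ranges joining against 2Mi rows of values, so I'd rather not fully expand both frames and then manually reduce the result back down to the dimensions of dtA.
(I'm open to collapse suggestions.)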
Flip the join around so that it is B[A], and then assign the result inside A:
dtA[,
c("min","max") := dtB[
dtA,
on=.(id, time >= start, time < end),
.(min=min(x.time), max=max(x.time)),
by=.EACHI][, c("min","max")]
]
dtA
# id start end min max
# <char> <int> <int> <num> <num>
#1: A 1 2 1.011845 1.858419
#2: A 2 3 2.170610 2.970977
#3: A 3 4 3.115194 3.934679
#4: B 1 2 1.022002 1.766286
#5: B 2 3 2.164325 2.925237
#6: B 3 4 3.055509 3.966675
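You can see why the join needs to be flipped: otherwise each .EACHI group ends up being an individual row of B, rather than the set of rows in B that match the conditions in A: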
dtB[dtA, on=.(id, time >= start, time < end), .N, by=.EACHI]
# id time time N
# <char> <int> <int> <int>
#1: A 1 2 5
#2: A 2 3 6
#3: A 3 4 9
#4: B 1 2 4
#5: B 2 3 6
#6: B 3 4 10
dtA[dtB, on=.(id, start <= time, end > time), .N, by=.EACHI][, .(freq=.N), by=N]
# N freq
# <int> <int>
#1: 1 40
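This makes sense in the context of the description in help("special-symbols", package = "data.table"):
Its usage is "by=.EACHI" (or "keyby=.EACHI"), which invokes grouping-by-each-row-of-i
In the DT[i, j, by] logic, dtA therefore needs to be in the i position, so that it supplies the rows for the grouping.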
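As a minimal sketch of that grouping rule on data small enough to check by hand: the tables ranges and values below are hypothetical stand-ins for dtA and dtB (they are not from the question). With by=.EACHI, .N is counted once per row of whichever table sits in i, so only the first form yields one group per range.

```r
library(data.table)

# Hypothetical toy data: two ranges and three time values for a single id
ranges <- data.table(id = "A", start = c(1, 2), end = c(2, 3))
values <- data.table(id = "A", time = c(1.1, 1.5, 2.2))

# ranges in i: one group per range, containing every matching value row
per_range <- values[ranges, on = .(id, time >= start, time < end),
                    .N, by = .EACHI]
per_range$N
# [1] 2 1

# values in i: one group per value row, each matching exactly one range
per_value <- ranges[values, on = .(id, start <= time, end > time),
                    .N, by = .EACHI]
per_value$N
# [1] 1 1 1
```

Any aggregate meant to be computed per range, such as min(x.time) and max(x.time), therefore has to be computed in the first form.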