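I realize that dplyr v0.3 allows you to join on different variables:
left_join(x, y, by = c("a" = "b")) will match x.a to y.b
However, is it possible to join on a combination of variables, or do I have to add a composite key beforehand?
Something like this:
left_join(x, y, by = c("a c" = "b d")) to match the concatenation of [x.a and x.c] to [y.b and y.d]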
dav*_*ers 167
You can pass a named vector of length greater than 1 to the by argument of left_join():
library(dplyr)
# tibble() replaces the deprecated data_frame()
d1 <- tibble(
  x = letters[1:3],
  y = LETTERS[1:3],
  a = rnorm(3)
)
d2 <- tibble(
  x2 = letters[3:1],
  y2 = LETTERS[3:1],
  b = rnorm(3)
)
# match d1$x to d2$x2 and d1$y to d2$y2 at the same time
left_join(d1, d2, by = c("x" = "x2", "y" = "y2"))
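On the question of whether a composite key has to be built first: the sketch below suggests it does not, by comparing the multi-column join with a hand-built key. The fixed values for a and b and the "_" separator are assumptions chosen only to make the result easy to check.

```r
library(dplyr)

# Same shape as the example above, with fixed values instead of rnorm()
d1 <- tibble(x = letters[1:3], y = LETTERS[1:3], a = c(0.1, 0.2, 0.3))
d2 <- tibble(x2 = letters[3:1], y2 = LETTERS[3:1], b = c(30, 20, 10))

# Join on both column pairs directly
multi <- left_join(d1, d2, by = c("x" = "x2", "y" = "y2"))

# Build a composite key by hand, join on it, then drop it again
d1_keyed <- mutate(d1, key = paste(x, y, sep = "_"))
d2_keyed <- select(mutate(d2, key = paste(x2, y2, sep = "_")), key, b)
composite <- select(left_join(d1_keyed, d2_keyed, by = "key"), -key)

# Both approaches give the same rows and columns
same_result <- isTRUE(all.equal(multi, composite))
```

The multi-column form also avoids accidental collisions that a pasted key can produce when the separator occurs inside the values.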
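As of May 2022, we also have the option of using join_by() (part of released dplyr from version 1.1.0 onward), which, besides joining on specific columns (as in Dave's answer), allows two data frames to be joined in several other ways.
We can use:
- Equality conditions: ==
- Inequality conditions: >=, >, <= or <
- Rolling helper: closest()
- Overlap helpers: between(), within() or overlaps()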
# First we create a dataset similar to Dave's, but with a few more columns,
# which makes it easy to demonstrate the other joins
library(tidyverse)
set.seed(0)
dfx <- tibble(
  id = 1:3,
  first_name = c("Alice", "Bob", "Charlie"),
  last_name = c("Adams", "Baker", "Chaplin"),
  a = rnorm(3),
  lb = 0.25,
  ub = 0.75)
dfy <- tibble(
  id = 1:3,
  first_name = c("Alice", "Bob", "Charlie"),
  last_name = c("Adams", "Baker", "Chaplin"),
  b = rnorm(3),
  other_range = 0,
  other_range2 = 1)
left_join(dfx, dfy, join_by(id, first_name, last_name == last_name))
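Note: if the columns to join on have the same name in both data frames, you don't need to write col == col; just use col, as with the first two columns in the example above.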
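A small sketch of that shorthand, together with the case where the names differ and both sides have to be spelled out. The tables left, right and right_renamed and the column person_id are made up for illustration; join_by() requires dplyr 1.1.0 or later.

```r
library(dplyr)  # join_by() needs dplyr >= 1.1.0

left <- tibble(id = 1:3, name = c("Alice", "Bob", "Charlie"),
               a = c(0.1, 0.2, 0.3))
right <- tibble(id = 3:1, name = c("Charlie", "Bob", "Alice"),
                b = c(30, 20, 10))

# Shorthand and explicit form describe the same equality conditions
short_form <- left_join(left, right, join_by(id, name))
long_form  <- left_join(left, right, join_by(id == id, name == name))

# When a key is named differently on the right, write both sides
right_renamed <- rename(right, person_id = id)
mixed_form <- left_join(left, right_renamed, join_by(id == person_id, name))

same_short_long  <- isTRUE(all.equal(short_form, long_form))
same_short_mixed <- isTRUE(all.equal(short_form, mixed_form))
```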
left_join(dfx, dfy, join_by(a < b)) # join the rows where a < b
left_join(dfx, dfy, join_by(closest(a < b))) # similar to above, but only take the closest match
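Because a and b above come from rnorm(), the difference between a plain inequality join and closest() is hard to read off the output. The following sketch uses small fixed values (an assumption made for illustration) so the matches can be checked by hand.

```r
library(dplyr)  # join_by() needs dplyr >= 1.1.0

left  <- tibble(a = c(1, 5))
right <- tibble(b = c(2, 4, 6))

# Inequality join: every b greater than a is a match
#   a = 1 matches b = 2, 4, 6;  a = 5 matches b = 6
all_matches <- left_join(left, right, join_by(a < b))

# Rolling join: keep only the nearest b that still satisfies a < b
#   a = 1 matches b = 2;  a = 5 matches b = 6
nearest <- left_join(left, right, join_by(closest(a < b)))

n_all     <- nrow(all_matches)
n_nearest <- nrow(nearest)
```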
# join rows where a is between other_range and other_range2
left_join(dfx, dfy, join_by(between(a, other_range, other_range2)))
# join rows where the ranges (lb to ub, and other_range to other_range2) overlap
left_join(dfx, dfy, join_by(overlaps(lb, ub, other_range, other_range2)))
# join rows where lb to ub lies within other_range to other_range2
left_join(dfx, dfy, join_by(within(lb, ub, other_range, other_range2)))
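To see how the three overlap helpers differ, here is a minimal sketch with hand-picked ranges. The tables spans and windows and their column names are hypothetical, and inner_join() is used so that the row count equals the number of matches.

```r
library(dplyr)  # join_by() needs dplyr >= 1.1.0

spans   <- tibble(from = c(1, 4), to = c(3, 9))
windows <- tibble(win_start = 0, win_end = 5)

# between(): is the single value `from` inside [win_start, win_end]?
in_window <- inner_join(spans, windows,
                        join_by(between(from, win_start, win_end)))

# overlaps(): do [from, to] and [win_start, win_end] share any point?
overlapping <- inner_join(spans, windows,
                          join_by(overlaps(from, to, win_start, win_end)))

# within(): is [from, to] entirely inside [win_start, win_end]?
contained <- inner_join(spans, windows,
                        join_by(within(from, to, win_start, win_end)))

n_between  <- nrow(in_window)    # both 1 and 4 lie in [0, 5]
n_overlaps <- nrow(overlapping)  # [1, 3] and [4, 9] both touch [0, 5]
n_within   <- nrow(contained)    # only [1, 3] fits inside [0, 5]
```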
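Also note: join_by() assumes that you list the columns of the left data frame before those of the right one. If for some reason you don't want to do that, prefix columns from the left data frame with x$ and columns from the right data frame with y$, e.g. join_by(x$a < y$b).
For more information, read the documentation.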
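The x$ and y$ prefixes matter most when the right-hand column is written first. As a sketch with made-up fixed values, the condition below is written from the right table's point of view and should produce the same join as a < b:

```r
library(dplyr)  # join_by() needs dplyr >= 1.1.0

left  <- tibble(a = c(1, 5))
right <- tibble(b = c(2, 4, 6))

# Left column first: no prefixes needed
default_order <- inner_join(left, right, join_by(a < b))

# Right column first: the prefixes say which table each column comes from
flipped_order <- inner_join(left, right, join_by(y$b > x$a))

n_default <- nrow(default_order)
n_flipped <- nrow(flipped_order)
```

Without the prefixes, join_by(b > a) would look for b in the left table and a in the right one, which do not exist in this example.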
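I realized that I never really addressed the core of the OP's question:
Something like this:
left_join(x, y, by = c("a c" = "b d"))
You can't do exactly that, because dplyr expects each string to be the name of a single column. However, if you have two strings that each contain column names separated by spaces, you can do the following: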
j1 <- "id first_name last_name"
j2 <- j1 # let's pretend for the sake of argument they are different, as it doesn't change the answer
join_vec <- function(j1, j2) {
  # names = left-hand columns, values = right-hand columns
  setNames(str_split(j2, " ")[[1]], str_split(j1, " ")[[1]])
}
left_join(dfx, dfy, by = join_vec(j1, j2))
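Since j1 and j2 are identical above, here is a sketch of the same idea where the two strings really do differ, as in the OP's "a c" = "b d" case. It uses base strsplit() instead of stringr; join_vec_base is a hypothetical name, and the data are fixed values modelled on Dave's example.

```r
library(dplyr)

d1 <- tibble(x = letters[1:3], y = LETTERS[1:3], a = c(0.1, 0.2, 0.3))
d2 <- tibble(x2 = letters[3:1], y2 = LETTERS[3:1], b = c(30, 20, 10))

# Turn two space-separated strings into the named vector that `by` expects:
# names are the left-hand columns, values are the right-hand columns
join_vec_base <- function(left_cols, right_cols) {
  setNames(
    strsplit(right_cols, " ", fixed = TRUE)[[1]],
    strsplit(left_cols, " ", fixed = TRUE)[[1]]
  )
}

by_cols <- join_vec_base("x y", "x2 y2")  # c(x = "x2", y = "y2")
joined  <- left_join(d1, d2, by = by_cols)
```

This assumes both strings list the same number of columns, in matching order, and that no column name contains a space.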