Can dplyr join on multiple columns or a composite key?

Jas*_*lns 96 r dplyr

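I realize that dplyr v0.3 allows you to join on differently named variables:

left_join(x, y, by = c("a" = "b")) will match x.a to y.b

However, is it possible to join on a combination of variables, or do I have to add a composite key beforehand?

Something like this:

left_join(x, y, by = c("a c" = "b d")) to match the concatenation of [x.a, x.c] to [y.b, y.d]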

dav*_*ers 167

You can pass a named vector of length greater than 1 to the by argument of left_join():

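library(dplyr)

# Two small tables whose key columns have different names
d1 <- tibble(
  x = letters[1:3],
  y = LETTERS[1:3],
  a = rnorm(3)
)

d2 <- tibble(
  x2 = letters[3:1],
  y2 = LETTERS[3:1],
  b = rnorm(3)
)

# Each name is a column in d1; each value is the matching column in d2
left_join(d1, d2, by = c("x" = "x2", "y" = "y2"))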
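To check that a multi-column by behaves like a composite key, here is a minimal sketch using the column names from the question. The tables x and y and their values are made up for illustration. It compares the multi-column join with the manual approach of pasting the key columns together first.

```r
library(dplyr)

# Hypothetical data: the key is the pair (a, c) in x and (b, d) in y
x <- tibble(a = c(1, 1, 2), c = c("u", "v", "u"), val_x = c(10, 20, 30))
y <- tibble(b = c(1, 2, 2), d = c("u", "u", "v"), val_y = c(100, 200, 300))

# Multi-column join: a row matches only if a == b AND c == d
joined <- left_join(x, y, by = c("a" = "b", "c" = "d"))

# Manual composite key, built by pasting the key columns together
x_key <- x %>% mutate(key = paste(a, c, sep = "_"))
y_key <- y %>% mutate(key = paste(b, d, sep = "_")) %>% select(key, val_y)
joined_key <- left_join(x_key, y_key, by = "key") %>% select(-key)

# Compare the joined column from both approaches
identical(joined$val_y, joined_key$val_y)
```

A pasted key can collide when the values themselves contain the separator, which is one reason to prefer the multi-column by over a hand-built key.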

  • When the join columns have the same names in both tables, you can also skip the `=`: `left_join(d1, d2, by = c("firstname", "lastname"))` (7 upvotes)
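  • Thank you. The named form also works when the columns have the same names in both data frames, e.g. `left_join(d1, d2, by = c("firstname" = "firstname", "lastname" = "lastname"))`. This may not be obvious to some people. (2 upvotes)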
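  • Great, but this appears to be an AND condition. That makes sense, but I was hoping for an OR (x = x2 or y = y2), because I have built several indexes to try to identify duplicate but corrupted entries across different sources. (2 upvotes)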

Mar*_*ark 8

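As of May 2022, we also have the option of using join_by(). In addition to joining on specific columns (as in Dave's answer), it allows two data frames to be joined in several other ways.

We can use:

  • Equality conditions: ==
  • Inequality conditions: >=, >, <=, or <
  • Rolling helper: closest()
  • Overlap helpers: between(), within(), or overlaps()

Example: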

# first we create a dataset similar to Dave's one, but with a few more columns, which make it easy to demonstrate the other joins
library(tidyverse)
set.seed(0)

dfx <- tibble(
  id = 1:3,
  first_name = c("Alice", "Bob", "Charlie"),
  last_name = c("Adams", "Baker", "Chaplin"),
  a = rnorm(3),
  lb = 0.25,
  ub = 0.75)

dfy <- tibble(
  id = 1:3,
  first_name = c("Alice", "Bob", "Charlie"),
  last_name = c("Adams", "Baker", "Chaplin"),
  b = rnorm(3),
  other_range = 0,
  other_range2 = 1)

Equality join (what the OP asked for):

left_join(dfx, dfy, join_by(id, first_name, last_name == last_name))

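Note: if the columns to join on have the same name in both data frames, you don't need to write col == col; just write col, as with the first two columns in the example above.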
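When the key columns are named differently in the two tables, as in the original question, each pair gets its own == condition. The sketch below assumes dplyr 1.1.0 or later (where join_by() is available) and uses made-up tables x and y with the column names from the question.

```r
library(dplyr)  # join_by() needs dplyr >= 1.1.0

# Hypothetical data: match (a, c) in x against (b, d) in y
x <- tibble(a = c(1, 1, 2), c = c("u", "v", "u"), val_x = c(10, 20, 30))
y <- tibble(b = c(1, 2, 2), d = c("u", "u", "v"), val_y = c(100, 200, 300))

# The left-table column goes on the left of ==, the right-table column on the right
res <- left_join(x, y, by = join_by(a == b, c == d))
res
```

For equality conditions the key columns from y are dropped by default, so the result keeps a and c but not b and d.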

Inequality join:

left_join(dfx, dfy, join_by(a < b)) # join the rows where a < b

Rolling join:

left_join(dfx, dfy, join_by(closest(a < b))) # similar to above, but only take the closest match
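Because dfx and dfy hold random values, the difference between an inequality join and a rolling join is easier to see on fixed numbers. The following is a small sketch with made-up tables (left and right are hypothetical names), assuming dplyr 1.1.0 or later.

```r
library(dplyr)

# Hypothetical data with fixed values
left  <- tibble(id = 1:2, a = c(1, 5))
right <- tibble(b = c(2, 4, 6))

# Inequality join: keeps every pair of rows where a < b
ineq <- left_join(left, right, by = join_by(a < b))

# Rolling join: for each a, keeps only the nearest b that satisfies a < b
roll <- left_join(left, right, by = join_by(closest(a < b)))

nrow(ineq)
nrow(roll)
```

Here a = 1 is smaller than all three values of b and a = 5 is smaller than only one, so the inequality join returns more rows than the rolling join, which keeps a single match per row of left.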

Overlap joins:

left_join(dfx, dfy, join_by(between(a, other_range, other_range2))) # join rows where a is between other_range and other_range2

left_join(dfx, dfy, join_by(overlaps(lb, ub, other_range, other_range2))) # join rows where the ranges (lb to ub, and other_range to other_range2) overlap

left_join(dfx, dfy, join_by(within(lb, ub, other_range, other_range2))) # join rows where lb to ub is within other_range to other_range2
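To make the three overlap helpers easier to tell apart, here is a sketch on fixed numbers. The tables events and windows and all their columns are invented for illustration, and dplyr 1.1.0 or later is assumed.

```r
library(dplyr)

# Hypothetical data: each event has a time point t and a range [start, end]
events  <- tibble(event = c("e1", "e2"), start = c(2, 8), end = c(4, 12), t = c(3, 11))
windows <- tibble(window = c("w1", "w2"), lo = c(0, 5), hi = c(5, 10))

# between(): the single value t lies inside [lo, hi]
btw <- left_join(events, windows, by = join_by(between(t, lo, hi)))

# overlaps(): [start, end] and [lo, hi] share at least one point
ovl <- left_join(events, windows, by = join_by(overlaps(start, end, lo, hi)))

# within(): [start, end] lies entirely inside [lo, hi]
wth <- left_join(events, windows, by = join_by(within(start, end, lo, hi)))

btw$window
ovl$window
wth$window
```

Event e2 runs from 8 to 12, so it overlaps window w2 (5 to 10) without being contained in it; that row is matched by overlaps() but not by within().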

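Also note: join_by() assumes that in each condition you list the left data frame's column before the right data frame's column. If for some reason you don't want to do that, prefix columns with x$ for the left data frame and y$ for the right data frame, e.g. join_by(x$a < y$b)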
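As a rough illustration of the x$ and y$ prefixes, the sketch below writes the same inequality condition in the usual order and with the sides swapped. The tables left and right are made up, and dplyr 1.1.0 or later is assumed.

```r
library(dplyr)

# Hypothetical data
left  <- tibble(a = c(1, 5))
right <- tibble(b = c(2, 4, 6))

# Usual order: left-table column first
usual <- left_join(left, right, by = join_by(a < b))

# Sides swapped: the prefixes say which table each column comes from
flipped <- left_join(left, right, by = join_by(y$b > x$a))

# Compare the matched values from both spellings
identical(usual$b, flipped$b)
```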

For more information, see the documentation for join_by().

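Update:

I realized I never actually addressed the core of the OP's question:

Something like this: left_join(x, y, by = c("a c" = "b d"))

You can't do exactly that, because dplyr expects each string to be the name of a single column. However, if you have two strings that each contain column names separated by spaces, you can do the following: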

j1 <- "id first_name last_name"
j2 <- j1 # let's pretend for the sake of argument they are different, as it doesn't change the answer

join_vec <- function(j1, j2) {
    setNames(str_split(j2, " ")[[1]], str_split(j1, " ")[[1]])
}

left_join(dfx, dfy, by = join_vec(j1, j2))
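Applied to the strings from the question, the same idea can be sketched with base R's strsplit() in place of str_split(). The helper name join_vec_base and the tables x and y are hypothetical, and the sketch assumes column names are separated by single spaces and contain no spaces themselves.

```r
library(dplyr)

# Hypothetical base R variant of the helper: names come from j1, values from j2
join_vec_base <- function(j1, j2) {
  left_cols  <- strsplit(j1, " ", fixed = TRUE)[[1]]
  right_cols <- strsplit(j2, " ", fixed = TRUE)[[1]]
  # Both strings must list the same number of columns
  stopifnot(length(left_cols) == length(right_cols))
  setNames(right_cols, left_cols)
}

# The strings from the question become the named vector c(a = "b", c = "d")
by_vec <- join_vec_base("a c", "b d")

# Hypothetical data to join with that vector
x <- tibble(a = c(1, 1, 2), c = c("u", "v", "u"), val_x = c(10, 20, 30))
y <- tibble(b = c(1, 2, 2), d = c("u", "u", "v"), val_y = c(100, 200, 300))

res <- left_join(x, y, by = by_vec)
res
```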