dplyr: inner_join on a partial string match

Ste*_*ner 19 string join r stringr dplyr

I want to join two data frames wherever seed in data frame y is a partial match for string in data frame x. This example should illustrate:

# What I have
x <- data.frame(idX=1:3, string=c("Motorcycle", "TractorTrailer", "Sailboat"))
y <- data_frame(idY=letters[1:3], seed=c("ractor", "otorcy", "irplan"))


x

  idX         string
1   1     Motorcycle
2   2 TractorTrailer
3   3       Sailboat

y

Source: local data frame [3 x 2]

    idY   seed
  (chr)  (chr)
1     a ractor
2     b otorcy
3     c irplan


# What I want
want <- data.frame(idX=c(1,2), idY=c("b", "a"), string=c("Motorcycle", "TractorTrailer"), seed=c("otorcy", "ractor"))

want

  idX idY         string   seed
1   1   b     Motorcycle otorcy
2   2   a TractorTrailer ractor

Something like this:

inner_join(x, y, by=stringr::str_detect(x$string, y$seed))

Fen*_*Mai 22

The fuzzyjoin library has two functions, regex_inner_join and fuzzy_inner_join, that let you match on partial strings:

x <- data.frame(idX=1:3, string=c("Motorcycle", "TractorTrailer", "Sailboat"))
y <- data.frame(idY=letters[1:3], seed=c("ractor", "otorcy", "irplan"))
x$string = as.character(x$string)
y$seed = as.character(y$seed)


library(dplyr)      # provides the %>% pipe used below
library(fuzzyjoin)
x %>% regex_inner_join(y, by = c(string = "seed"))

  idX         string idY   seed
1   1     Motorcycle   b otorcy
2   2 TractorTrailer   a ractor


library(stringr)
x %>% fuzzy_inner_join(y, by = c("string" = "seed"), match_fun = str_detect)


  idX         string idY   seed
1   1     Motorcycle   b otorcy
2   2 TractorTrailer   a ractor

  • For better performance on large tables, you can use match_fun = stri_detect_fixed from the stringi package. (2 upvotes)
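The suggestion in the comment above could look roughly like the sketch below. It assumes the fuzzyjoin and stringi packages are installed and that the seeds are literal substrings rather than regular expressions; the name res is only illustrative.

```r
library(fuzzyjoin)
library(stringi)

x <- data.frame(idX = 1:3,
                string = c("Motorcycle", "TractorTrailer", "Sailboat"),
                stringsAsFactors = FALSE)
y <- data.frame(idY = letters[1:3],
                seed = c("ractor", "otorcy", "irplan"),
                stringsAsFactors = FALSE)

# stri_detect_fixed(str, pattern) does a literal substring test (no regex),
# so the x column is passed as str and the y column as pattern.
res <- fuzzy_inner_join(x, y,
                        by = c("string" = "seed"),
                        match_fun = stri_detect_fixed)
res
```

Because the match is literal, characters such as "." or "(" in a seed are matched as-is, which differs from regex_inner_join.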

Dav*_*vid 7

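You can also use this function, which relies mostly on base R (slightly modified from https://stackoverflow.com/a/34723496/3048453; it uses dplyr to bind the columns together, so use cbind instead if you don't want to depend on dplyr):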

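partial_join <- function(x, y, by_x, pattern_y) {
 # for each pattern in y, the rows of x whose by_x column matches it
 # (lapply always returns a list, whereas sapply may simplify to a vector or matrix)
 idx_x <- lapply(y[[pattern_y]], grep, x[[by_x]])
 # repeat each y row index once per matching row of x
 idx_y <- lapply(seq_along(idx_x), function(i) rep(i, length(idx_x[[i]])))

 df <- dplyr::bind_cols(x[unlist(idx_x), , drop = FALSE],
                        y[unlist(idx_y), , drop = FALSE])
 return(df)
}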

Using your example:

x <- data.frame(idX=1:3, string=c("Motorcycle", "TractorTrailer", "Sailboat"))
y <- data_frame(idY=letters[1:3], seed=c("ractor", "otorcy", "irplan"))

df_merged <- partial_join(x, y, by_x = "string", pattern_y = "seed")
df_merged
# # A tibble: 2 × 4
#     idX         string   idY   seed
#   <int>          <chr> <chr>  <chr>
# 1     1     Motorcycle     b otorcy
# 2     2 TractorTrailer     a ractor
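For readers who want to avoid dplyr entirely, as the note about cbind suggests, a base-R-only variant might look like the following. This is a minimal sketch: partial_join_base is a hypothetical name, and the patterns in y are treated as regular expressions, exactly as grep does by default.

```r
# Hypothetical base-R-only variant of the partial join (no dplyr needed).
partial_join_base <- function(x, y, by_x, pattern_y) {
  # For each pattern in y, the row indices of x whose by_x value matches it
  idx_x <- lapply(as.character(y[[pattern_y]]), grep,
                  x = as.character(x[[by_x]]))
  # Repeat each y row index once per matching row of x
  idx_y <- rep(seq_along(idx_x), lengths(idx_x))
  out <- cbind(x[unlist(idx_x), , drop = FALSE],
               y[idx_y, , drop = FALSE])
  rownames(out) <- NULL
  out
}

x <- data.frame(idX = 1:3,
                string = c("Motorcycle", "TractorTrailer", "Sailboat"),
                stringsAsFactors = FALSE)
y <- data.frame(idY = letters[1:3],
                seed = c("ractor", "otorcy", "irplan"),
                stringsAsFactors = FALSE)

res <- partial_join_base(x, y, by_x = "string", pattern_y = "seed")
res
```

Rows come back in the order of the patterns in y, so sort the result afterwards if a particular order matters.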



jor*_*ran 6

I don't know how this will perform on larger data, but it (or a variant of it) may be worth a try:

library(dplyr)

x <- data.frame(idX=1:3, string=c("Motorcycle", "TractorTrailer", "Sailboat"))
y <- data_frame(idY=letters[1:3], seed=c("ractor", "otorcy", "irplan"))

my_db <- src_sqlite(path = tempfile(),create= TRUE)
x_tbl <- copy_to(dest = my_db,df = x)
y_tbl <- copy_to(dest = my_db,df = y)

result <- tbl(my_db,sql("select * from x,y where x.string like '%' || y.seed || '%'"))
collect(result)

Source: local data frame [2 x 4]

    idX         string   idY   seed
  (int)          (chr) (chr)  (chr)
1     1     Motorcycle     b otorcy
2     2 TractorTrailer     a ractor

I also can't say how its performance might vary by database. postgres or mysql may be better or worse at this kind of query.


Ste*_*ner 4

This works, but it will be very slow on huge data sets.

x <- data.frame(idX=1:3, string=c("Motorcycle", "TractorTrailer", "Sailboat"))
y <- data_frame(idY=letters[1:3], seed=c("ractor", "otorcy", "irplan"))

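library(dplyr)
library(stringr)  # for str_detect()
# the constant column i turns the join into a cross join of x and y
full_join(mutate(x, i=1), 
          mutate(y, i=1), by = "i") %>% 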
  select(-i) %>% 
  filter(str_detect(string, seed))
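The same cross-join-then-filter idea can be sketched in base R without any packages. This assumes the seeds are literal substrings (hence fixed = TRUE); the names pairs, keep and result are only illustrative. Like the pipeline above, it builds every x/y combination first, so memory use grows with nrow(x) * nrow(y).

```r
x <- data.frame(idX = 1:3,
                string = c("Motorcycle", "TractorTrailer", "Sailboat"),
                stringsAsFactors = FALSE)
y <- data.frame(idY = letters[1:3],
                seed = c("ractor", "otorcy", "irplan"),
                stringsAsFactors = FALSE)

# merge() with by = NULL returns the Cartesian product of the two data frames
pairs <- merge(x, y, by = NULL)

# Test each row's seed against that row's string as a literal substring
keep <- unname(mapply(grepl, pairs$seed, pairs$string,
                      MoreArgs = list(fixed = TRUE)))

result <- pairs[keep, ]
rownames(result) <- NULL
result
```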