Ste*_*ner 19 string join r stringr dplyr
I want to join two data frames when the seed column in data frame y is a partial match on the string column in x. This example should illustrate:
# What I have
x <- data.frame(idX=1:3, string=c("Motorcycle", "TractorTrailer", "Sailboat"))
y <- data_frame(idY=letters[1:3], seed=c("ractor", "otorcy", "irplan"))
x
  idX         string
1   1     Motorcycle
2   2 TractorTrailer
3   3       Sailboat
y
Source: local data frame [3 x 2]
    idY   seed
  (chr)  (chr)
1     a ractor
2     b otorcy
3     c irplan
# What I want
want <- data.frame(idX=c(1,2), idY=c("b", "a"), string=c("Motorcycle", "TractorTrailer"), seed=c("otorcy", "ractor"))
want
  idX idY         string   seed
1   1   b     Motorcycle otorcy
2   2   a TractorTrailer ractor
Something like this:
inner_join(x, y, by=stringr::str_detect(x$string, y$seed))
Fen*_*Mai 22
The fuzzyjoin library has two functions, regex_inner_join and fuzzy_inner_join, that let you match on partial strings:
x <- data.frame(idX=1:3, string=c("Motorcycle", "TractorTrailer", "Sailboat"))
y <- data.frame(idY=letters[1:3], seed=c("ractor", "otorcy", "irplan"))
x$string = as.character(x$string)
y$seed = as.character(y$seed)
library(dplyr)  # provides the %>% pipe used below
library(fuzzyjoin)
x %>% regex_inner_join(y, by = c(string = "seed"))
  idX         string idY   seed
1   1     Motorcycle   b otorcy
2   2 TractorTrailer   a ractor
library(stringr)
x %>% fuzzy_inner_join(y, by = c("string" = "seed"), match_fun = str_detect)
  idX         string idY   seed
1   1     Motorcycle   b otorcy
2   2 TractorTrailer   a ractor
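One thing worth keeping in mind with both calls above: the seed is interpreted as a regular expression, not as literal text. The following base R sketch (the strings and the seed here are made up for illustration, they are not from the question) shows how the two interpretations can differ when a seed contains a metacharacter such as a dot.

```r
# Hypothetical data: the seed "a.c" contains the regex metacharacter "."
strings <- c("abc", "a.c", "xyz")
seed <- "a.c"

# Regex matching: "." matches any single character, so "abc" matches as well
regex_hits <- grepl(seed, strings)

# Literal matching: only strings that really contain "a.c" match
literal_hits <- grepl(seed, strings, fixed = TRUE)

print(data.frame(strings, regex_hits, literal_hits))
```

If literal matching is what you need with fuzzy_inner_join, one option is probably a custom match_fun that wraps the pattern in stringr's fixed(), for example function(a, b) str_detect(a, fixed(b)).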
You can also use this function with base R (slightly modified from https://stackoverflow.com/a/34723496/3048453; it uses dplyr to bind the columns together, use cbind instead if you don't want to use dplyr):
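partial_join <- function(x, y, by_x, pattern_y) {
  # For each pattern in y, find the rows of x whose string matches it
  idx_x <- lapply(y[[pattern_y]], grep, x[[by_x]])
  # Repeat each row index of y once per matching row of x
  idx_y <- lapply(seq_along(idx_x), function(i) rep(i, length(idx_x[[i]])))
  df <- dplyr::bind_cols(x[unlist(idx_x), , drop = FALSE],
                         y[unlist(idx_y), , drop = FALSE])
  return(df)
}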
With your example:
x <- data.frame(idX=1:3, string=c("Motorcycle", "TractorTrailer", "Sailboat"))
y <- data_frame(idY=letters[1:3], seed=c("ractor", "otorcy", "irplan"))
df_merged <- partial_join(x, y, by_x = "string", pattern_y = "seed")
df_merged
# # A tibble: 2 × 4
#     idX string         idY   seed
#   <int> <chr>          <chr> <chr>
# 1     1 Motorcycle     b     otorcy
# 2     2 TractorTrailer a     ractor
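For anyone who wants to avoid dplyr entirely, here is a minimal sketch of the cbind variant mentioned above. The name partial_join_base is hypothetical, and unlike the original function it matches seeds as literal text (fixed = TRUE) rather than as regular expressions, which is an assumption on my part about what is wanted.

```r
# Base-R-only variant; partial_join_base is a hypothetical name
partial_join_base <- function(x, y, by_x, pattern_y) {
  strings <- as.character(x[[by_x]])
  seeds <- as.character(y[[pattern_y]])
  # For each seed, the row numbers of x whose string contains it literally
  idx_x <- lapply(seeds, function(s) which(grepl(s, strings, fixed = TRUE)))
  # Repeat each row number of y once per matching row of x
  idx_y <- rep(seq_along(seeds), lengths(idx_x))
  out <- cbind(x[unlist(idx_x), , drop = FALSE],
               y[idx_y, , drop = FALSE])
  rownames(out) <- NULL
  out
}

x <- data.frame(idX = 1:3, string = c("Motorcycle", "TractorTrailer", "Sailboat"))
y <- data.frame(idY = letters[1:3], seed = c("ractor", "otorcy", "irplan"))
res <- partial_join_base(x, y, by_x = "string", pattern_y = "seed")
print(res)  # rows come out in the order of the seeds in y
```

Because the loop runs over the seeds, the result is ordered by y rather than by x; sort it afterwards if the order of x matters.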
I don't know how this will perform on larger data, but it (or a variant of it) may be worth a try:
library(dplyr)
x <- data.frame(idX=1:3, string=c("Motorcycle", "TractorTrailer", "Sailboat"))
y <- data_frame(idY=letters[1:3], seed=c("ractor", "otorcy", "irplan"))
my_db <- src_sqlite(path = tempfile(),create= TRUE)
x_tbl <- copy_to(dest = my_db,df = x)
y_tbl <- copy_to(dest = my_db,df = y)
result <- tbl(my_db,sql("select * from x,y where x.string like '%' || y.seed || '%'"))
collect(result)
Source: local data frame [2 x 4]
    idX         string   idY   seed
  (int)          (chr) (chr)  (chr)
1     1     Motorcycle     b otorcy
2     2 TractorTrailer     a ractor
I also can't say how its performance might vary by database; Postgres or MySQL may be better or worse at this kind of query.
This works, but it will be very slow on huge datasets.
library(dplyr)
library(stringr)  # for str_detect()
x <- data.frame(idX=1:3, string=c("Motorcycle", "TractorTrailer", "Sailboat"))
y <- data_frame(idY=letters[1:3], seed=c("ractor", "otorcy", "irplan"))
# Cross join on a constant dummy key, then keep rows where seed occurs in string
full_join(mutate(x, i=1),
          mutate(y, i=1),
          by = "i") %>%
  select(-i) %>%
  filter(str_detect(string, seed))
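To see where the cost comes from, here is a base R sketch of the same idea (not the code above, just an illustration with the example data). The cross join materialises nrow(x) * nrow(y) rows before anything is filtered, so the intermediate table grows with the product of the two sizes. Matching seeds literally with fixed = TRUE is my assumption here.

```r
x <- data.frame(idX = 1:3, string = c("Motorcycle", "TractorTrailer", "Sailboat"))
y <- data.frame(idY = letters[1:3], seed = c("ractor", "otorcy", "irplan"))

# merge() with by = NULL returns the Cartesian product of the two data frames
crossed <- merge(x, y, by = NULL)
print(nrow(crossed))  # nrow(x) * nrow(y) rows exist before any filtering

# Test each (seed, string) pair; fixed = TRUE treats the seed as literal text
keep <- mapply(grepl,
               as.character(crossed$seed),
               as.character(crossed$string),
               MoreArgs = list(fixed = TRUE),
               USE.NAMES = FALSE)
result <- crossed[keep, ]
print(result)
```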