R 如何使用向量加速模式匹配

Question

R 如何使用向量加速模式匹配

我在一个数据框中有一列，其中包含城市和州名称：

ac <- c("san francisco ca", "pittsburgh pa", "philadelphia pa", "washington dc", "new york ny", "aliquippa pa", "gainesville fl", "manhattan ks")

ac <- as.data.frame(ac)

我想搜索ac$ac另一个数据框列中的值，如果存在匹配则d$description返回列的值。id

dput(df)
structure(list(month = c(202110L, 201910L, 202005L, 201703L, 
201208L, 201502L), id = c(100559687L, 100558763L, 100558934L, 
100558946L, 100543422L, 100547618L), description = c("residential local telephone service local with more san francisco ca flat rate with eas package plan includes voicemail call forwarding call waiting caller id call restriction three way calling id block speed dialing call return call screening modem rental voip transmission telephone access line 34 95 modem rental 7 00 total 41 95", 
"digital video programming service multilatino ultra bensalem pa service includes digital economy multilatino digital preferred tier and certain additonal digital channels coaxial cable transmission", 
"residential all distance telephone service  unlimited  voice only harrisburg pa flat rate with eas only features call waiting caller id caller id with call waiting call screening call forwarding call forwarding selective call return 69 3 way calling anonymous call rejection repeat dialing speed dial caller id blocking coaxial cable transmission", 
"residential all distance telephone service  unlimited voice only pittsburgh pa flat rate with eas only features call waiting caller id caller id with call waiting call screening call forwarding call forwarding selective call return 69 3 way calling anonymous call rejection repeat dialing speed dial caller id blocking", 
"local spot advertising 30 second advertisement austin tx weekday 6 am 6 pm other audience demographic w18 49 number of rating points for daypart 0 29 average cpp 125", 
"residential public switched toll interstate manhattan ks ks plan area residence switched toll base period average revenue per minute 0 18 minute online"
)), row.names = c(1L, 1245L, 3800L, 10538L, 20362L, 50000L), class = "data.frame")

Run Code Online (Sandbox Code Playgroud)

我尝试通过以下方法访问匹配的行索引来做到这一点：

which(ac$ac %in% df$description)--这返回integer(0)。
grep(ac$ac, df$description, value = FALSE)--这返回第一个索引 1。但这不是矢量化的。
str_detect(string = ac$ac, pattern = df$description)-- 但这会返回所有FALSE不正确的内容。

我的问题：如果匹配，如何搜索ac$acindf$description并返回相应的值？df$id请注意，向量的长度不同。 我正在寻找所有匹配项，而不仅仅是第一个。我更喜欢简单而快速的东西，因为我将使用的实际数据集每个都有超过 100k 行，但欢迎任何建议或想法。谢谢。

编辑。由于安德烈在下面的初步回答，问题的名称已更改，以适应问题范围的变化。

编辑（12/7）：添加赏金以产生额外的兴趣和快速、高效的可扩展解决方案。

编辑（12/8）：澄清——我希望能够将id变量 from添加df到ac数据框中，如ac$id.

Answer 1

Mar*_*łka 9

最简单的解决方案通常是最快的！这是我的建议：

str = paste0(ac, collapse="|")
df$id[grep(str, df$description)]

Run Code Online (Sandbox Code Playgroud)

但你也可以这样

df$id[as.logical(rowSums(!is.na(sapply(ac, function(x) stringr::str_match(df$description, x)))))]

Run Code Online (Sandbox Code Playgroud)

或者这样

df$id[grepl(str, df$description, perl=T)]

Run Code Online (Sandbox Code Playgroud)

不过，还是得比较一下。顺便说一句，我添加了 @Andre Wildberg 和 @Martina C. Arnolda 的建议。以下是基准。

str = paste0(ac, collapse="|")
fFiolka1 = function() df$id[grep(str, df$description)]
fFiolka2 = function() df$id[as.logical(rowSums(!is.na(sapply(ac, function(x) stringr::str_match(df$description, x)))))]
fFiolka3 = function() df$id[grepl(str, df$description, perl=T)]

fWildberg1 = function() df$id[unlist(sapply(ac, function(x) grep(x, df$description)))]
fWildberg2 = function() df$id[as.logical(rowSums(sapply(ac, function(x) stri_detect_regex(df$description, x))))]

fArnolda1 = function() df[grep(str, df$description), ]["id"]
fArnolda2 = function() df[stringi::stri_detect_regex(df$description, str), ]["id"]
fArnolda3 = function() df %>% filter(description %>% str_detect(str)) %>% select(id)

library(microbenchmark)
ggplot2::autoplot(microbenchmark(
  fFiolka1(), fFiolka2(), fFiolka3(),
  fWildberg1(), fWildberg2(),
  fArnolda1(), fArnolda2(), fArnolda3(),
  times=100))

Run Code Online (Sandbox Code Playgroud)

请注意，为了简单起见，我将 ac 保留为向量！。

ac <- c("san francisco ca", "pittsburgh pa", "philadelphia pa", "washington dc", "new york ny", "aliquippa pa", "gainesville fl", "manhattan ks")

Run Code Online (Sandbox Code Playgroud)

@jvalenti 的特别更新

好的。现在我更好地理解您想要实现的目标。不过，为了充分展示最好的解决方案，我稍微修改了你的数据。他们来了

library(tidyverse)

ac <- c("san francisco ca", "pittsburgh pa", "philadelphia pa", "washington dc", "new york ny", "aliquippa pa", "gainesville fl", "manhattan ks")
ac = tibble(ac = ac)

df = structure(list(
  month = c(202110L, 201910L, 202005L, 201703L, 201208L, 201502L), 
  id = c(100559687L, 100558763L, 100558934L, 100558946L, 100543422L, 100547618L), 
  description = c(
    "residential local telephone pittsburgh pa local with more san francisco ca flat rate with eas philadelphia pa plan includes voicemail call forwarding call waiting caller id call restriction three way calling id block speed dialing call return call screening modem rental voip transmission telephone access line 34 95 modem rental 7 00 total 41 95",
    "digital video san francisco ca pittsburgh pa  multilatino ultra bensalem pa service includes digital economy multilatino digital preferred tier and certain additonal digital channels coaxial cable transmission",
    "residential all distance telephone pittsburgh pa unlimited voice only harrisburg pa flat rate with eas only features call waiting caller id caller id with call waiting call screening call forwarding call forwarding selective call return 69 3 way calling anonymous call rejection repeat dialing speed dial caller id blocking coaxial cable transmission",
    "residential all distance telephone pittsburgh pa unlimited voice philadelphia pa san francisco ca pa flat rate with eas only features call waiting caller id caller id with call waiting call screening call forwarding call forwarding selective call return 69 3 way calling anonymous call rejection repeat dialing speed dial caller id blocking",
    "local spot advertising 30 second advertisement austin tx weekday 6 am 6 pm other audience demographic w18 49 number of rating points for daypart 0 29 average cpp 125",
    "residential public switched toll pittsburgh pa manhattan ks ks plan area residence switched toll base san philadelphia pa ca average revenue per minute 0 18 minute online"
  )), row.names = c(1L, 1245L, 3800L, 10538L, 20362L, 50000L), class = "data.frame")

Run Code Online (Sandbox Code Playgroud)

下面您将找到四种不同的解决方案。一种基于循环for，两种解决方案基于dplyr包中的函数，还有一个来自collapse包的函数。

fSolition1 = function(){
  id = vector("list", nrow(ac))
  for(i in seq_along(ac$ac)){
    id[[i]] = df$id[grep(ac$ac[i], df$description)]
  }
  ac %>% mutate(id = id) %>% unnest(id)
}
fSolition1()

fSolition2 = function(){
  ac %>% group_by(ac) %>% 
  mutate(id = list(df$id[grep(ac, df$description)])) %>% 
  unnest(id)
}
fSolition2()

fSolition3 = function(){
  ac %>% rowwise(ac) %>% 
  mutate(id = list(df$id[grep(ac, df$description)])) %>% 
  unnest(id)
}
fSolition3()

fSolition4 = function(){
ac %>%  
  collapse::ftransform(id = lapply(ac, function(x) df$id[grep(x, df$description)])) %>% 
  unnest(id)
}
fSolition4()

Run Code Online (Sandbox Code Playgroud)

请注意，对于给定的数据，所有返回下表作为结果的函数

# A tibble: 12 x 2
   ac                      id
   <chr>                <int>
 1 san francisco ca 100559687
 2 san francisco ca 100558763
 3 san francisco ca 100558946
 4 pittsburgh pa    100559687
 5 pittsburgh pa    100558763
 6 pittsburgh pa    100558934
 7 pittsburgh pa    100558946
 8 pittsburgh pa    100547618
 9 philadelphia pa  100559687
10 philadelphia pa  100558946
11 philadelphia pa  100547618
12 manhattan ks     100547618

Run Code Online (Sandbox Code Playgroud)

是时候进行基准测试了


library(microbenchmark)
ggplot2::autoplot(microbenchmark(
  fSolition1(), fSolition2(), fSolition3(), fSolition4(), times=100))

Run Code Online (Sandbox Code Playgroud)

collapse对于任何人来说，基于此的解决方案是最快的可能并不奇怪。不过，第二名可能会让人大吃一惊。基于for函数的好旧解决方案排在第二位！还有人想说这for很慢吗？

@Gwang-Jin Kim 的特别更新

对向量的作用没有太大变化。往下看。

df_ac = ac$ac
df_decription = df$description
df_id = df$id
fSolition5 = function(){
  id = vector("list", length = length(df_ac))
  for(i in seq_along(df_ac)){
    id[[i]] = df_id[grep(df_ac[i], df_decription)]
  }
  ac %>% mutate(id = id) %>% unnest(id)
}
fSolition5()

library(microbenchmark)
ggplot2::autoplot(microbenchmark(
  fSolition1(), fSolition2(), fSolition3(), fSolition4(), fSolition5(), times=100))

Run Code Online (Sandbox Code Playgroud)

for但和的组合ftransform可能会令人惊讶！

fSolition6 = function(){
  id = vector("list", nrow(ac))
  for(i in seq_along(ac$ac)){
    id[[i]] = df$id[grep(ac$ac[i], df$description)]
  }
  ac %>% collapse::ftransform(id = id) %>% unnest(id)
}
fSolition6()

library(microbenchmark)
ggplot2::autoplot(microbenchmark(
  fSolition1(), fSolition2(), fSolition3(), fSolition4(), fSolition5(), fSolition6(), times=100))

Run Code Online (Sandbox Code Playgroud)

@jvalenti 的最后更新

亲爱的 jvaleniti，在您写的问题中，我在一个数据框中有一列包含城市和州名称，然后我将使用 has over 100k rows。我的结论是，某个给定的城市很可能会在您的变量中出现多次description。

但是，在您写的评论中，我不想更改 ac 中的行数 那么您期望什么样的结果？让我们看看可以用它做什么。

解决方案 1 - 我们将所有内容id作为向量列表返回

ac %>% collapse::ftransform(id = map(ac, ~df$id[grep(.x, df$description)])) 
# # A tibble: 8 x 2
# ac               id       
# * <chr>            <list>   
#   1 san francisco ca <int [3]>
#   2 pittsburgh pa    <int [5]>
#   3 philadelphia pa  <int [3]>
#   4 washington dc    <int [0]>
#   5 new york ny      <int [0]>
#   6 aliquippa pa     <int [0]>
#   7 gainesville fl   <int [0]>
#   8 manhattan ks     <int [1]>

Run Code Online (Sandbox Code Playgroud)

解决方案 2 - 我们只返回第一个id

ac %>% collapse::ftransform(id = map_int(ac, ~df$id[grep(.x, df$description)][1])) 
# # A tibble: 8 x 2
# ac                      id
# * <chr>                <int>
# 1 san francisco ca 100559687
# 2 pittsburgh pa    100559687
# 3 philadelphia pa  100559687
# 4 washington dc           NA
# 5 new york ny             NA
# 6 aliquippa pa            NA
# 7 gainesville fl          NA
# 8 manhattan ks     100547618

Run Code Online (Sandbox Code Playgroud)

解决方案 3 - 我们只返回最后一个id

ac %>%
  collapse::ftransform(id = map_int(ac, function(x) {
    idx = grep(x, df$description)
    ifelse(length(idx)>0, df$id[idx[length(idx)]], NA)})) 
# # A tibble: 8 x 2
# ac                      id
# * <chr>                <int>
# 1 san francisco ca 100558946
# 2 pittsburgh pa    100547618
# 3 philadelphia pa  100547618
# 4 washington dc           NA
# 5 new york ny             NA
# 6 aliquippa pa            NA
# 7 gainesville fl          NA
# 8 manhattan ks     100547618

Run Code Online (Sandbox Code Playgroud)

解决方案 4 - 或者您可能想id从所有可能的方案中选择一个

ac %>%
  collapse::ftransform(id = map_int(ac, function(x) {
    idx = grep(x, df$description)
    ifelse(length(idx)==0, NA, ifelse(length(idx)==1, df$id[idx], df$id[sample(idx, 1)]))})) 
# # A tibble: 8 x 2
# ac                      id
# * <chr>                <int>
# 1 san francisco ca 100558763
# 2 pittsburgh pa    100559687
# 3 philadelphia pa  100547618
# 4 washington dc           NA
# 5 new york ny             NA
# 6 aliquippa pa            NA
# 7 gainesville fl          NA
# 8 manhattan ks     100547618

Run Code Online (Sandbox Code Playgroud)

解决方案 5 - 如果您不小心想查看所有 id 并想ac同时保留行数

ac %>%
  collapse::ftransform(id = map(ac, function(x) {
    idx = grep(x, df$description)
    if(length(idx)==0) tibble(id = NA, idn = "id1") else tibble(
      id = df$id[idx],
      idn = paste0("id",1:length(id)))})) %>% 
  unnest(id) %>% 
  pivot_wider(ac, names_from = idn, values_from = id)
# # A tibble: 8 x 6
# ac                     id1       id2       id3       id4       id5
# <chr>                <int>     <int>     <int>     <int>     <int>
# 1 san francisco ca 100559687 100558763 100558946        NA        NA
# 2 pittsburgh pa    100559687 100558763 100558934 100558946 100547618
# 3 philadelphia pa  100559687 100558946 100547618        NA        NA
# 4 washington dc           NA        NA        NA        NA        NA
# 5 new york ny             NA        NA        NA        NA        NA
# 6 aliquippa pa            NA        NA        NA        NA        NA
# 7 gainesville fl          NA        NA        NA        NA        NA
# 8 manhattan ks     100547618        NA        NA        NA        NA

Run Code Online (Sandbox Code Playgroud)

遗憾的是，您提供的描述并未表明以上五种解决方案中哪一种是您可以接受的解决方案。您必须自己决定。

归档时间：	3 年，11 月前
查看次数：	542 次
最近记录：	3 年，11 月前