从 R 中的字符串向量匹配单词

Jam*_*mes 10 regex string r string-matching stringr

我正在尝试通过将杂乱的站点名称列表与已批准的列表进行匹配来清理数据库。

例如,首选站点名称可能是“Cotswold Water Park Pit 28”,但该站点已输入到数据库中:“Pit 28”、“28”、“CWP Pit 28”和“Cotswold 28”。

数据看起来像这样:

approved <- c("Cotswold Water Park Pit 28", "Cotswold Water Park Pit 14", "Robinswood Hill")

messy <- c("Pit 28", "28", "CWP Pit 28", "Cotswold 28", "14", "Robinswood")
Run Code Online (Sandbox Code Playgroud)

我正在寻找一种方法来将每个元素中的单词/数字(非空格字符簇)messyapproved. 理想情况下,我最终会得到这样的结果:

     Cotswold Water Park Pit 28 Cotswold Water Park Pit 14 Robinswood Hill
[1,] "Pit 28"                   "Pit 28"                   "Robinswood"   
[2,] "28"                       "CWP Pit 28"               NA             
[3,] "CWP Pit 28"               "14"                       NA             
[4,] "Cotswold 28"              NA                         NA   
Run Code Online (Sandbox Code Playgroud)

这些approved元素形成列名称,并且任何messy包含匹配单词/数字的元素出现在该列的单元格中。我承认会有一些错误的匹配。这很好,我可以稍后手动过滤它们,并且可能会从模式匹配中排除诸如“forest”和“hill”之类的常见词。

通过拆分messyusing 中的每个元素,我已经能够使用上述示例数据获得我想要的结果,regex但随后我正在处理站点名称列表中的单词/数字列表,并且我不得不使用嵌套循环或者sapply将它们与批准中的元素匹配,因为功能类似于grepgrepl并且str_detect只允许一种模式。由于数据库很大,当我将它应用到整个事情时已经花费了很长时间。我真正想要的是一个功能,它可以:

match(any word in approved[1], any word in messy[1])
Run Code Online (Sandbox Code Playgroud)

要么给我一个TRUE FALSE输出要么提取messy[1]它是否匹配会很棒!

GKi*_*GKi 6

也许您正在寻找adist

x <- adist(messy, approved, fixed=FALSE, ignore.case = TRUE)
y <- t(adist(approved, messy, fixed=FALSE, ignore.case = TRUE))
i <- x == apply(x, 1, min)
y[!i]  <- NA
colnames(y) <- approved
i <- apply(y == apply(y, 1, min, na.rm=TRUE), 2, function(i) messy[i & !is.na(i)])
do.call(cbind, lapply(i, function(x) x[seq_len(max(lengths(i)))]))
#     Cotswold Water Park Pit 28 Cotswold Water Park Pit 14 Robinswood Hill
#[1,] "Pit 28"                   "14"                       "Robinswood"   
#[2,] "28"                       NA                         NA             
#[3,] "CWP Pit 28"               NA                         NA             
#[4,] "Cotswold 28"              NA                         NA             
Run Code Online (Sandbox Code Playgroud)


Ron*_*hah 6

一个基本的 R 选项是:

result <- sapply(approved, function(x) grep(gsub('\\s+', '|', x), messy, value = TRUE))
result
#$`Cotswold Water Park Pit 28`
#[1] "Pit 28"      "28"          "CWP Pit 28"  "Cotswold 28"

#$`Cotswold Water Park Pit 14`
#[1] "Pit 28"      "CWP Pit 28"  "Cotswold 28" "14"         

#$`Robinswood Hill`
#[1] "Robinswood"
Run Code Online (Sandbox Code Playgroud)

这里的逻辑是我们|在 in 的每个空格处插入管道 ( ) 符号,如果有任何单词匹配approvedmessy则返回单词 in 。

要获得与所示格式相同的输出,我们可以执行以下操作:

sapply(result, `[`, 1:max(lengths(result)))

#     Cotswold Water Park Pit 28 Cotswold Water Park Pit 14 Robinswood Hill
#[1,] "Pit 28"                   "Pit 28"                   "Robinswood"   
#[2,] "28"                       "CWP Pit 28"               NA             
#[3,] "CWP Pit 28"               "Cotswold 28"              NA             
#[4,] "Cotswold 28"              "14"                       NA   
Run Code Online (Sandbox Code Playgroud)


Mou*_*d_S 5

一个 tidyverse/tidytext 解决方案

首先将它们变成数据框

require(tidyverse) 
require(tidytext)


## create dataframe for approved 

approved <- c("Cotswold Water Park Pit 28", "Cotswold Water Park Pit 14", "Robinswood Hill")


## create dataframe for messy 

messy <- c("Pit 28", "28", "CWP Pit 28", "Cotswold 28", "14", "Robinswood")
Run Code Online (Sandbox Code Playgroud)

然后使用 tidytext 将它们拆分为 1 个单词 = 1 行,我喜欢在行数发生变化时添加 ID ......

## split into words 

approved_df <- 
tibble(approved = approved) %>%  
  rownames_to_column('approved_id') %>% 
  unnest_tokens(words, approved, 'words', drop = FALSE)

approved_df %>%  head 

# A tibble: 6 x 3
# approved_id approved                   words   
# <chr>       <chr>                      <chr>   
# 1 1           Cotswold Water Park Pit 28 cotswold
# 2 1           Cotswold Water Park Pit 28 water   
# 3 1           Cotswold Water Park Pit 28 park    
# 4 1           Cotswold Water Park Pit 28 pit     
# 5 1           Cotswold Water Park Pit 28 28      
# 6 2           Cotswold Water Park Pit 14 cotswold
    
messy_df <- 
tibble(messy = messy) %>%  
  rownames_to_column('messy_id') %>% 
  unnest_tokens(words, messy, 'words', drop = FALSE)

messy_df %>%  head          
# # A tibble: 6 x 3
# messy_id messy      words
# <chr>    <chr>      <chr>
# 1 1        Pit 28     pit  
# 2 1        Pit 28     28   
# 3 2        28         28   
# 4 3        CWP Pit 28 cwp  
# 5 3        CWP Pit 28 pit  
# 6 3        CWP Pit 28 28   
Run Code Online (Sandbox Code Playgroud)

最后,在单词级别连接两个数据框,计算重叠中的单词数量,然后为每个“凌乱”的字符串分配一个“已批准的”

     ## join the data sets and rank by the number of words in the overlap
  
  messy_df %>%  left_join(approved_df) %>%  
    group_by(messy, messy_id, approved, approved_id) %>%  
    summarise(n_row = n()) %>%  
    ungroup %>%  
    group_by(messy, messy_id) %>%  
    mutate(approved_rank = rank(desc(n_row))) %>%  
    ungroup %>%  
    filter(approved_rank == 1) %>%  
    arrange(messy_id)



  # Joining, by = "words"
  # # A tibble: 6 x 6
  # messy       messy_id approved                   approved_id n_row approved_rank
  # <chr>       <chr>    <chr>                      <chr>       <int>         <dbl>
  # 1 Pit 28      1        Cotswold Water Park Pit 28 1               2             1
  # 2 28          2        Cotswold Water Park Pit 28 1               1             1
  # 3 CWP Pit 28  3        Cotswold Water Park Pit 28 1               2             1
  # 4 Cotswold 28 4        Cotswold Water Park Pit 28 1               2             1
  # 5 14          5        Cotswold Water Park Pit 14 2               1             1
  # 6 Robinswood  6        Robinswood Hill            3               1             1
Run Code Online (Sandbox Code Playgroud)


Wim*_*pel 5

这是一个高度灵活的 regex_join 解决方案

library( fuzzyjoin )
library( data.table )
#make data.frames
messy.df <- data.frame( messy ); approved.df <- data.frame( approved )
#create regexes
messy.df$regex <- gsub( " ", "|", messy.df$messy )
#regex join
ans <- regex_full_join( approved.df, messy.df, by = c("approved" = "regex") )
#cast to wide
dcast( setDT(ans), messy~approved, value.var = "messy")[, -1]

#      Cotswold Water Park Pit 14 Cotswold Water Park Pit 28 Robinswood Hill
#   1:                         14                       <NA>            <NA>
#   2:                       <NA>                         28            <NA>
#   3:                 CWP Pit 28                 CWP Pit 28            <NA>
#   4:                Cotswold 28                Cotswold 28            <NA>
#   5:                     Pit 28                     Pit 28            <NA>
#   6:                       <NA>                       <NA>      Robinswood
Run Code Online (Sandbox Code Playgroud)