Jam*_*mes 10 regex string r string-matching stringr
我正在尝试通过将杂乱的站点名称列表与已批准的列表进行匹配来清理数据库。
例如,首选站点名称可能是“Cotswold Water Park Pit 28”,但该站点已输入到数据库中:“Pit 28”、“28”、“CWP Pit 28”和“Cotswold 28”。
数据看起来像这样:
approved <- c("Cotswold Water Park Pit 28", "Cotswold Water Park Pit 14", "Robinswood Hill")
messy <- c("Pit 28", "28", "CWP Pit 28", "Cotswold 28", "14", "Robinswood")
Run Code Online (Sandbox Code Playgroud)
我正在寻找一种方法来将每个元素中的单词/数字(非空格字符簇)messy与approved. 理想情况下,我最终会得到这样的结果:
Cotswold Water Park Pit 28 Cotswold Water Park Pit 14 Robinswood Hill
[1,] "Pit 28" "Pit 28" "Robinswood"
[2,] "28" "CWP Pit 28" NA
[3,] "CWP Pit 28" "14" NA
[4,] "Cotswold 28" NA NA
Run Code Online (Sandbox Code Playgroud)
这些approved元素形成列名称,并且任何messy包含匹配单词/数字的元素出现在该列的单元格中。我承认会有一些错误的匹配。这很好,我可以稍后手动过滤它们,并且可能会从模式匹配中排除诸如“forest”和“hill”之类的常见词。
通过拆分messyusing 中的每个元素,我已经能够使用上述示例数据获得我想要的结果,regex但随后我正在处理站点名称列表中的单词/数字列表,并且我不得不使用嵌套循环或者sapply将它们与批准中的元素匹配,因为功能类似于grep,grepl并且str_detect只允许一种模式。由于数据库很大,当我将它应用到整个事情时已经花费了很长时间。我真正想要的是一个功能,它可以:
match(any word in approved[1], any word in messy[1])
Run Code Online (Sandbox Code Playgroud)
要么给我一个TRUE FALSE输出要么提取messy[1]它是否匹配会很棒!
也许您正在寻找adist:
x <- adist(messy, approved, fixed=FALSE, ignore.case = TRUE)
y <- t(adist(approved, messy, fixed=FALSE, ignore.case = TRUE))
i <- x == apply(x, 1, min)
y[!i] <- NA
colnames(y) <- approved
i <- apply(y == apply(y, 1, min, na.rm=TRUE), 2, function(i) messy[i & !is.na(i)])
do.call(cbind, lapply(i, function(x) x[seq_len(max(lengths(i)))]))
# Cotswold Water Park Pit 28 Cotswold Water Park Pit 14 Robinswood Hill
#[1,] "Pit 28" "14" "Robinswood"
#[2,] "28" NA NA
#[3,] "CWP Pit 28" NA NA
#[4,] "Cotswold 28" NA NA
Run Code Online (Sandbox Code Playgroud)
一个基本的 R 选项是:
result <- sapply(approved, function(x) grep(gsub('\\s+', '|', x), messy, value = TRUE))
result
#$`Cotswold Water Park Pit 28`
#[1] "Pit 28" "28" "CWP Pit 28" "Cotswold 28"
#$`Cotswold Water Park Pit 14`
#[1] "Pit 28" "CWP Pit 28" "Cotswold 28" "14"
#$`Robinswood Hill`
#[1] "Robinswood"
Run Code Online (Sandbox Code Playgroud)
这里的逻辑是我们|在 in 的每个空格处插入管道 ( ) 符号,如果有任何单词匹配approved,messy则返回单词 in 。
要获得与所示格式相同的输出,我们可以执行以下操作:
sapply(result, `[`, 1:max(lengths(result)))
# Cotswold Water Park Pit 28 Cotswold Water Park Pit 14 Robinswood Hill
#[1,] "Pit 28" "Pit 28" "Robinswood"
#[2,] "28" "CWP Pit 28" NA
#[3,] "CWP Pit 28" "Cotswold 28" NA
#[4,] "Cotswold 28" "14" NA
Run Code Online (Sandbox Code Playgroud)
一个 tidyverse/tidytext 解决方案
首先将它们变成数据框
require(tidyverse)
require(tidytext)
## create dataframe for approved
approved <- c("Cotswold Water Park Pit 28", "Cotswold Water Park Pit 14", "Robinswood Hill")
## create dataframe for messy
messy <- c("Pit 28", "28", "CWP Pit 28", "Cotswold 28", "14", "Robinswood")
Run Code Online (Sandbox Code Playgroud)
然后使用 tidytext 将它们拆分为 1 个单词 = 1 行,我喜欢在行数发生变化时添加 ID ......
## split into words
approved_df <-
tibble(approved = approved) %>%
rownames_to_column('approved_id') %>%
unnest_tokens(words, approved, 'words', drop = FALSE)
approved_df %>% head
# A tibble: 6 x 3
# approved_id approved words
# <chr> <chr> <chr>
# 1 1 Cotswold Water Park Pit 28 cotswold
# 2 1 Cotswold Water Park Pit 28 water
# 3 1 Cotswold Water Park Pit 28 park
# 4 1 Cotswold Water Park Pit 28 pit
# 5 1 Cotswold Water Park Pit 28 28
# 6 2 Cotswold Water Park Pit 14 cotswold
messy_df <-
tibble(messy = messy) %>%
rownames_to_column('messy_id') %>%
unnest_tokens(words, messy, 'words', drop = FALSE)
messy_df %>% head
# # A tibble: 6 x 3
# messy_id messy words
# <chr> <chr> <chr>
# 1 1 Pit 28 pit
# 2 1 Pit 28 28
# 3 2 28 28
# 4 3 CWP Pit 28 cwp
# 5 3 CWP Pit 28 pit
# 6 3 CWP Pit 28 28
Run Code Online (Sandbox Code Playgroud)
最后,在单词级别连接两个数据框,计算重叠中的单词数量,然后为每个“凌乱”的字符串分配一个“已批准的”
## join the data sets and rank by the number of words in the overlap
messy_df %>% left_join(approved_df) %>%
group_by(messy, messy_id, approved, approved_id) %>%
summarise(n_row = n()) %>%
ungroup %>%
group_by(messy, messy_id) %>%
mutate(approved_rank = rank(desc(n_row))) %>%
ungroup %>%
filter(approved_rank == 1) %>%
arrange(messy_id)
# Joining, by = "words"
# # A tibble: 6 x 6
# messy messy_id approved approved_id n_row approved_rank
# <chr> <chr> <chr> <chr> <int> <dbl>
# 1 Pit 28 1 Cotswold Water Park Pit 28 1 2 1
# 2 28 2 Cotswold Water Park Pit 28 1 1 1
# 3 CWP Pit 28 3 Cotswold Water Park Pit 28 1 2 1
# 4 Cotswold 28 4 Cotswold Water Park Pit 28 1 2 1
# 5 14 5 Cotswold Water Park Pit 14 2 1 1
# 6 Robinswood 6 Robinswood Hill 3 1 1
Run Code Online (Sandbox Code Playgroud)
这是一个高度灵活的 regex_join 解决方案
library( fuzzyjoin )
library( data.table )
#make data.frames
messy.df <- data.frame( messy ); approved.df <- data.frame( approved )
#create regexes
messy.df$regex <- gsub( " ", "|", messy.df$messy )
#regex join
ans <- regex_full_join( approved.df, messy.df, by = c("approved" = "regex") )
#cast to wide
dcast( setDT(ans), messy~approved, value.var = "messy")[, -1]
# Cotswold Water Park Pit 14 Cotswold Water Park Pit 28 Robinswood Hill
# 1: 14 <NA> <NA>
# 2: <NA> 28 <NA>
# 3: CWP Pit 28 CWP Pit 28 <NA>
# 4: Cotswold 28 Cotswold 28 <NA>
# 5: Pit 28 Pit 28 <NA>
# 6: <NA> <NA> Robinswood
Run Code Online (Sandbox Code Playgroud)