用于过滤数据帧的 stringr::str_detect 的模糊版本

Bjö*_*örn 11 r string-matching fuzzy-comparison levenshtein-distance stringr

我有一个带有自由文本字段的数据库,我想将其用于filteradata.frametibble。我也许可以通过大量工作创建一个数据中当前出现的搜索词的所有可能拼写错误的列表(请参阅下面一个术语的所有拼写示例),然后我可以像下面的示例代码一样stringr::str_detect使用。然而,当将来可能出现更多拼写错误时,这并不安全。如果我愿意接受一些限制/做出一些假设(例如,拼写错误之间的编辑距离可能有多远,或者就其他一些差异而言,人们不会使用完全不同的术语等),是否有一些做模糊版本的简单解决方案str_detect

据我所知,明显的软件包似乎stringdist没有直接执行此操作的功能。我想我可以编写自己的函数,将类似stringdist::afind或的东西应用于向量的每个元素,并后处理结果以最终返回或布尔stringdist::amatch值的向量,但我想知道这个函数是否不存在于某处(并且更有效地实现)比我会做的)。TRUEFALSE

这是一个示例,说明了我如何str_detect可能会错过我想要的一行:

library(tidyverse)

search_terms = c("preclinical", "Preclincal", "Preclincial", "Preclinial", 
                 "Precllinical", "Preclilnical", "Preclinica", "Preclnical", 
                 "Peclinical", "Prclinical", "Peeclinical", "Pre clinical", 
                 "Precclinical", "Preclicnial", "Precliical", "Precliinical", 
                 "Preclinal", "Preclincail", "Preclinicgal", "Priclinical")

example_data = tibble(project=c("A111", "A123", "B112", "A224", "C149"),
                      disease_phase=c("Diabetes, Preclinical", "Lipid lowering, Perlcinical", 
                                      "Asthma, Phase I", "Phase II; Hypertension", "Phase 3"),
                      startdate = c("01DEC2018", "17-OKT-2017", "11/15/2019", "1. Dezember 2004", "2005-11-30")) 

# Finds only project A111, but not A123
example_data %>%
  filter(str_detect(tolower(disease_phase), paste0(tolower(search_terms), collapse="|")))
Run Code Online (Sandbox Code Playgroud)

GKi*_*GKi 8

您可以使用baseagrepl中的近似字符串匹配(模糊匹配)。

\n
example_data[agrep(paste(search_terms, collapse = "|"),\n  example_data$disease_phase, 2, ignore.case=TRUE, fixed=FALSE),]\n#  project               disease_phase   startdate\n#1    A111       Diabetes, Preclinical   01DEC2018\n#2    A123 Lipid lowering, Perlcinical 17-OKT-2017\n
Run Code Online (Sandbox Code Playgroud)\n

Reduce或者在正则|表达式中使用代替。

\n
example_data[Reduce(\\(y, x) y | agrepl(x, example_data$disease_phase, 2,\n  ignore.case=TRUE), search_terms, FALSE),]\n#  project               disease_phase   startdate\n#1    A111       Diabetes, Preclinical   01DEC2018\n#2    A123 Lipid lowering, Perlcinical 17-OKT-2017\n
Run Code Online (Sandbox Code Playgroud)\n

另一种选择可能是adist,也在base中,它计算距离矩阵 - 因此可能不建议用于较大的向量,因为矩阵可能会变大。这里我也选择错配2个字符就可以了。

\n
example_data[colSums(adist(unique(search_terms), example_data$disease_phase,\n                           partial=TRUE) < 3) > 0,]\n#  project               disease_phase   startdate\n#1    A111       Diabetes, Preclinical   01DEC2018\n#2    A123 Lipid lowering, Perlcinical 17-OKT-2017\n
Run Code Online (Sandbox Code Playgroud)\n

如果仅比较单个单词,它可能会更有效,因此也可以在basedisease_phase中使用将其拆分为单词。strsplit

\n
. <- strsplit(example_data$disease_phase, "[ ,;]+")\n. <- split(rep(seq_along(.), lengths(.)), tolower(unlist(.)))\nexample_data[unique(unlist(.[Reduce(\\(y, x) `[<-`(y, !y, agrepl(x, names(.)[!y],\n   2)), tolower(search_terms), logical(length(.)))], FALSE, FALSE)),]\n#example_data[unique(unlist(.[Reduce(\\(y, x) y | agrepl(x, names(.), 2),\n#   tolower(search_terms), FALSE)], FALSE, FALSE)),] #Alternative\n#  project               disease_phase   startdate\n#2    A123 Lipid lowering, Perlcinical 17-OKT-2017\n#1    A111       Diabetes, Preclinical   01DEC2018\n
Run Code Online (Sandbox Code Playgroud)\n

一些更简单的例子使用agrep

\n
#Allow 1 character difference to make match\nagrepl("preclinical", c("precinical", "precinicalxyz", "prelcinical"), 1)\n#[1]  TRUE  TRUE FALSE\n\n#Allow 2 character difference to make match\nagrepl("preclinical", c("precinical", "precinicalxyz", "prelcinical"), 2)\n#[1] TRUE TRUE TRUE\n\n#Use boundaries to match words\nagrepl("\\\\bpreclinical\\\\b", c("xyz precinical xyz", "xyzpreclinicalxyz"), 1, fixed=FALSE)\n#[1]  TRUE FALSE\n
Run Code Online (Sandbox Code Playgroud)\n

允许的差异可以通过 max.distance 设置:

\n
max.distance: Maximum distance allowed for a match.  Expressed either\n          as integer, or as a fraction of the _pattern_ length times\n          the maximal transformation cost (will be replaced by the\n          smallest integer not less than the corresponding fraction),\n          or a list with possible components\n\n          \xe2\x80\x98cost\xe2\x80\x99: maximum number/fraction of match cost (generalized\n              Levenshtein distance)\n\n          \xe2\x80\x98all\xe2\x80\x99: maximal number/fraction of _all_ transformations\n              (insertions, deletions and substitutions)\n\n          \xe2\x80\x98insertions\xe2\x80\x99: maximum number/fraction of insertions\n\n          \xe2\x80\x98deletions\xe2\x80\x99: maximum number/fraction of deletions\n\n          \xe2\x80\x98substitutions\xe2\x80\x99: maximum number/fraction of substitutions\n
Run Code Online (Sandbox Code Playgroud)\n

还有一个基于@JBGruber的基准:

\n
system.time({  #Libraries needed for method of JBGruber\nlibrary(dplyr);\nlibrary(stringdist);\nlibrary(Rfast);\nlibrary(tidytext)\n})\n#       User      System verstrichen \n#      1.008       0.040       1.046 \n\nset.seed(42)\nexample_large <- example_data %>% sample_n(5000, replace = TRUE)\n\nstringdist_detect <- function(a, b, method = "osa", thres = 2) {\n  Rfast::rowMins(stringdist::stringdistmatrix(a, b, method = method), value = TRUE) <= thres\n}\n\nbench::mark(check = FALSE,\n  stringdist_detect = {\n     example_large %>% \n      tidytext::unnest_tokens(output = "word", input = "disease_phase", drop = FALSE) %>% \n      filter(stringdist_detect(word, tolower(search_terms), method = "lv"))\n  },\n  GKi ={. <- strsplit(example_large$disease_phase, "[ ,;]+")\n   . <- split(rep(seq_along(.), lengths(.)), tolower(unlist(.)))\n   example_large[unique(unlist(.[Reduce(\\(y, x) y | agrepl(x, names(.), 2),\n     tolower(search_terms), FALSE)], FALSE, FALSE)),]\n})\n#  expression             min   median `itr/sec` mem_alloc `gc/sec` n_itr  n_gc\n#  <bch:expr>        <bch:tm> <bch:tm>     <dbl> <bch:byt>    <dbl> <int> <dbl>\n#1 stringdist_detect  17.42ms  18.65ms      52.8    7.15MB    19.4     19     7\n#2 GKi                 5.64ms   6.04ms     165.   869.08KB     6.27    79     3\n
Run Code Online (Sandbox Code Playgroud)\n

当 中只有一种正确书写的感兴趣单词的变体时,也可以节省大量时间search_terms

\n