r - grepl: check whether multiple strings are all present

too*_*lik 3 r grepl

grepl("instance|percentage", labelTest$Text)

This returns TRUE if either instance or percentage is present.

How can I get TRUE only when both terms are present?

Aks*_*elA 11

Text <- c("instance", "percentage", "n", 
          "instance percentage", "percentage instance")

grepl("instance|percentage", Text)
# TRUE  TRUE FALSE  TRUE  TRUE

grepl("instance.*percentage|percentage.*instance", Text)
# FALSE FALSE FALSE TRUE  TRUE

The latter works by looking for:

('instance')(any character sequence)('percentage')  
OR  
('percentage')(any character sequence)('instance')

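Naturally, this gets very complicated if you need to find any combination of more than two words, since every possible ordering of the words needs its own alternative in the pattern. In that case, the solution mentioned in the comments is easier to implement and to read.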
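
For the two-word case from the question, the comment-based approach (which this answer describes further down as separate grepl() calls joined with &) might look like the minimal sketch below. The variable names both and regex_both are only illustrative, and fixed = TRUE is an assumption that the search terms are literal strings rather than regular expressions.

```r
# Same example vector as above
Text <- c("instance", "percentage", "n",
          "instance percentage", "percentage instance")

# One grepl() call per word, combined element-wise with &
both <- grepl("instance", Text, fixed = TRUE) &
        grepl("percentage", Text, fixed = TRUE)

# The alternation-based regex from above, for comparison
regex_both <- grepl("instance.*percentage|percentage.*instance", Text)

both
# FALSE FALSE FALSE  TRUE  TRUE
identical(both, regex_both)
# TRUE
```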

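Another option that can be relevant when matching many words is positive look-ahead (which can be thought of as a "non-consuming" match). For this you have to enable Perl-compatible regular expressions with perl=TRUE.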

# create a vector of word combinations
set.seed(1)
words <- c("instance", "percentage", "element",
           "character", "n", "o", "p")
Text2 <- replicate(10, paste(sample(words, 5), collapse=" "))

# grepl with multiple positive look-ahead
longperl <- grepl("(?=.*instance)(?=.*percentage)(?=.*element)(?=.*character)",
  Text2, perl=TRUE)

# this is equivalent to the solution proposed in the comments
longstrd <- grepl("instance", Text2) & 
          grepl("percentage", Text2) & 
             grepl("element", Text2) & 
           grepl("character", Text2)

# they produce identical results
identical(longperl, longstrd)

Furthermore, if you store the patterns in a vector, you can condense the expressions considerably, giving you

pat <- c("instance", "percentage", "element", "character")

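# chain one look-ahead per pattern into a single regex
longperl <- grepl(paste0("(?=.*", pat, ")", collapse=""), Text2, perl=TRUE)
# sapply gives one TRUE/FALSE column per pattern; after subtracting 1,
# a row sums to 0 only if every pattern matched
longstrd <- rowSums(sapply(pat, grepl, Text2) - 1L) == 0L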
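
The vectorised idea can also be wrapped in a small helper. The sketch below is one possible way to do it, not part of the original answer: grepl_all is a hypothetical name, and it uses Reduce() with & instead of rowSums(). One reason to consider this variant is that rowSums() needs a matrix, so the sapply() version fails when the text vector has length one, because sapply() then simplifies its result to a plain vector.

```r
# Hypothetical helper: TRUE for each element of x that matches every pattern
grepl_all <- function(patterns, x, ...) {
  # One logical vector per pattern; extra arguments are passed on to grepl()
  hits <- lapply(patterns, grepl, x = x, ...)
  # Element-wise AND across all of those vectors
  Reduce(`&`, hits)
}

pat <- c("instance", "percentage", "element", "character")
x <- c("instance percentage element character",
       "character element",
       "percentage of instance elements per character")

grepl_all(pat, x)
# TRUE FALSE  TRUE
grepl_all(pat, x[2])   # a single string works as well
# FALSE
```

Arguments such as fixed = TRUE or perl = TRUE can be passed through the dots if literal or Perl-style matching is wanted.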

As requested in the comments, if you want to match exact words, i.e. not substrings, we can specify word boundaries using \\b. For example:

tx <- c("cent element", "percentage element", "element cent", "element centimetre")

grepl("(?=.*\\bcent\\b)(?=.*element)", tx, perl=TRUE)
# TRUE FALSE  TRUE FALSE
grepl("element", tx) & grepl("\\bcent\\b", tx)
# TRUE FALSE  TRUE FALSE
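
The word-boundary idea can be combined with the pattern-vector trick from earlier. The following is a sketch under the assumption that every word should be matched as a whole word and contains no regex metacharacters (otherwise the words would need escaping first). The names whole and pattern are illustrative only.

```r
tx <- c("cent element", "percentage element", "element cent", "element centimetre")

# Words that must each appear as a whole word
whole <- c("cent", "element")

# Wrap each word in \\b ... \\b inside its own look-ahead, then chain them
pattern <- paste0("(?=.*\\b", whole, "\\b)", collapse = "")

grepl(pattern, tx, perl = TRUE)
# TRUE FALSE  TRUE FALSE
```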