Difference of two character vectors with substring matching

oli*_*r13 3 r

I have two character vectors:

a <- c("da", "ba", "cs", "dd", "ek")
b <- c("zyc", "ulk", "mae", "csh", "ddi", "dada")

I would like to remove from b every element that contains any value of a as a substring, e.g.

grepl("da","dada") # TRUE

How would you do this efficiently?

akr*_*run 10

We can paste the elements of 'a' into a single string with | as the separator, use that string as the pattern in grepl, and negate (!) the result to subset 'b'.

 b[!grepl(paste(a, collapse="|"), b)]
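
One caveat with the alternation approach: every element of 'a' is interpreted as a regular expression, so a value such as "a.b" would also match "axb". Below is a minimal sketch of one way around that, assuming PCRE is acceptable (perl = TRUE) and that no element of 'a' contains the literal sequence \E. The vectors a2 and b2 are made up for illustration and are not part of the question.

```r
a <- c("da", "ba", "cs", "dd", "ek")
b <- c("zyc", "ulk", "mae", "csh", "ddi", "dada")

# Same idea as the answer: one alternation pattern, negated match.
pattern <- paste(a, collapse = "|")   # "da|ba|cs|dd|ek"
kept <- b[!grepl(pattern, b)]
kept
# [1] "zyc" "ulk" "mae"

# Illustrative data (assumption): "a.b" contains a regex metacharacter.
a2 <- c("a.b", "cs")
b2 <- c("a.b", "axb", "csh")

# Wrap each element in \Q...\E so PCRE treats it literally.
literal_pattern <- paste0("\\Q", a2, "\\E", collapse = "|")
kept_literal <- b2[!grepl(literal_pattern, b2, perl = TRUE)]
kept_literal
# [1] "axb"
```

Without the \Q...\E quoting, "axb" would be dropped as well, because the unescaped dot matches any character.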


Jan*_*aan 5

Another solution, using a simple for loop:

sel <- rep(FALSE, length(b))
for (i in seq_along(a)) {
  sel <- sel | grepl(a[i], b, fixed = TRUE)
}
b[!sel]

Not as elegant as the other solutions (especially akrun's), but it shows that for loops in R are not always as slow as people assume:

fun1 <- function(a, b) {
  sel <- rep(FALSE, length(b))
  for (i in seq_along(a)) {
    sel <- sel | grepl(a[i], b, fixed = TRUE)
  }
  b[!sel]
}

fun2 <- function(a, b) {
  b[!apply(sapply(a, function(x) grepl(x,b, fixed=TRUE)),1,sum)]
}

fun3 <- function(a, b) {
  b[-which(sapply(a, grepl, b, fixed=TRUE), arr.ind = TRUE)[, "row"]]
}

fun4 <- function(a, b) {
  b[!grepl(paste(a, collapse="|"), b)]
}

library(stringr)
fun5 <- function(a, b) {
  b[!sapply(b, function(u) any(str_detect(u,a)))]
}

a <- c("da", "ba", "cs", "dd", "ek")
b <- c("zyc", "ulk", "mae", "csh", "ddi", "dada")
b <- rep(b, length.out = 1E3)

library(microbenchmark)
microbenchmark(fun1(a, b), fun2(a, b), fun3(a,b), fun4(a,b), fun5(a,b))


# Unit: microseconds
#       expr       min        lq       mean    median         uq        max neval  cld
# fun1(a, b)   389.630   399.128   408.6146   406.007   411.7690    540.969   100 a   
# fun2(a, b)  5274.143  5445.038  6183.3945  5544.522  5762.1750  35830.143   100   c 
# fun3(a, b)  2568.734  2629.494  2691.8360  2686.552  2729.0840   2956.618   100  b  
# fun4(a, b)   482.585   511.917   530.0885   528.993   541.6685    779.679   100 a   
# fun5(a, b) 53846.970 54293.798 56337.6531 54861.585 55184.3100 132921.883   100    d
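
For reference, the accumulation done by the loop in fun1 can also be written without an explicit for. This is only a sketch of an equivalent formulation using base R's Reduce; it was not part of the benchmark above, so no claim is made about its speed.

```r
a <- c("da", "ba", "cs", "dd", "ek")
b <- c("zyc", "ulk", "mae", "csh", "ddi", "dada")

# One logical vector per pattern, each the same length as b.
# Each element of a is passed to grepl as the (literal) pattern.
hits <- lapply(a, grepl, x = b, fixed = TRUE)

# Combine element-wise with OR; the initial all-FALSE vector
# also covers the case where a is empty.
sel <- Reduce(`|`, hits, rep(FALSE, length(b)))

result <- b[!sel]
result
# [1] "zyc" "ulk" "mae"
```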

  • Yes, a benchmark measured in microseconds is not meaningful; you should create a bigger data set, IMO (2 upvotes)
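
Following up on that comment, here is a rough sketch of how a larger test set could be built and timed with base R only. The size n, the seed, the random four-letter strings, and the names loop_fixed and regex_alt are assumptions made for illustration (they mirror fun1 and fun4 above); timings will vary by machine, so none are shown.

```r
set.seed(42)  # arbitrary seed, only for reproducibility
a <- c("da", "ba", "cs", "dd", "ek")
n <- 1e5      # assumed size; adjust as needed

# Random four-letter strings built from vectorised sample() calls.
b_big <- paste0(sample(letters, n, TRUE), sample(letters, n, TRUE),
                sample(letters, n, TRUE), sample(letters, n, TRUE))

# Same logic as fun1: accumulate literal matches in a loop.
loop_fixed <- function(a, b) {
  sel <- rep(FALSE, length(b))
  for (i in seq_along(a)) {
    sel <- sel | grepl(a[i], b, fixed = TRUE)
  }
  b[!sel]
}

# Same logic as fun4: a single regex alternation.
regex_alt <- function(a, b) {
  b[!grepl(paste(a, collapse = "|"), b)]
}

# Both approaches should agree before their timings are compared.
same_result <- identical(loop_fixed(a, b_big), regex_alt(a, b_big))

# Repeat each call a few times so the elapsed time is measurable.
system.time(for (k in 1:20) loop_fixed(a, b_big))
system.time(for (k in 1:20) regex_alt(a, b_big))
```

The same two functions can be passed to microbenchmark as in the answer above if that package is installed.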