R: fastest way to extract all substrings contained between two substrings

Question

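I am looking for an efficient way to extract all matches between two substrings in a string. For example, I want to extract all substrings contained between the strings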

start="strt"

and

stop="stp"

in the string

x="strt111stpblablastrt222stp"

I would like to get the vector

"111" "222"

What is the most efficient way to do this in R? Perhaps with a regular expression? Or is there a better way?

Answer 1

hwn*_*wnd 14

For something as simple as this, base R handles it just fine.

You can switch on PCRE by using perl=T and then use lookaround assertions.

x <- 'strt111stpblablastrt222stp'
regmatches(x, gregexpr('(?<=strt).*?(?=stp)', x, perl=T))[[1]]
# [1] "111" "222"

Explanation:

(?<=          # look behind to see if there is:
  strt        #   'strt'
)             # end of look-behind
.*?           # any character except \n (0 or more times, as few as possible)
(?=           # look ahead to see if there is:
  stp         #   'stp'
)             # end of look-ahead
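
The `?` after `.*` is what keeps each match from running past the first 'stp'. A small base R sketch comparing the lazy pattern from this answer with a greedy variant (the variable names `lazy` and `greedy` are only for illustration):

```r
x <- 'strt111stpblablastrt222stp'

# Lazy quantifier: each match stops at the first 'stp' after a 'strt'
lazy <- regmatches(x, gregexpr('(?<=strt).*?(?=stp)', x, perl = TRUE))[[1]]

# Greedy quantifier: the first match runs on to the last 'stp' in the string
greedy <- regmatches(x, gregexpr('(?<=strt).*(?=stp)', x, perl = TRUE))[[1]]

print(lazy)
# [1] "111" "222"
print(greedy)
# [1] "111stpblablastrt222"
```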

EDIT: Updated the answers below to the new syntax.

You can also consider using the stringi package.

library(stringi)
x <- 'strt111stpblablastrt222stp'
stri_extract_all_regex(x, '(?<=strt).*?(?=stp)')[[1]]
# [1] "111" "222"

And rm_between from the qdapRegex package.

library(qdapRegex)
x <- 'strt111stpblablastrt222stp'
rm_between(x, 'strt', 'stp', extract=TRUE)[[1]]
# [1] "111" "222"
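
If you want a similar delimiter-based call without installing an extra package, the base R regex from the top of this answer can be wrapped in a small function. This is only a sketch: `extract_between` is a hypothetical helper name, not part of qdapRegex or any other package, and it assumes the delimiters are literal strings that do not themselves contain `\E`.

```r
# Hypothetical helper (not from any package): extract the text between two
# literal delimiters, returning one character vector per input string.
extract_between <- function(x, start, stop) {
  # \Q...\E tells PCRE to treat the delimiters as literal text, so
  # characters such as ( or [ do not need manual escaping
  pattern <- paste0("(?<=\\Q", start, "\\E).*?(?=\\Q", stop, "\\E)")
  regmatches(x, gregexpr(pattern, x, perl = TRUE))
}

extract_between("strt111stpblablastrt222stp", "strt", "stp")[[1]]
# [1] "111" "222"
extract_between("f(a) + g(b)", "(", ")")[[1]]
# [1] "a" "b"
```

Because the result is a list, the same call also works on a character vector with several strings.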

Answer 2

bar*_*nus 6

If you are talking about the speed of string processing in R, then there is only one package for the job - stringi.

 library(stringi)  # stri_extract_all_regex
 library(stringr)  # str_extract_all; perl() is the older stringr wrapper, deprecated in later releases

 x <- "strt111stpblablastrt222stp"
 hwnd <- function(x1) regmatches(x1,gregexpr('(?<=strt).*?(?=stp)',x1,perl=T))
 Tim <- function(x1) regmatches(x1, gregexpr("(?<=strt)(?:(?!stp).)*", x1, perl=TRUE))
 stringr <- function(x1) str_extract_all(x1, perl('(?<=strt).*?(?=stp)'))
 akrun <- function(x1) genXtract(x1, "strt", "stp")  # needs qdap; not benchmarked below
 stringi <- function(x1) stri_extract_all_regex(x1, perl('(?<=strt).*?(?=stp)'))

 require(microbenchmark)
 microbenchmark(stringi(x), hwnd(x), Tim(x), stringr(x))
Unit: microseconds
       expr     min       lq  median       uq     max neval
 stringi(x)  46.778  58.1030  64.017  67.3485 123.398   100
    hwnd(x)  61.498  73.1095  79.084  85.5190 111.757   100
     Tim(x)  60.243  74.6830  80.755  86.3370 102.678   100
 stringr(x) 236.081 261.9425 272.115 279.6750 440.036   100

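Unfortunately I could not test @akrun's solution, because the qdap package had some errors during installation. His solution is the only one that looks like it could beat stringi...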

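I am not just a `stringi` fan - I am its author :) (7 upvotes)
I would expect `genXtract` to be much slower (10-20 times slower). Its strengths are flexibility and ease of use. In many cases the researcher's time is more valuable than computing time. If that is the case, `genXtract` is a good choice. If you are after speed, then I, like you, am a big fan of `stringi`. (6 upvotes)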

Answer 3

akr*_*run 5

You could also consider:

library(qdap)
unname(genXtract(x, "strt", "stp"))
#[1] "111" "222"

Speed comparison

x1 <- rep(x,1e5)
system.time(res1 <- regmatches(x1,gregexpr('(?<=strt).*?(?=stp)',x1,perl=T)))
#   user  system elapsed
#  2.187   0.000   2.015

system.time(res2 <- regmatches(x1, gregexpr("(?<=strt)(?:(?!stp).)*", x1, perl=TRUE)))
#   user  system elapsed
#  1.902   0.000   1.780

system.time(res3 <- str_extract_all(x1, perl('(?<=strt).*?(?=stp)')))
#   user  system elapsed
#  6.990   0.000   6.636

system.time(res4 <- genXtract(x1, "strt", "stp")) ##setNames(genXtract(...), NULL) is a bit slower
#   user  system elapsed
#  1.457   0.000   1.414

names(res4) <- NULL
identical(res1,res4)
#[1] TRUE
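
The two base R patterns timed above stop at 'stp' in different ways: one uses a lazy quantifier with a look-ahead, the other consumes characters only while 'stp' does not start at the current position. A quick base R check (the variable names are only for illustration) shows that they agree on the example string but differ when a 'strt' has no closing 'stp':

```r
x <- "strt111stpblablastrt222stp"

lazy     <- regmatches(x, gregexpr('(?<=strt).*?(?=stp)', x, perl = TRUE))
tempered <- regmatches(x, gregexpr('(?<=strt)(?:(?!stp).)*', x, perl = TRUE))
print(identical(lazy, tempered))
# [1] TRUE

# With no closing 'stp', the lazy pattern finds nothing, while the
# negative look-ahead pattern still returns the trailing text
y <- "strt111"
lazy_open     <- regmatches(y, gregexpr('(?<=strt).*?(?=stp)', y, perl = TRUE))[[1]]
tempered_open <- regmatches(y, gregexpr('(?<=strt)(?:(?!stp).)*', y, perl = TRUE))[[1]]
print(lazy_open)
# character(0)
print(tempered_open)
# [1] "111"
```

So the faster-looking pattern is only a drop-in replacement when every 'strt' in the input is known to be followed by a 'stp'.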

Asked:	11 years, 11 months ago
Viewed:	2784 times
Last active:	10 years, 10 months ago