Tom*_*ers 12 regex string substring r
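I am looking for an efficient way to extract all matches between two substrings in a string. For example, I want to extract all substrings contained between the strings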
start="strt"
and
stop="stp"
in the string
x="strt111stpblablastrt222stp"
I would like to get the vector
"111" "222"
What is the most efficient way to do this in R? Perhaps with a regular expression? Or is there a better way?
hwn*_*wnd 14
For something as simple as this, base R handles it just fine.
x <- 'strt111stpblablastrt222stp'
regmatches(x, gregexpr('(?<=strt).*?(?=stp)', x, perl=TRUE))[[1]]
# [1] "111" "222"
Explanation:
(?<=     # look behind to see if there is:
  strt   #   'strt'
)        # end of look-behind
.*?      # any character except \n (0 or more times, matching as few as possible)
(?=      # look ahead to see if there is:
  stp    #   'stp'
)        # end of look-ahead
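The lazy quantifier is what keeps each match short. As a small illustration (a sketch in base R only; the variable names greedy and lazy are mine, not part of the answer), compare the same pattern with and without the trailing ?:

```r
x <- "strt111stpblablastrt222stp"

# Greedy: .* grabs as much as it can, then backtracks to the LAST 'stp'
greedy <- regmatches(x, gregexpr("(?<=strt).*(?=stp)", x, perl = TRUE))[[1]]
greedy
# [1] "111stpblablastrt222"

# Lazy: .*? stops at the FIRST 'stp' that follows each 'strt'
lazy <- regmatches(x, gregexpr("(?<=strt).*?(?=stp)", x, perl = TRUE))[[1]]
lazy
# [1] "111" "222"
```

With the greedy form, the single match runs from the first strt to the last stp, which is rarely what is wanted when the delimiters repeat.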
EDIT: Updated the answers below to reflect the new syntax.
You may also consider using the stringi package.
library(stringi)
x <- 'strt111stpblablastrt222stp'
stri_extract_all_regex(x, '(?<=strt).*?(?=stp)')[[1]]
# [1] "111" "222"
And rm_between from the qdapRegex package.
library(qdapRegex)
x <- 'strt111stpblablastrt222stp'
rm_between(x, 'strt', 'stp', extract=TRUE)[[1]]
# [1] "111" "222"
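The regex calls above hard-code the delimiters into the pattern. If the delimiters vary, or contain regex metacharacters, one option is to wrap the base R approach in a small function. This is only a sketch: extract_between is a hypothetical name, not a function from any package, and it assumes neither delimiter contains the sequence \E.

```r
# Hypothetical helper (not from any package): extract everything between two
# literal delimiters. \Q...\E makes PCRE treat the delimiters as plain text,
# so this assumes neither delimiter contains the sequence "\E".
extract_between <- function(x, start, stop) {
  pattern <- paste0("(?<=\\Q", start, "\\E).*?(?=\\Q", stop, "\\E)")
  # Returns a list with one character vector per element of x
  regmatches(x, gregexpr(pattern, x, perl = TRUE))
}

extract_between("strt111stpblablastrt222stp", "strt", "stp")[[1]]
# [1] "111" "222"

# Delimiters that are regex metacharacters are matched literally
extract_between("a[1]b[22]c", "[", "]")[[1]]
# [1] "1"  "22"
```

Quoting with \Q...\E avoids a hand-written escaping routine, at the cost of the stated assumption about \E.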
If you are talking about the speed of string processing in R, then there is only one package for the job: stringi
library(stringi)
library(stringr)
library(microbenchmark)
# akrun() additionally needs library(qdap) for genXtract(); it is not benchmarked below
x <- "strt111stpblablastrt222stp"
hwnd <- function(x1) regmatches(x1, gregexpr('(?<=strt).*?(?=stp)', x1, perl=TRUE))
Tim <- function(x1) regmatches(x1, gregexpr("(?<=strt)(?:(?!stp).)*", x1, perl=TRUE))
# stringr and stringi use ICU regex, which supports lookarounds, so no perl() wrapper is needed
stringr <- function(x1) str_extract_all(x1, '(?<=strt).*?(?=stp)')
akrun <- function(x1) genXtract(x1, "strt", "stp")
stringi <- function(x1) stri_extract_all_regex(x1, '(?<=strt).*?(?=stp)')
microbenchmark(stringi(x), hwnd(x), Tim(x), stringr(x))
Unit: microseconds
      expr     min       lq  median       uq     max neval
stringi(x)  46.778  58.1030  64.017  67.3485 123.398   100
   hwnd(x)  61.498  73.1095  79.084  85.5190 111.757   100
    Tim(x)  60.243  74.6830  80.755  86.3370 102.678   100
stringr(x) 236.081 261.9425 272.115 279.6750 440.036   100
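Unfortunately I could not test @akrun's solution, because the qdap package threw errors during installation. His is the only solution that looks like it could beat stringi...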
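A timing comparison is only meaningful if the candidates return the same thing. The two base R patterns benchmarked above agree on the example string, but they are not equivalent in general. The following sketch uses base R only; the names extract, lazy_pattern, tempered_pattern, complete and unclosed are introduced here for illustration.

```r
# Small wrapper so both patterns are applied the same way
extract <- function(x, pattern) {
  regmatches(x, gregexpr(pattern, x, perl = TRUE))[[1]]
}

lazy_pattern     <- "(?<=strt).*?(?=stp)"      # pattern used in hwnd()
tempered_pattern <- "(?<=strt)(?:(?!stp).)*"   # pattern used in Tim()

# On well-formed input both patterns give the same result
complete <- "strt111stpblablastrt222stp"
identical(extract(complete, lazy_pattern), extract(complete, tempered_pattern))
# [1] TRUE

# If the last 'strt' has no closing 'stp', the results differ:
unclosed <- "strt111stpstrt333"
extract(unclosed, lazy_pattern)       # requires a following 'stp'
# [1] "111"
extract(unclosed, tempered_pattern)   # runs to the end of the string
# [1] "111" "333"
```

Which behaviour is preferable depends on whether an unterminated block should count as a match.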
You could also consider:
library(qdap)
unname(genXtract(x, "strt", "stp"))
#[1] "111" "222"
Speed comparison
x1 <- rep(x,1e5)
system.time(res1 <- regmatches(x1, gregexpr('(?<=strt).*?(?=stp)', x1, perl=TRUE)))
# user system elapsed
# 2.187 0.000 2.015
system.time(res2 <- regmatches(x1, gregexpr("(?<=strt)(?:(?!stp).)*", x1, perl=TRUE)))
# user system elapsed
# 1.902 0.000 1.780
system.time(res3 <- str_extract_all(x1, '(?<=strt).*?(?=stp)')) # perl() wrapper dropped; stringr's ICU regex supports lookarounds
# user system elapsed
# 6.990 0.000 6.636
system.time(res4 <- genXtract(x1, "strt", "stp")) ##setNames(genXtract(...), NULL) is a bit slower
# user system elapsed
# 1.457 0.000 1.414
names(res4) <- NULL
identical(res1,res4)
#[1] TRUE
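When the input is a vector, as with x1 above, the regmatches approach returns a list with one element per input string. A brief sketch of what that looks like and how it might be flattened, using base R only (the example strings are made up for illustration):

```r
x <- c("strt111stpblablastrt222stp", "no delimiters here", "strtAstp")
res <- regmatches(x, gregexpr("(?<=strt).*?(?=stp)", x, perl = TRUE))

# Number of matches found in each input string
lengths(res)
# [1] 2 0 1

# All matches as one character vector; strings without a match contribute nothing
unlist(res)
# [1] "111" "222" "A"
```

Keeping the list form preserves which match came from which input, which is lost after unlist().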