How to extract multiple strings from a single line using R

use*_*423 5 regex text r readlines

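I want to extract multiple strings from a single line.

Suppose I have the following line of text (read from a website using the 'readLines' function):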

line <- "abc:city1-street1-long1-lat1,ldjad;skj//abc:city2-street2-long2-lat2,ldjad;skj//abc:city3-street3-long3-lat3,ldjad;skj//abc:city3-street3-long3-lat3,ldjad;skj//"

I would like to extract the following as separate elements:

[1] city1-street1-long1-lat1
[2] city2-street2-long2-lat2
[3] city3-street3-long3-lat3
[4] city3-street3-long3-lat3

I hope someone can give me a hint on how to do this.

the*_*ail 5

regmatches to the rescue:

regmatches(line,gregexpr("city\\d+-street\\d+-long\\d+-lat\\d+",line))
#[[1]]
#[1] "city1-street1-long1-lat1"
#[2] "city2-street2-long2-lat2"
#[3] "city3-street3-long3-lat3"
#[4] "city3-street3-long3-lat3"
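
Since readLines usually returns more than one line, here is a minimal sketch of the same call applied to a character vector and flattened into a single result. The two-element input and the names lines_in, pattern, matches and addresses are made up for illustration; they are not part of the original question.

```r
# Hypothetical input: two lines, as readLines() might return them
lines_in <- c(
  "abc:city1-street1-long1-lat1,ldjad;skj//abc:city2-street2-long2-lat2,ldjad;skj//",
  "abc:city3-street3-long3-lat3,ldjad;skj//"
)

pattern <- "city\\d+-street\\d+-long\\d+-lat\\d+"

# regmatches() returns a list with one character vector per input line
matches <- regmatches(lines_in, gregexpr(pattern, lines_in))

# unlist() flattens that list into one character vector
addresses <- unlist(matches)
print(addresses)
```

Keeping the list form (matches) is handy when you need to know which line each hit came from; unlist is only for the case where a flat vector is enough.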


gag*_*ews 4

A solution using the stringi package:

library(stringi)
stri_extract_all_regex(line, "(?<=:).+?(?=,)")[[1]]
## [1] "city1-street1-long1-lat1" "city2-street2-long2-lat2" "city3-street3-long3-lat3" "city3-street3-long3-lat3"

And using the stringr package:

library(stringr)
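# Note: perl() comes from older stringr releases; since stringr 1.0.0 it is
# deprecated in favour of regex() or a plain pattern string.
str_extract_all(line, perl("(?<=:).+?(?=,)"))[[1]]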
## [1] "city1-street1-long1-lat1" "city2-street2-long2-lat2" "city3-street3-long3-lat3" "city3-street3-long3-lat3"

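In both cases we use a regular expression. Here we match all characters (non-greedily, i.e. with .+?) that occur between : and ,. The (?<=:) part is a positive lookbehind: a : must be matched, but it is not included in the result. On the other hand, (?=,) is a positive lookahead: a , must be matched but will not appear in the output.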
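
The same lookaround pattern can also be tried without any extra package. The sketch below assumes base R's gregexpr with perl = TRUE (PCRE) and a shortened two-entry version of the input line. The greedy variant is added only to show why the non-greedy .+? matters; it is not from the answer above.

```r
line <- "abc:city1-street1-long1-lat1,ldjad;skj//abc:city2-street2-long2-lat2,ldjad;skj//"

# perl = TRUE switches gregexpr() to PCRE, which supports lookbehind and lookahead
lazy <- regmatches(line, gregexpr("(?<=:).+?(?=,)", line, perl = TRUE))[[1]]

# Same pattern with a greedy .+ for comparison: it runs from the first ':'
# to the last ',' and therefore returns a single, overly long match
greedy <- regmatches(line, gregexpr("(?<=:).+(?=,)", line, perl = TRUE))[[1]]

print(lazy)
print(length(greedy))
```

With the lazy quantifier each entry is returned separately, while the greedy form swallows everything between the first colon and the last comma.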

Some benchmarks:

lines <- stri_dup(line, 250) # duplicate line 250 times
library(microbenchmark)
microbenchmark(
   stri_extract_all_regex(lines, "(?<=:).+?(?=,)")[[1]],
   str_extract_all(lines, perl("(?<=:).+?(?=,)"))[[1]],
   regmatches(lines, gregexpr("city\\d+-street\\d+-long\\d+-lat\\d+", lines)),
   lapply(unlist(strsplit(lines,',')),
       function(x)unlist(strsplit(x,':'))[2]),
   lapply(strsplit(lines,'//'),
        function(x)
          sub('.*:(.*),.*','\\1',x))
)
## Unit: milliseconds
##                            expr         min         lq     median             uq        max neval
## gagolews-stri_extract_all_regex    4.722515   4.811009   4.835948       4.883854   6.080912   100
##        gagolews-str_extract_all  103.514964 103.824223 104.387175     106.246773 117.279208   100
##          thelatemail-regmatches   36.049106  36.172549  36.342945      36.967325  47.399339   100
##                  agstudy-lapply   21.152761  21.500726  21.792979      22.809145  37.273120   100
##                 agstudy-lapply2    8.763783   8.854666   8.930955       9.128782  10.302468   100

As you can see, the stringi-based solution is the fastest.
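
Timings are only comparable if the candidates return the same values, so it can be worth checking that first. The following is a rough sketch using only base R and a shortened input. The names by_pattern and by_split are illustrative, and the second approach is a simplified vectorised form of the strsplit/sub idea from the benchmark rather than the exact benchmarked expression.

```r
line <- "abc:city1-street1-long1-lat1,ldjad;skj//abc:city2-street2-long2-lat2,ldjad;skj//"

# Approach 1: match the explicit city-street-long-lat pattern
by_pattern <- regmatches(
  line,
  gregexpr("city\\d+-street\\d+-long\\d+-lat\\d+", line)
)[[1]]

# Approach 2: split the line on "//", then keep the text between ':' and ','
pieces <- strsplit(line, "//", fixed = TRUE)[[1]]
by_split <- sub(".*:(.*),.*", "\\1", pieces)

# Both should yield the same character vector before their speed is compared
print(identical(by_pattern, by_split))
```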