Access the nth element after splitting a string

sym*_*ush 7 string split r apply

I have a character vector that looks like this:

string <- c("A,1,some text,200", "B,2,some other text,300", "A,3,yet another one,100")

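So each vector element is itself divided by commas. Now I want to extract only the field at a certain position: say, the field before the first comma of every element, or the field after the second comma.

The following code does what I want: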

sapply(strsplit(string, ","), function(x){return(x[[1]])})
# [1] "A" "B" "A"
sapply(strsplit(string, ","), function(x){return(x[[3]])})
# [1] "some text" "some other text" "yet another one"

But this code seems rather complicated to me (considering how simple the problem is). Is there a more concise way to achieve what I want?

G. *_*eck 7

1) data.frame Convert the vector to a data frame; it is then easy to pick out a column or a subset of columns:

DF <- read.table(text = string, sep = ",", as.is = TRUE)

DF[[1]]
## [1] "A" "B" "A"

DF[[3]]
## [1] "some text"       "some other text" "yet another one"

DF[-1]
##   V2              V3  V4
## 1  1       some text 200
## 2  2 some other text 300
## 3  3 yet another one 100

DF[2:3]
##   V2              V3
## 1  1       some text
## 2  2 some other text
## 3  3 yet another one

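2) data.table::transpose The data.table package has a function for transposing lists, so that if stringt is the transposed list then stringt[[3]] is, for example, the vector of third fields, in a manner similar to (1). Even more compact is data.table's tstrsplit, mentioned by @Henrik below, or fread from the same package, mentioned by @akrun below.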

library(data.table)

stringt <- transpose(strsplit(string, ","))

# or
stringt <- tstrsplit(string, ",")

stringt[[1]]
## [1] "A" "B" "A"

stringt[[3]]
## [1] "some text"       "some other text" "yet another one"

stringt[-1]
## [[1]]
## [1] "1" "2" "3"
##
## [[2]]
## [1] "some text"       "some other text" "yet another one"
##
## [[3]]
## [1] "200" "300" "100"

stringt[2:3]
## [[1]]
## [1] "1" "2" "3"
##
## [[2]]
## [1] "some text"       "some other text" "yet another one"

purrr also has a transpose function, but

library(purrr)
transpose(strsplit(string, ","))

produces a list of lists rather than a list of character vectors.
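
To make the idea of "transposing" the split result concrete, here is a minimal base R sketch that builds the same kind of list of character vectors by hand. It is not part of the original answer and assumes every string has the same number of fields as the first one; the names parts and n_fields are illustrative only.

```r
string <- c("A,1,some text,200", "B,2,some other text,300", "A,3,yet another one,100")

# One character vector of fields per input string
parts <- strsplit(string, ",", fixed = TRUE)

# Assumption: every element has as many fields as the first one
n_fields <- length(parts[[1]])

# Element i of the result collects the i-th field of every string
stringt <- lapply(seq_len(n_fields), function(i) vapply(parts, `[`, character(1), i))

stringt[[3]]
## [1] "some text"       "some other text" "yet another one"
```

For real data, data.table's tstrsplit shown above is the more convenient choice; the sketch only illustrates what the transposed structure looks like.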

  • Instead of `transpose(strsplit(`, you can use the convenience function `tstrsplit` (2 upvotes)

Ron*_*hah 6

One option is to use word from stringr with the sep argument

library(stringr)
word(string, 1, sep = ",")
#[1] "A" "B" "A"

word(string, 3, sep = ",")
#[1] "some text"       "some other text" "yet another one"

Since word has the worst performance, I found another option using a regular expression in base R.

# [^,]* (rather than [^,]) lets the skipped fields be longer than one character
# Get 1st element
sub("^(?:[^,]*,){0}([^,]*).*", "\\1", string)
#[1] "A" "B" "A"

# Get 3rd element
sub("^(?:[^,]*,){2}([^,]*).*", "\\1", string)
#[1] "some text"       "some other text" "yet another one"

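There are two groups to match here. The first group matches a field (characters that are not a comma, followed by a comma) n times; the second group then matches another run of characters that are not a comma. The first group is non-capturing (?:), whereas the second group () is captured and returned. Also note that the number in braces ({}) has to be one less than the position of the word we want. So {0} returns the 1st word and {2} returns the 3rd word.

Benchmark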
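
The "one less than the position" rule can be wrapped in a small function. The following is a sketch under stated assumptions, not the answerer's code: nth_field is a hypothetical helper name, the separator is assumed to be a single character with no special meaning inside a regex character class, and perl = TRUE is used only to make the non-capturing group syntax explicit.

```r
# Hypothetical helper (not from the original answer): extract the n-th
# separator-delimited field with a single regular expression.
nth_field <- function(x, n, sep = ",") {
  stopifnot(n >= 1, nchar(sep) == 1)
  # Skip (n - 1) complete fields, then capture the next one
  pattern <- sprintf("^(?:[^%s]*%s){%d}([^%s]*).*$", sep, sep, n - 1, sep)
  sub(pattern, "\\1", x, perl = TRUE)
}

string <- c("A,1,some text,200", "B,2,some other text,300", "A,3,yet another one,100")

nth_field(string, 1)
#[1] "A" "B" "A"

nth_field(string, 3)
#[1] "some text"       "some other text" "yet another one"
```

Note that if an element has fewer than n fields, the pattern does not match and sub returns that element unchanged, so inputs with a varying number of fields need an extra check.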

string <- c("A,1,some text,200","B,2,some other text,300","A,3,yet another one,100")
string <- rep(string, 1e5)

library(microbenchmark)
library(data.table)  # tstrsplit
library(stringr)     # str_split, word
microbenchmark(
  tmfmnk_sapply = sapply(strsplit(string, ","), function(x) x[1]),
  tmfmnk_tstrsplit = tstrsplit(string, ",")[[1]],
  avid_useR_sapply = sapply(strsplit(string, ","), '[', 1),
  avid_useR_str_split = str_split(string, ",", simplify = TRUE)[,1],
  Ronak_Shah_word = word(string, 1, sep = ","),
  Ronak_Shah_sub = sub("(?:[^,],){0}([^,]*).*", "\\1",string),
  G_Grothendieck ={DF <- read.table(text = string, sep = ",",as.is = TRUE);DF[[1]]},
  times = 5
)
#Unit: milliseconds
#               expr     min      lq    mean  median      uq     max neval
#      tmfmnk_sapply 1629.69 1641.61 2128.14 1834.99 1893.43 3640.96     5
#   tmfmnk_tstrsplit 1269.94 1283.79 1286.29 1286.68 1290.76 1300.30     5
#   avid_useR_sapply 1445.40 1447.64 1555.76 1498.14 1609.52 1778.13     5
#avid_useR_str_split  324.68  332.28  332.30  333.97  334.01  336.54     5
#    Ronak_Shah_word 6571.29 6810.92 6956.20 6930.86 7217.26 7250.69     5
#     Ronak_Shah_sub  349.76  354.77  356.91  358.91  359.17  361.94     5
#     G_Grothendieck  354.93  358.24  364.43  362.24  367.79  378.94     5

I did not include Christoph's solution because it is not clear to me how it would work for a variable n, e.g. for the 3rd position, the 4th position, and so on.


avi*_*seR 5

We can simplify the OP's code to:

sapply(strsplit(string, ","), '[', 1)
# [1] "A" "B" "A"

sapply(strsplit(string, ","), '[', 3)
# [1] "some text"       "some other text" "yet another one"

Alternatively, using stringr::str_split with simplify = TRUE, we can index the columns directly, since the output will be a matrix:

library(stringr)
str_split(string, ",", simplify = TRUE)[,1]
# [1] "A" "B" "A"

str_split(string, ",", simplify = TRUE)[,3]
# [1] "some text"       "some other text" "yet another one"
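
As a possible variation on the simplified sapply call (a sketch, not part of the original answer), vapply can be used in the same way. It declares that each result is a single string, so the return type is always a character vector, and an element with too few fields yields NA rather than an error. The names fields and short are illustrative only.

```r
string <- c("A,1,some text,200", "B,2,some other text,300", "A,3,yet another one,100")

# fixed = TRUE treats "," as a literal separator rather than a regex
fields <- strsplit(string, ",", fixed = TRUE)

# Same indexing trick as sapply(..., '[', 3), with a declared result type
vapply(fields, `[`, character(1), 3)
# [1] "some text"       "some other text" "yet another one"

# An element with fewer than 3 fields gives NA
short <- strsplit("A,1", ",", fixed = TRUE)
vapply(short, `[`, character(1), 3)
# [1] NA
```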