scr*_*Owl 5 regex string split r
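I have a file with about 40 million rows that I need to split on the first comma delimiter.
The following, using the stringr function str_split_fixed, works well but is very slow.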
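library(data.table)
library(stringr)
# Build 40,000 rows: id (length 1000) is recycled 40 times to match letter1
df1 <- data.frame(id = 1:1000, letter1 = rep(letters[sample(1:25, 1000, replace = TRUE)], 40))
# combCol1 looks like "1,a"; combCol2 looks like "1,a,1,a"
df1$combCol1 <- paste(df1$id, ',', df1$letter1, sep = '')
df1$combCol2 <- paste(df1$combCol1, ',', df1$combCol1, sep = '')
# Split on the first comma only (n = 2 pieces per string)
st1 <- str_split_fixed(df1$combCol2, ',', 2)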
Any suggestions for doing this faster?
In newer versions of "stringi", the stri_split_fixed function has a simplify argument that can be set to TRUE to return a matrix. The updated solution is therefore:
stri_split_fixed(df1$combCol2, ",", 2, simplify = TRUE)
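To make the effect of the arguments concrete, here is a minimal sketch on a tiny input. The vector `x` is a made-up example, not the original data; it includes one string without a comma to show how `simplify = TRUE` pads the result.

```r
library(stringi)

# Hypothetical sample input (not from the original data)
x <- c("1,a,1,a", "2,b,2,b", "no-comma")

# n = 2 stops splitting after the first comma;
# simplify = TRUE returns a character matrix instead of a list
res <- stri_split_fixed(x, ",", 2, simplify = TRUE)
res
```

Everything after the first comma stays together in the second column, and the row for the string without a comma is padded with an empty string.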
If you are comfortable with the "stringr" syntax and don't want to stray too far from it, but also want the speed boost, try the "stringi" package:
library(stringr)
library(stringi)
system.time(temp1 <- str_split_fixed(df1$combCol2, ',', 2))
# user system elapsed
# 3.25 0.00 3.25
system.time(temp2a <- do.call(rbind, stri_split_fixed(df1$combCol2, ",", 2)))
# user system elapsed
# 0.04 0.00 0.05
system.time(temp2b <- stri_split_fixed(df1$combCol2, ",", 2, simplify = TRUE))
# user system elapsed
# 0.01 0.00 0.01
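Most "stringr" functions have "stringi" counterparts, but as this example shows, without simplify = TRUE the "stringi" output needs one extra step of binding the data to produce a matrix rather than a list.
Here is how it compares with @RichardScriven's suggestion from the comments: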
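As a small illustration of that extra step, the sketch below (using a made-up two-element vector `x`) shows that the default return value is a list, and that `do.call(rbind, ...)` turns it into a matrix:

```r
library(stringi)

# Hypothetical sample input (not from the original data)
x <- c("1,a,1,a", "2,b,2,b")

# Default (simplify = FALSE): a list with one character vector per input string
pieces <- stri_split_fixed(x, ",", 2)
is.list(pieces)

# Extra step: stack the list elements row-wise to get a matrix
mat <- do.call(rbind, pieces)
is.matrix(mat)
```

This relies on every string yielding the same number of pieces; if some strings had no comma, `rbind` would recycle the shorter vectors, so `simplify = TRUE` is the safer choice there.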
fun1a <- function() do.call(rbind, stri_split_fixed(df1$combCol2, ",", 2))
fun1b <- function() stri_split_fixed(df1$combCol2, ",", 2, simplify = TRUE)
fun2 <- function() {
  do.call(rbind, regmatches(df1$combCol2, regexpr(",", df1$combCol2),
                            invert = TRUE))
}
library(microbenchmark)
microbenchmark(fun1a(), fun1b(), fun2(), times = 10)
# Unit: milliseconds
# expr min lq mean median uq max neval
# fun1a() 42.72647 46.35848 59.56948 51.94796 69.29920 98.46330 10
# fun1b() 17.55183 18.59337 20.09049 18.84907 22.09419 26.85343 10
# fun2() 370.82055 404.23115 434.62582 439.54923 476.02889 480.97912 10
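Before comparing timings, it can be worth confirming that the three approaches agree. The following is a rough sketch on a made-up vector `x`, not on `df1`; the names `res_a`, `res_b`, and `res_c` are just for illustration.

```r
library(stringi)

# Hypothetical sample input; every string contains at least one comma
x <- c("1,a,1,a", "2,b,2,b", "10,z,10,z")

res_a <- do.call(rbind, stri_split_fixed(x, ",", 2))                    # as in fun1a
res_b <- stri_split_fixed(x, ",", 2, simplify = TRUE)                   # as in fun1b
res_c <- do.call(rbind, regmatches(x, regexpr(",", x), invert = TRUE))  # as in fun2

# All three should hold the same values, cell by cell
all(res_a == res_b)
all(res_a == res_c)
```

The check only covers strings that contain a comma; inputs without one are padded differently by each approach.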