A faster approach than gsub in R

mar*_*abe 12 regex r

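I am trying to find out whether R has anything faster than the vectorized gsub function. I have a data frame containing some "sentences" (sent$words), and a set of words that I want to remove from those sentences (stored in the wordsForRemoving variable).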

sent <- data.frame(words = 
                     c("just right size and i love this notebook", "benefits great laptop",
                       "wouldnt bad notebook", "very good quality", "bad orgtop but great",
                       "great improvement for that bad product but overall is not good", 
                       "notebook is not good but i love batterytop"), 
                   user = c(1,2,3,4,5,6,7),
                   stringsAsFactors=F)

wordsForRemoving <- c("great","improvement","love","great improvement","very good","good",
                      "right", "very","benefits", "extra","benefit","top","extraordinarily",
                      "extraordinary", "super","benefits super","good","benefits great",
                      "wouldnt bad")

Then I create a "big data" simulation to measure the running time:

df.expanded <- as.data.frame(replicate(1000000,sent$words))
library(zoo)
sent <- coredata(sent)[rep(seq(nrow(sent)),1000000),]
rownames(sent) <- NULL

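Removing the words (wordsForRemoving) from sent$words with the following gsub approach takes 72.87 seconds. I know this is not a great simulation, but in my real task I use a dictionary of more than 3,000 words on 300,000 sentences, and the overall processing time exceeds 1.5 hours.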

pattern <- paste0("\\b(?:", paste(wordsForRemoving, collapse = "|"), ")\\b ?")
res <- gsub(pattern, "", sent$words)

#  user  system elapsed 
# 72.87    0.05   73.79

Could anyone help me write a faster approach for this task? Any help or advice is much appreciated. Thanks in advance.

vra*_*js5 17

As Jason mentioned, stringi is a good option for you.

Here is how stringi performs:

system.time(res <- gsub(pattern, "", sent$words))
   user  system elapsed 
 66.229   0.000  66.199 

library(stringi)
system.time(stri_replace_all_regex(sent$words, pattern, ""))
   user  system elapsed 
 21.246   0.320  21.552 

Update (thanks, Arun)

system.time(res <- gsub(pattern, "", sent$words, perl = TRUE))
   user  system elapsed 
 12.290   0.000  12.281 

  • Try `perl = TRUE` with base R's `gsub`. (15 upvotes)
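
Before trusting any of these timings, it is worth confirming that the engines return the same result for this pattern. The following is a minimal sketch on a shortened subset of the question's data (the subset and the `res_*` names are chosen here for illustration); the stringi comparison only runs if that package is installed.

```r
# Shortened sample based on the question's data (illustrative subset)
words <- c("just right size and i love this notebook",
           "benefits great laptop",
           "very good quality")
wordsForRemoving <- c("great", "love", "very good", "good",
                      "right", "very", "benefits")

# Same pattern construction as in the question
pattern <- paste0("\\b(?:", paste(wordsForRemoving, collapse = "|"), ")\\b ?")

res_default <- gsub(pattern, "", words)               # default regex engine
res_perl    <- gsub(pattern, "", words, perl = TRUE)  # PCRE engine

# Both base R engines should agree on this input
stopifnot(identical(res_default, res_perl))

# Optional: compare against stringi (ICU regex) when it is available
if (requireNamespace("stringi", quietly = TRUE)) {
  res_stringi <- stringi::stri_replace_all_regex(words, pattern, "")
  stopifnot(identical(res_perl, res_stringi))
}

print(res_perl)
```

If the results agree on a representative sample of your own data, switching engines is purely a speed decision.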

SeG*_*eGa 6

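This is not a real answer, because I did not find any approach that is always faster. Apparently it depends on the length of your text or vector. For short texts, gsub is the fastest. For longer texts or vectors, gsub with perl=TRUE is sometimes the fastest and stri_replace_all_regex is sometimes the fastest.

Here is some test code to try: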

library(stringi)
text = "(a1,\"something (f fdd71)\");(b2,\"something else (a fa171)\");(b4,\"something else (a fa171)\")"
# text = paste(rep(text, 5), collapse = ",")
# text = rep(text, 100)
nchar(text)

a = gsub(pattern = "[()]", replacement = "", x = text)
b = gsub(pattern = "[()]", replacement = "", x = text, perl=T)
c = stri_replace_all_regex(str = text, pattern = "[()]", replacement = "")
d = stri_replace(str = text, regex = "[()]", replacement = "", mode="all")

identical(a,b); identical(a,c); identical(a,d)

library(microbenchmark)
mc <- microbenchmark(
  gsub = gsub(pattern = "[()]", replacement = "", x = text),
  gsub_perl = gsub(pattern = "[()]", replacement = "", x = text, perl=T),
  stringi_all = stri_replace_all_regex(str = text, pattern = "[()]", replacement = ""),
  stringi = stri_replace(str = text, regex = "[()]", replacement = "", mode="all")
)
mc
Unit: microseconds
        expr    min      lq     mean  median     uq     max neval  cld
        gsub 10.868 11.7740 13.47869 13.5840 14.490  31.394   100 a   
   gsub_perl 79.690 80.2945 82.58225 82.4070 83.312 137.043   100    d
 stringi_all 14.188 14.7920 15.58558 15.5460 16.301  17.509   100  b  
     stringi 36.828 38.0350 39.90904 38.7895 39.543 129.194   100   c
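
To see how the ranking shifts with input size, the same comparison can be repeated over vectors of increasing length. This is only a rough sketch: `time_replace`, `candidates`, and the chosen sizes are hypothetical names and values introduced here, and it uses `system.time` rather than microbenchmark to keep dependencies minimal. Absolute numbers will differ by machine.

```r
text <- "(a1,\"something (f fdd71)\");(b2,\"something else (a fa171)\")"

# Hypothetical helper: elapsed seconds for `reps` runs of replacement function f on x
time_replace <- function(f, x, reps = 20) {
  unname(system.time(for (i in seq_len(reps)) f(x))["elapsed"])
}

# Candidate implementations, all removing parentheses
candidates <- list(
  gsub      = function(x) gsub("[()]", "", x),
  gsub_perl = function(x) gsub("[()]", "", x, perl = TRUE)
)
# Add stringi only if it is installed
if (requireNamespace("stringi", quietly = TRUE)) {
  candidates$stringi_all <- function(x) {
    stringi::stri_replace_all_regex(x, "[()]", "")
  }
}

# Vector lengths to test (arbitrary illustrative values)
sizes <- c(1, 100, 10000)

# One column per candidate, one row per vector length
timings <- sapply(candidates, function(f) {
  sapply(sizes, function(n) time_replace(f, rep(text, n)))
})
rownames(timings) <- paste0("n=", sizes)
print(timings)
```

Running this on your own data, rather than on the toy string, is the more reliable way to pick an engine.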