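I'm trying to find out whether there is a faster approach in R than the vectorized gsub function. I have the following data frame with some "sentences" (sent$words), and I have a set of words to remove from these sentences (stored in the wordsForRemoving variable).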
sent <- data.frame(
  words = c("just right size and i love this notebook", "benefits great laptop",
            "wouldnt bad notebook", "very good quality", "bad orgtop but great",
            "great improvement for that bad product but overall is not good",
            "notebook is not good but i love batterytop"),
  user = c(1,2,3,4,5,6,7),
  stringsAsFactors = FALSE)
wordsForRemoving <- c("great","improvement","love","great improvement","very good","good",
                      "right", "very","benefits", "extra","benefit","top","extraordinarily",
                      "extraordinary", "super","benefits super","good","benefits great",
                      "wouldnt bad")
Then I create a "big data" simulation for measuring the time consumption...
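# Note: df.expanded is not used in the timing below
df.expanded <- as.data.frame(replicate(1000000,sent$words))
library(zoo)
# Repeat the 7 rows 1,000,000 times (7,000,000 rows in total)
sent <- coredata(sent)[rep(seq(nrow(sent)),1000000),]
rownames(sent) <- NULL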
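Removing the words (wordsForRemoving) from sent$words with the following gsub approach takes 72.87 seconds. I know this is not a great simulation, but in reality I use a dictionary of more than 3,000 words on 300,000 sentences, and the overall processing takes more than 1.5 hours.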
pattern <- paste0("\\b(?:", paste(wordsForRemoving, collapse = "|"), ")\\b ?")
res <- gsub(pattern, "", sent$words)
# user system elapsed
# 72.87 0.05 73.79
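To make the behavior of this pattern concrete, here is a minimal sketch on a handful of the sentences above. The names `sample_words` and `sample_remove` are illustrative only, and `perl = TRUE` is an assumption made here to pin down one regex flavor; the timing above uses the default engine.

```r
# Illustrative subset of the sentences and removal words from the question.
sample_words <- c("benefits great laptop",
                  "very good quality",
                  "notebook is not good but i love batterytop")
sample_remove <- c("great", "love", "very good", "good", "very", "benefits", "top")

# Same construction as in the question: one alternation wrapped in word
# boundaries, plus an optional trailing space.
sample_pattern <- paste0("\\b(?:", paste(sample_remove, collapse = "|"), ")\\b ?")

sample_res <- gsub(sample_pattern, "", sample_words, perl = TRUE)
print(sample_res)
# [1] "laptop"  "quality"  "notebook is not but i batterytop"
```

The `\\b` anchors are why "top" is not stripped out of "batterytop": there is no word boundary between "y" and "t".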
Could anyone please help me write a faster approach for this task? Any help or advice is much appreciated. Thanks a lot in advance.
vra*_*js5 17
As Jason mentioned, stringi is a good option for you.
Here is how stringi performs:
system.time(res <- gsub(pattern, "", sent$words))
user system elapsed
66.229 0.000 66.199
library(stringi)
system.time(stri_replace_all_regex(sent$words, pattern, ""))
user system elapsed
21.246 0.320 21.552
Update (thanks, Arun)
system.time(res <- gsub(pattern, "", sent$words, perl = TRUE))
user system elapsed
12.290 0.000 12.281
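Before switching engines on the full data, it may be worth checking on a small sample that both produce the same output. The sketch below is one way to do that with base R only. The sample data and the names `sample_words` and `remove_words` are hypothetical. Deduplicating the word list and putting longer entries first is an assumption added here, not part of the original post; with `perl = TRUE` the alternation is tried left to right, so this ordering lets a phrase such as "great improvement" be tried before "great".

```r
# Hypothetical small sample; these names are illustrative, not from the post.
sample_words <- rep(c("very good quality",
                      "great improvement for that bad product"), 1000)
remove_words <- c("great", "improvement", "great improvement",
                  "very good", "good", "very", "good")

# Drop duplicates and sort by decreasing length, so that longer phrases
# come before their own prefixes in the alternation.
remove_words <- unique(remove_words)
remove_words <- remove_words[order(nchar(remove_words), decreasing = TRUE)]

pattern <- paste0("\\b(?:", paste(remove_words, collapse = "|"), ")\\b ?")

# Time the default engine and the PCRE engine on the same input.
t_default <- system.time(res_default <- gsub(pattern, "", sample_words))
t_perl    <- system.time(res_perl <- gsub(pattern, "", sample_words, perl = TRUE))

# Elapsed seconds for each variant, and whether the outputs agree.
print(c(default = t_default[["elapsed"]], perl = t_perl[["elapsed"]]))
print(identical(res_default, res_perl))
```

Timings from such a small sample say little about the full data set; the point is only to confirm the results match before trusting the faster variant.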
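This is not a real answer, since I did not find any method that is always faster. Apparently it depends on the length of your text/vector. With short texts, gsub performs fastest. With longer texts or vectors, sometimes gsub with perl=TRUE and sometimes stri_replace_all_regex performs fastest.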
Here is some test code to try:
library(stringi)
text = "(a1,\"something (f fdd71)\");(b2,\"something else (a fa171)\");(b4,\"something else (a fa171)\")"
# text = paste(rep(text, 5), collapse = ",")
# text = rep(text, 100)
nchar(text)
a = gsub(pattern = "[()]", replacement = "", x = text)
b = gsub(pattern = "[()]", replacement = "", x = text, perl=T)
c = stri_replace_all_regex(str = text, pattern = "[()]", replacement = "")
d = stri_replace(str = text, regex = "[()]", replacement = "", mode="all")
identical(a,b); identical(a,c); identical(a,d)
library(microbenchmark)
mc <- microbenchmark(
gsub = gsub(pattern = "[()]", replacement = "", x = text),
gsub_perl = gsub(pattern = "[()]", replacement = "", x = text, perl=T),
stringi_all = stri_replace_all_regex(str = text, pattern = "[()]", replacement = ""),
stringi = stri_replace(str = text, regex = "[()]", replacement = "", mode="all")
)
mc
Unit: microseconds
        expr    min      lq     mean  median     uq     max neval cld
        gsub 10.868 11.7740 13.47869 13.5840 14.490  31.394   100   a
   gsub_perl 79.690 80.2945 82.58225 82.4070 83.312 137.043   100   d
 stringi_all 14.188 14.7920 15.58558 15.5460 16.301  17.509   100   b
     stringi 36.828 38.0350 39.90904 38.7895 39.543 129.194   100   c
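To see how the ranking shifts with input size, one could time the variants over vectors of increasing length. The following is a rough sketch using only base R and `system.time`, so it is much coarser than microbenchmark. The helper `time_variants` and the chosen sizes are hypothetical, and the stringi variants are left out to keep the sketch free of package dependencies.

```r
text <- "(a1,\"something (f fdd71)\");(b2,\"something else (a fa171)\")"

# Hypothetical helper: time both base R engines on `text` repeated n times.
time_variants <- function(n) {
  x <- rep(text, n)
  c(n = n,
    gsub      = system.time(gsub("[()]", "", x))[["elapsed"]],
    gsub_perl = system.time(gsub("[()]", "", x, perl = TRUE))[["elapsed"]])
}

# One row of elapsed seconds per vector length.
timings <- as.data.frame(do.call(rbind, lapply(c(1, 1000, 100000), time_variants)))
print(timings)
```

For very short inputs the elapsed times will mostly round to zero, which is why microbenchmark is the better tool at that scale.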