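PHP's str_replace (and preg_replace) functions replace all occurrences of a search string with a replacement string. What interests me most is that when the search and replace arguments are arrays (which in R we would call vectors), str_replace takes a value from each array (vector) and uses the pair to search and replace in the subject.
In other words, does R (or some R package) have a function that does the following: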
string <- "The quick brown fox jumped over the lazy dog."
patterns <- c("quick", "brown", "fox")
replacements <- c("slow", "black", "bear")
xxx_replace_xxx(string, patterns, replacements) ## ???
## [1] "The slow black bear jumped over the lazy dog."
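So I am looking for something like chartr, but for search patterns and replacement strings with an arbitrary number of characters. This cannot be done with a single call to gsub(), because its replacement argument can only be a single string; see ?gsub. My current implementation is therefore: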
xxx_replace_xxx <- function(string, patterns, replacements) {
  for (i in seq_along(patterns))
    string <- gsub(patterns[i], replacements[i], string, fixed=TRUE)
  string
}
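A small sketch of how this loop differs from chartr, using made-up one-letter patterns (the inputs are illustrative only, not from the data below). chartr maps all characters in one pass, while the loop applies the pairs one after another, so a later pattern can also match text produced by an earlier replacement:

```r
# Same loop as above, repeated so the snippet runs on its own.
xxx_replace_xxx <- function(string, patterns, replacements) {
  for (i in seq_along(patterns))
    string <- gsub(patterns[i], replacements[i], string, fixed = TRUE)
  string
}

# chartr() translates single characters, all in one pass:
# "a" becomes "b" and the original "b" becomes "c".
one_pass <- chartr("ab", "bc", "ab")
print(one_pass)    # "bc"

# The loop first turns "ab" into "bb", then the second pattern
# rewrites both "b"s, including the one it just introduced.
sequential <- xxx_replace_xxx("ab", c("a", "b"), c("b", "c"))
print(sequential)  # "cc"
```

This matters whenever a replacement contains one of the later patterns, as happens with the benchmark data below.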
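However, I am looking for something faster when length(patterns) is large: I have a lot of data to process, and I am not satisfied with the current performance.
Toy example data for benchmarking: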
string <- readLines("http://www.gutenberg.org/files/31536/31536-0.txt", encoding="UTF-8")
patterns <- c("jak", "to", "do", "z", "na", "i", "w", "za", "tu", "gdy",
"po", "jest", "Tadeusz", "lub", "razem", "nas", "przy", "oczy", "czy",
"sam", "u", "tylko", "bez", "ich", "Telimena", "Wojski", "jeszcze")
replacements <- paste0(patterns, rev(patterns))
Jos*_*ich 10
For your example, using PCRE instead of fixed matching takes about 1/3 of the time on my machine.
xxx_replace_xxx_pcre <- function(string, patterns, replacements) {
  for (i in seq_along(patterns))
    string <- gsub(patterns[i], replacements[i], string, perl=TRUE)
  string
}
system.time(x <- xxx_replace_xxx(string, patterns, replacements))
# user system elapsed
# 0.491 0.000 0.491
system.time(p <- xxx_replace_xxx_pcre(string, patterns, replacements))
# user system elapsed
# 0.162 0.000 0.162
identical(x,p)
# [1] TRUE
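If the patterns are fixed strings made up of word characters, as in the example, the following works. gsubfn is like gsub, except that the replacement argument can be a string, a list, a function, or a proto object. If it is a list, as here, each match of the regular expression is compared against the list's names, and any match that is found among them is replaced with the corresponding value: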
library(gsubfn)
gsubfn("\\b\\w+\\b", as.list(setNames(replacements, patterns)), string)
## [1] "The slow black bear jumped over the lazy dog."
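To make the lookup idea concrete, here is a rough base-R sketch of the same approach: match every word, then replace only the words that appear as names in a lookup vector. This is not how gsubfn is implemented internally; the name `lookup` is introduced here purely for illustration.

```r
string <- "The quick brown fox jumped over the lazy dog."
patterns <- c("quick", "brown", "fox")
replacements <- c("slow", "black", "bear")

# Named vector: names are the search words, values the replacements.
lookup <- setNames(replacements, patterns)

# Locate every whole word in each element of `string`.
m <- gregexpr("\\b\\w+\\b", string, perl = TRUE)

# Swap in the replacement for words found in `lookup`; leave the rest as-is.
result <- string
regmatches(result, m) <- lapply(regmatches(result, m), function(words) {
  hit <- words %in% names(lookup)
  words[hit] <- lookup[words[hit]]
  words
})
print(result)
# "The slow black bear jumped over the lazy dog."
```

Because every word is matched once and looked up once, this is a single pass over the text, so a replacement is never rescanned by a later pattern.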
With stringi >= 0.3-1 this can be done by calling one of the stri_replace_all_* functions with the vectorize_all argument set to FALSE:
library("stringi")
string <- "The quicker brown fox jumped over the lazy dog."
patterns <- c("quick", "brown", "fox")
replacements <- c("slow", "black", "bear")
stri_replace_all_fixed(string, patterns, replacements, vectorize_all=FALSE)
## [1] "The slower black bear jumped over the lazy dog."
stri_replace_all_regex(string, "\\b" %s+% patterns %s+% "\\b", replacements, vectorize_all=FALSE)
## [1] "The quicker black bear jumped over the lazy dog."
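For contrast, a short sketch of what the vectorize_all argument changes, assuming the stringi package is installed. With the default (TRUE) the arguments are recycled element-wise, so each pattern/replacement pair yields its own output string; with FALSE all pairs are applied in turn to each input string:

```r
library(stringi)

string <- "The quick brown fox jumped over the lazy dog."
patterns <- c("quick", "brown", "fox")
replacements <- c("slow", "black", "bear")

# Default, vectorize_all = TRUE: `string` is recycled to length 3 and
# each result has only one of the three replacements applied.
pairwise <- stri_replace_all_fixed(string, patterns, replacements)
print(length(pairwise))  # 3
print(pairwise[1])       # "The slow brown fox jumped over the lazy dog."

# vectorize_all = FALSE: every pair is applied to each input string.
combined <- stri_replace_all_fixed(string, patterns, replacements,
                                   vectorize_all = FALSE)
print(length(combined))  # 1
print(combined)          # "The slow black bear jumped over the lazy dog."
```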
Some benchmarks:
string <- readLines("http://www.gutenberg.org/files/31536/31536-0.txt", encoding="UTF-8")
patterns <- c("jak", "to", "do", "z", "na", "i", "w", "za", "tu", "gdy",
"po", "jest", "Tadeusz", "lub", "razem", "nas", "przy", "oczy", "czy",
"sam", "u", "tylko", "bez", "ich", "Telimena", "Wojski", "jeszcze")
replacements <- paste0(patterns, rev(patterns))
microbenchmark::microbenchmark(
  stri_replace_all_fixed(string, patterns, replacements, vectorize_all=FALSE),
  stri_replace_all_regex(string, "\\b" %s+% patterns %s+% "\\b", replacements, vectorize_all=FALSE),
  xxx_replace_xxx_pcre(string, "\\b" %s+% patterns %s+% "\\b", replacements),
  gsubfn("\\b\\w+\\b", as.list(setNames(replacements, patterns)), string),
  unit="relative",
  times=25
)
## Unit: relative
##                   expr       min        lq      mean    median        uq       max neval
## stri_replace_all_fixed  1.000000  1.000000  1.000000  1.000000  1.000000  1.000000    25
## stri_replace_all_regex  2.169701  2.248115  2.198638  2.267935  2.267635  1.753289    25
##   xxx_replace_xxx_pcre  1.983135  1.967303  1.937021  1.961449  1.974422  1.469894    25
##                 gsubfn 63.067835 69.870657 69.815031 71.178841 72.503020 57.019072    25
So, among the approaches that match only at word boundaries, the PCRE-based version is the fastest.