How can I efficiently reorder the letters within a string in R?

Question

I have the following function to reorder the letters within each element of a character vector.

reorder_letter <- function(x){
  sapply(strsplit(x,split = ""),function(x) paste(sort(toupper(x)),collapse = ""))
}

reorder_letter(c("trErty","Bca","def"))
#> [1] "ERRTTY" "ABC"    "DEF"


Created on 2020-04-29 by the reprex package (v0.3.0)

Basically, I want to return the same letters of each string, but uppercased and in sorted order.

Currently, it takes about 1 minute to run on a vector of 1.5 million elements.

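Edit: I also tried parallelizing with the future.apply package, which is about 3 times faster than the base R solution (and only needs a small change to the current code).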

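library(future.apply)
plan(multisession)  # assumption: a parallel plan is needed; the default plan runs sequentially
reorder_letter <- function(x){
  future_sapply(strsplit(x,split = ""),function(x) paste(sort(toupper(x)),collapse = ""))
}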


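I'm just curious:

How can I achieve this efficiently?
What is the best way to find the bottleneck of a function? For example, I have finished writing this function. What should I do next?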

Answer 1

GKi 7

Perhaps utf8ToInt and intToUtf8 are faster than strsplit and paste.

x <- c("trErty","Bca","def")
unlist(lapply(x, function(y) {intToUtf8(sort(utf8ToInt(toupper(y))))}))
#[1] "ERRTTY" "ABC"    "DEF"

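To make the round trip through code points concrete, here is a minimal sketch of what each step returns for a single string. The variable names (`s`, `codes`, `sorted_codes`, `result`) are illustrative and not part of the original answer. Note that sorting code points matches alphabetical order only for plain ASCII letters; accented or other non-ASCII characters may end up in a different position than a locale-aware sort would give.

```r
s <- toupper("Bca")                 # "BCA"
codes <- utf8ToInt(s)               # integer code points: 66 67 65
sorted_codes <- sort(codes)         # 65 66 67
result <- intToUtf8(sorted_codes)   # collapse back into one string
print(result)
#> [1] "ABC"
```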

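Timings: (utf8ToInt/intToUtf8 is not meaningfully faster... sorry)

However, stringi is faster, and writing the function in C++ is faster still (it could probably be improved further, but it is already about 10 times faster).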

FrankZhang <- function(x) {
  unlist(lapply(strsplit(toupper(x),NULL),function(x) paste(sort(x),collapse = "")))}
GKi <- function(x) {
  unlist(lapply(toupper(x), function(y) {intToUtf8(sort(utf8ToInt(y)))}))
}
library(stringi)
stringi <- function(y) {
  vapply(stri_split_boundaries(toupper(y), type = "character"), function(x) stri_c(x[stri_order(x)], collapse = ""), "")
}
library(Rcpp)
cppFunction("std::string GKiC(std::string &str) {
  std::sort(str.begin(), str.end());
  return(str);}")
GKi2 <- function(x) {unlist(lapply(toupper(x), GKiC))}

x <- apply(expand.grid(letters, LETTERS), 1, paste, collapse = "")
microbenchmark::microbenchmark(FrankZhang(x), GKi(x), stringi(x), GKi2(x), control=list(order="block"))
#Unit: milliseconds
#          expr       min        lq      mean    median        uq       max neval  cld
# FrankZhang(x) 17.533428 18.686879 20.380002 19.719311 21.014381 33.836692   100    d
#        GKi(x) 16.551358 17.665436 18.656223 18.271688 19.343088 23.225199   100   c 
#    stringi(x)  4.644196  4.844622  5.082298  5.011344  5.237714  7.355251   100  b  
#       GKi2(x)  1.527124  1.624337  1.997725  1.691099  2.242797  5.593543   100 a

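Before reading too much into the timings, it may be worth confirming that the variants actually return the same result. The following is a minimal base R sketch that re-declares the two base R functions from the benchmark so it runs on its own; the test vector is the one from the question. It only checks ASCII letters, and it assumes a typical locale in which `sort()` orders uppercase A to Z the same way their code points do.

```r
# Variant based on strsplit() + paste(), as in the benchmark above
FrankZhang <- function(x) {
  unlist(lapply(strsplit(toupper(x), NULL),
                function(s) paste(sort(s), collapse = "")))
}

# Variant based on utf8ToInt() + intToUtf8(), as in the benchmark above
GKi <- function(x) {
  unlist(lapply(toupper(x), function(y) intToUtf8(sort(utf8ToInt(y)))))
}

x <- c("trErty", "Bca", "def")
same <- identical(FrankZhang(x), GKi(x))  # TRUE when both variants agree
print(same)
```

The same `identical()` check can be extended to the stringi and Rcpp variants once those packages are loaded.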

To find out what is using most of the computation time, you can use Rprof, for example:

reorder_letter <- function(x) { #Function
  sapply(strsplit(x,split = ""),function(x) paste(sort(toupper(x)),collapse = ""))}
x <- apply(expand.grid(letters, LETTERS, letters), 1, paste, collapse = "") #Data

Rprof()
y <- reorder_letter(x)
Rprof(NULL)
summaryRprof()
#$by.self
#               self.time self.pct total.time total.pct
#"FUN"               0.12    20.69       0.54     93.10
#"sort.int"          0.10    17.24       0.22     37.93
#"paste"             0.08    13.79       0.42     72.41
#"sort"              0.06    10.34       0.34     58.62
#"sort.default"      0.06    10.34       0.28     48.28
#"match.arg"         0.04     6.90       0.10     17.24
#"eval"              0.04     6.90       0.04      6.90
#"sapply"            0.02     3.45       0.58    100.00
#"lapply"            0.02     3.45       0.56     96.55
#".doSortWrap"       0.02     3.45       0.02      3.45
#"formals"           0.02     3.45       0.02      3.45
#
#$by.total
#                 total.time total.pct self.time self.pct
#"sapply"               0.58    100.00      0.02     3.45
#"reorder_letter"       0.58    100.00      0.00     0.00
#"lapply"               0.56     96.55      0.02     3.45
#"FUN"                  0.54     93.10      0.12    20.69
#"paste"                0.42     72.41      0.08    13.79
#"sort"                 0.34     58.62      0.06    10.34
#"sort.default"         0.28     48.28      0.06    10.34
#"sort.int"             0.22     37.93      0.10    17.24
#"match.arg"            0.10     17.24      0.04     6.90
#"eval"                 0.04      6.90      0.04     6.90
#".doSortWrap"          0.02      3.45      0.02     3.45
#"formals"              0.02      3.45      0.02     3.45
#
#$sample.interval
#[1] 0.02
#
#$sampling.time
#[1] 0.58

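A complementary way to cross-check a profile like the one above is to time each stage of the pipeline separately with `system.time()`. This is only a rough sketch, not part of the original answer: the stage breakdown and the names `chars`, `sorted`, `out`, and `t_*` are hypothetical, and on data this small the elapsed times can be close to the timer resolution, so larger inputs give more meaningful numbers.

```r
# Same test data as in the Rprof example: 26^3 three-letter strings
x <- apply(expand.grid(letters, LETTERS, letters), 1, paste, collapse = "")

# Stage 1: uppercase and split each string into single characters
t_split <- system.time(chars <- strsplit(toupper(x), ""))

# Stage 2: sort the characters of each string
t_sort <- system.time(sorted <- lapply(chars, sort))

# Stage 3: paste the sorted characters back into one string each
t_paste <- system.time(out <- vapply(sorted, paste, "", collapse = ""))

# Elapsed seconds per stage
print(c(split = t_split[["elapsed"]],
        sort  = t_sort[["elapsed"]],
        paste = t_paste[["elapsed"]]))
```

Whichever stage dominates is the one worth replacing first, for example with the stringi or C++ variants shown earlier.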
