A better way to shuffle the elements of a string in R

Question

I need to shuffle the elements of a string. I wrote this code:

sequ <- "GCTTCG"
set.seed(2017)
i <- sample(1:nchar(sequ))
separate.seq.letters <- unlist(strsplit(sequ, ""))
paste(separate.seq.letters[i], collapse = "")
[1] "GTCGTC"

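This code shuffles the elements once. The main question is whether there is a better (more efficient) way to do this. For very long sequences and a large number of shuffles, the strsplit and paste calls take some extra time.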

Answer 1

Sim*_*son 6

Passing the work to C++ via the Rcpp package is probably the fastest.

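Below I benchmark some of the approaches suggested so far, including:

the method from the question
@akrun's method from the comments
the method using the Biostrings package, as suggested by @knb
the method using the stringi package, as suggested by @Rich
a custom Rcpp function based on this post.

Apart from the stringi function, here are the other methods wrapped in functions for testing: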

f_question <- function(s) {
  i <- sample(1:nchar(s))
  separate.seq.letters <- unlist(strsplit(s, ""))
  paste(separate.seq.letters[i], collapse = "")
}

f_comment <- function(s) {
  s1 <- unlist(strsplit(s, ""))
  paste(s1[sample(nchar(s))], collapse="")
}

library(Biostrings)
f_biostring <- function(s) {
  probes <- DNAStringSet(s)
  lapply(probes, sample)
}

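Rcpp::cppFunction(
  'std::string shuffleString(std::string s) {
    // Fisher-Yates shuffle: walk from the last position down to the first,
    // swapping each position with a random position at or before it.
    // Note: rand() is the C library RNG and is not controlled by set.seed().
    int x = s.length();
    for (int y = x; y > 0; y--) {
      int pos = rand() % y;  // index in [0, y-1]; using % x here would bias the shuffle
      char tmp = s[y-1];
      s[y-1] = s[pos];
      s[pos] = tmp;
    }
    return s;
  }'
)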

For testing, load the libraries and write a function to generate sequences of length n:

library(microbenchmark)
library(tidyr)
library(ggplot2)

generate_string <- function(n) {
  paste(sample(c("A", "C", "G", "T"), n, replace = TRUE), collapse = "")
}

sequ <- generate_string(10)

# Test example....

sequ
#> [1] "TTATCAAGGC"

f_question(sequ)
#> [1] "CATGGTACAT"
f_comment(sequ)
#> [1] "GATTATAGCC"
f_biostring(sequ)
#> [[1]]
#>   10-letter "DNAString" instance
#> seq: TAGATCGCAT
shuffleString(sequ)
#> [1] "GATTAATCGC"
stringi::stri_rand_shuffle(sequ)
#> [1] "GAAGTCCTTA"

Testing all the functions with small n (10 - 100):

ns <- seq(10, 100, by = 10)
times <- sapply(ns, function(n) {
  string <- generate_string(n)

  op <- microbenchmark(
    QUESTION     = f_question(string),
    COMMENT      = f_comment(string),
    BIOSTRING    = f_biostring(string),
    RCPP         = shuffleString(string),
    STRINGI      = stringi::stri_rand_shuffle(string)
  )
  by(op$time, op$expr, function(t) mean(t) / 1000)
})
times <- t(times)
times <- as.data.frame(cbind(times, n = ns))

times <- gather(times, -n, key = "fun", value = "time")
pd <- position_dodge(width = 0.2)
ggplot(times, aes(x = n, y = time, group = fun, color = fun)) +
  geom_point(position = pd) +
  geom_line(position = pd) +
  theme_bw()

The Biostrings approach is very slow.

Dropping it and moving to 100 - 1000 (the code stays the same except for ns):
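
The exact values of ns for this range are not shown. A plausible setting, assuming the same ten-point grid as the earlier `seq(10, 100, by = 10)`, would be:

```r
# Assumed step size: the text only gives the range 100 - 1000.
ns <- seq(100, 1000, by = 100)
```

With this ns, the BIOSTRING entry would also be removed from the microbenchmark() call.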

The base R functions (from the question and the comment) are comparable to each other but fall behind.

Dropping these and moving to 1000 - 10000:

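It looks like the custom Rcpp function is the winner, especially as the string length grows. However, if choosing between these, consider that the stringi function stri_rand_shuffle will be more robust (e.g., better tested and designed to handle corner cases).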
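
To illustrate what a corner case looks like here: in base R, `1:nchar("")` evaluates to `c(1, 0)` rather than an empty index, so index-based shuffles need care with empty input. The sketch below is not taken from either answer. `shuffle_chars` and `is_shuffle_of` are hypothetical helper names, and the check only confirms that a shuffle keeps the same characters.

```r
# Hypothetical helper: the question's approach, but indexing with
# sample.int(length(chars)) so empty and one-character inputs are safe.
shuffle_chars <- function(s) {
  chars <- strsplit(s, "")[[1]]
  paste(chars[sample.int(length(chars))], collapse = "")
}

# Hypothetical check: a valid shuffle keeps exactly the same characters.
is_shuffle_of <- function(original, shuffled) {
  identical(sort(strsplit(original, "")[[1]]),
            sort(strsplit(shuffled, "")[[1]]))
}

# Try an empty string, a single character, and the example sequence.
edge_cases <- c("", "A", "GCTTCG")
ok <- vapply(edge_cases,
             function(s) is_shuffle_of(s, shuffle_chars(s)),
             logical(1))
```

The same check can be pointed at any of the benchmarked functions before trusting its timings.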

Answer 2

Ric*_*ven 5

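You could take a look at stri_rand_shuffle() from the stringi package. It is written entirely in C and should be very efficient. According to the documentation, it

Generates a (pseudo)random permutation of the code points in each string.

Let's try it out: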

replicate(5, stringi::stri_rand_shuffle("GCTTCG"))
# [1] "GTTCCG" "CCGTTG" "CTCTGG" "CCGGTT" "GTCGCT"
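
To make "permutation of the code points" concrete, here is a rough base R sketch of the same idea. It is not how stringi implements it, and `shuffle_code_points` is a hypothetical name. It converts the string to integer code points, permutes them, and converts back.

```r
# Hypothetical helper: permute the Unicode code points of one string.
shuffle_code_points <- function(s) {
  cp <- utf8ToInt(s)
  # Index with sample.int(): sample(cp) would misbehave for a single code
  # point, because sample(n) on one number draws from 1:n.
  intToUtf8(cp[sample.int(length(cp))])
}

set.seed(2017)
# Five independent shuffles of the example sequence.
shuffled <- vapply(rep("GCTTCG", 5), shuffle_code_points,
                   character(1), USE.NAMES = FALSE)
```

Unlike stri_rand_shuffle(), this helper handles one string per call, so shuffling many strings needs a loop such as the vapply() above.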
