Faster way to split a string and count characters using R?

chr*_*ler 13 string optimization r bioinformatics

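I'm looking for a faster way to calculate the GC content of DNA strings read in from a FASTA file. This boils down to taking a string and counting the number of times the letter 'G' or 'C' appears. I also want to specify the range of characters to consider.

I have a working function that is fairly slow, and it is causing a bottleneck in my code. It looks like this: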

##
## count the number of GCs in the characters between start and stop
##
gcCount <-  function(line, st, sp){
  chars = strsplit(as.character(line),"")[[1]]
  numGC = 0
  for(j in st:sp){
    ##nested ifs faster than an OR (|) construction
    if(chars[[j]] == "g"){
      numGC <- numGC + 1
    }else if(chars[[j]] == "G"){
      numGC <- numGC + 1
    }else if(chars[[j]] == "c"){
      numGC <- numGC + 1
    }else if(chars[[j]] == "C"){
      numGC <- numGC + 1
    }
  }
  return(numGC)
}

Running Rprof gives me the following output:

> a = "GCCCAAAATTTTCCGGatttaagcagacataaattcgagg"
> Rprof(filename="Rprof.out")
> for(i in 1:500000){gcCount(a,1,40)};
> Rprof(NULL)
> summaryRprof(filename="Rprof.out")

$by.self
                   self.time self.pct total.time total.pct
"gcCount"          77.36     76.8     100.74     100.0
"=="               18.30     18.2      18.30      18.2
"strsplit"          3.58      3.6       3.64       3.6
"+"                 1.14      1.1       1.14       1.1
":"                 0.30      0.3       0.30       0.3
"as.logical"        0.04      0.0       0.04       0.0
"as.character"      0.02      0.0       0.02       0.0

$by.total
               total.time total.pct self.time self.pct
"gcCount"          100.74     100.0     77.36     76.8
"=="                18.30      18.2     18.30     18.2
"strsplit"           3.64       3.6      3.58      3.6
"+"                  1.14       1.1      1.14      1.1
":"                  0.30       0.3      0.30      0.3
"as.logical"         0.04       0.0      0.04      0.0
"as.character"       0.02       0.0      0.02      0.0

$sampling.time
[1] 100.74

Any suggestions for making this code faster?

Ken*_*ams 14

Better not to split at all, and just count the matches:

gcCount2 <-  function(line, st, sp){
  sum(gregexpr('[GCgc]', substr(line, st, sp))[[1]] > 0)
}

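That's an order of magnitude faster.

A small C function that just iterates over the characters would be yet another order of magnitude faster.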
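To make the regex version easier to follow, here is a small sketch of why it compares against 0, plus a regex-free variant in plain R. The name `gcCount3` and the use of `utf8ToInt` are assumptions added for illustration; this is not the C function mentioned above, and no timing claim is made for it.

```r
a <- "GCCCAAAATTTTCCGGatttaagcagacataaattcgagg"

# gregexpr() reports "no match" as a single -1 rather than an empty vector,
# which is why the answer tests `> 0` before summing.
noMatch <- gregexpr("[GCgc]", "ATAT")[[1]][1]

gcCount2 <- function(line, st, sp) {
  sum(gregexpr("[GCgc]", substr(line, st, sp))[[1]] > 0)
}

# Hypothetical variant: convert the substring to integer code points
# and count those equal to the codes for G, C, g, or c.
gcCount3 <- function(line, st, sp) {
  codes <- utf8ToInt(substr(line, st, sp))
  sum(codes %in% utf8ToInt("GCgc"))
}

gcCount2(a, 1, 40)
gcCount3(a, 1, 40)
```

Both calls should return 16 for the sample string from the question. If the speed matters, the variants are worth profiling on real data before choosing one.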


rgu*_*uha 6

A one-liner:

table(strsplit(toupper(a), '')[[1]])
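The one-liner tallies every base over the whole string, so it does not yet give the G+C count for a range. A minimal sketch of how it could be adapted is below; the wrapper name `gcCountTable` is hypothetical, and `substr` is assumed as the way to apply the start and stop positions.

```r
a <- "GCCCAAAATTTTCCGGatttaagcagacataaattcgagg"

# Hypothetical wrapper around the one-liner above.
gcCountTable <- function(line, st, sp) {
  # Restrict to the requested range, fold case, and tally each character.
  counts <- table(strsplit(toupper(substr(line, st, sp)), "")[[1]])
  # Select by name with %in% so a base that never occurs contributes 0.
  sum(counts[names(counts) %in% c("G", "C")])
}

gcCountTable(a, 1, 40)
```

This still splits the string into single characters, so it is likely closer in cost to the original function than to the regex approach.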