chr*_*ler 13 string optimization r bioinformatics
我正在寻找一种更快的方法来计算从FASTA文件读入的DNA字符串的GC内容.这归结为取一个字符串并计算字母'G'或'C'出现的次数.我还想指定要考虑的字符范围.
我有一个相当慢的工作函数,它导致我的代码瓶颈.它看起来像这样:
##
## count the number of GCs in the characters between start and stop
##
gcCount <- function(line, st, sp){
chars = strsplit(as.character(line),"")[[1]]
numGC = 0
for(j in st:sp){
##nested ifs faster than an OR (|) construction
if(chars[[j]] == "g"){
numGC <- numGC + 1
}else if(chars[[j]] == "G"){
numGC <- numGC + 1
}else if(chars[[j]] == "c"){
numGC <- numGC + 1
}else if(chars[[j]] == "C"){
numGC <- numGC + 1
}
}
return(numGC)
}
Run Code Online (Sandbox Code Playgroud)
运行Rprof给我以下输出:
> a = "GCCCAAAATTTTCCGGatttaagcagacataaattcgagg"
> Rprof(filename="Rprof.out")
> for(i in 1:500000){gcCount(a,1,40)};
> Rprof(NULL)
> summaryRprof(filename="Rprof.out")
self.time self.pct total.time total.pct
"gcCount" 77.36 76.8 100.74 100.0
"==" 18.30 18.2 18.30 18.2
"strsplit" 3.58 3.6 3.64 3.6
"+" 1.14 1.1 1.14 1.1
":" 0.30 0.3 0.30 0.3
"as.logical" 0.04 0.0 0.04 0.0
"as.character" 0.02 0.0 0.02 0.0
$by.total
total.time total.pct self.time self.pct
"gcCount" 100.74 100.0 77.36 76.8
"==" 18.30 18.2 18.30 18.2
"strsplit" 3.64 3.6 3.58 3.6
"+" 1.14 1.1 1.14 1.1
":" 0.30 0.3 0.30 0.3
"as.logical" 0.04 0.0 0.04 0.0
"as.character" 0.02 0.0 0.02 0.0
$sampling.time
[1] 100.74
Run Code Online (Sandbox Code Playgroud)
有关使此代码更快的建议吗?
Ken*_*ams 14
最好不要拆分,只计算匹配:
gcCount2 <- function(line, st, sp){
sum(gregexpr('[GCgc]', substr(line, st, sp))[[1]] > 0)
}
Run Code Online (Sandbox Code Playgroud)
这要快一个数量级.
只是迭代字符的小C函数将更快一个数量级.
归档时间: |
|
查看次数: |
9605 次 |
最近记录: |