ned*_*orf 8 c++ algorithm bioinformatics
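A "kmer" is a DNA sequence of length K. A valid DNA sequence (for my purposes) can only contain the following 4 bases: A, C, T, G. I am looking for a C++ algorithm that simply outputs all possible combinations of these bases, in alphabetical order, into an array of strings. For example, if K = 2, the program should generate the following array: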
kmers[0] = AA
kmers[1] = AC
kmers[2] = AG
kmers[3] = AT
kmers[4] = CA
kmers[5] = CC
kmers[6] = CG
kmers[7] = CT
kmers[8] = GA
kmers[9] = GC
kmers[10] = GG
kmers[11] = GT
kmers[12] = TA
kmers[13] = TC
kmers[14] = TG
kmers[15] = TT
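If I am thinking about this correctly, the problem really breaks down to converting a decimal integer to base 4 and then substituting the corresponding bases. I thought I could use itoa for this, but itoa is not part of the C standard and my compiler does not support it. I welcome any clever ideas. Here is my sample code: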
#include <iostream>
#include <string>
#define K 3
using namespace std;

int main() {
    // 4^K via integer arithmetic; pow() returns a double that may truncate
    int num_kmers = 1 << (2 * K);
    /* Allocate memory for kmers array */
    string* kmers = new string[num_kmers];
    /* Populate kmers array */
    for (int i = 0; i < num_kmers; i++) {
        // POPULATE THE kmers ARRAY HERE
    }
    /* Display all possible kmers */
    for (int i = 0; i < num_kmers; i++)
        cout << kmers[i] << "\n";
    delete [] kmers;
}
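The base-4 idea described in the question can be done without itoa. The following is only a minimal sketch of that idea, not part of the original post: the function name `all_kmers` is hypothetical, it returns a `std::vector` instead of filling a raw array, and it assumes k is small enough that 4^k strings fit in memory.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical helper: returns all 4^k kmers in alphabetical order.
// Assumes k is small enough that 4^k fits in std::size_t and in memory.
std::vector<std::string> all_kmers(unsigned k) {
    static const char bases[] = {'A', 'C', 'G', 'T'};
    const std::size_t count = std::size_t{1} << (2 * k);  // 4^k
    std::vector<std::string> kmers;
    kmers.reserve(count);
    for (std::size_t i = 0; i < count; ++i) {
        std::string kmer(k, 'A');
        std::size_t n = i;
        // Fill from the last position backwards: the least significant
        // base-4 digit of i becomes the last character.
        for (unsigned pos = k; pos-- > 0;) {
            kmer[pos] = bases[n % 4];
            n /= 4;
        }
        kmers.push_back(kmer);
    }
    return kmers;
}
```

Because the index i counts upward and the digits map to A, C, G, T in that order, the resulting strings come out already sorted alphabetically.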
You need to use recursion to stay flexible (i.e., so that you can easily change K).
void populate(int depth, string base, string* kmers, int* kmers_offset)
{
    if (depth == K)
    {
        // A full-length kmer has been built; store it in the next free slot.
        kmers[*kmers_offset].assign(base);
        (*kmers_offset)++;
    }
    else
    {
        // Bases are listed alphabetically, so the output comes out sorted.
        static char bases[] = { 'A', 'C', 'G', 'T' };
        for (int i = 0; i < 4; ++i)
            populate(depth + 1, base + bases[i], kmers, kmers_offset);
    }
}
Then call it like this:
int kmers_offset = 0;
populate(0, "", kmers, &kmers_offset);
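For anyone who would rather not depend on the K macro or a manually tracked offset, the same recursion might be written roughly as below. This is a sketch under assumptions, not the code from this answer: the names `populate_kmers` and `recursive_kmers` are made up for illustration, and it assumes a C++11 compiler.

```cpp
#include <string>
#include <vector>

// Hypothetical variant of populate(): k is a parameter instead of the K macro,
// and results are appended to a std::vector instead of a raw array.
void populate_kmers(unsigned k, const std::string& prefix,
                    std::vector<std::string>& out) {
    if (prefix.size() == k) {
        // The prefix has reached full length, so it is a complete kmer.
        out.push_back(prefix);
        return;
    }
    static const char bases[] = {'A', 'C', 'G', 'T'};
    // Extend the prefix with each base in alphabetical order.
    for (char b : bases)
        populate_kmers(k, prefix + b, out);
}

// Hypothetical convenience wrapper that starts the recursion.
std::vector<std::string> recursive_kmers(unsigned k) {
    std::vector<std::string> out;
    populate_kmers(k, "", out);
    return out;
}
```

The vector grows as needed, so there is no separate allocation step and no offset pointer to pass around.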
Cheers.