1 algorithm perl bioinformatics alignment dna-sequence
这段代码我需要一些帮助.我知道应该递归的部分,或者至少我认为我这样做但不确定如何实现它.我正在尝试从对齐矩阵实现路径查找程序,该矩阵将找到多个路由回零值.例如,如果您执行我的代码并插入CGCA作为第一个序列,并将CACGTAT作为第二个序列插入,并将1,0和-1作为匹配,不匹配和差距分数.该程序给出了HDHHDD和对象的路径
CACGTAT
CGC - A-.
然而,除此之外,还有更多可能的路径和对象,除了我不知道有多少.我想要做的是让我的代码循环回到自身并找到其他路径和对齐,使用与第一次相同的代码,直到它用完可能的对齐.我在网上找到的最好的方法是递归,除了没有人能解释如何做到这一点.在这种情况下,应该有两个以上的路径和对话框HDDDHHD和CACGTAT,以及C - GCA-和.HDDDDHH,CACGTAT和--CGCA-.我只是不知道如何编写代码来执行此任务.
# Implementation of Needleman and Wunsch Algorithm
my($seq1, $len1, $seq2, $len2, $data, @matrix, $i, $j, $x, $y, $val1, $val2);
my($val3, $pathrow, $pathcol, $seq1loc, $seq2loc, $gapscore, $matchscore, $mismatchscore);
#first obtain the data from the user.
print "Please enter the first sequence for comaprsion\n";
$seq1=<STDIN>;
chomp $seq1;
print "Please enter the second sequence for comparsion\n";
$seq2=<STDIN>;
chomp $seq2;
# adding extra characters so sequences align with matrix
# saves some calculations later on
$seq1 = " " . $seq1;
$seq2 = " " . $seq2;
$len1 = length($seq1);
$len2 = length($seq2);
print "Enter the match score: ";
$matchscore=<STDIN>;
chomp $matchscore;
print "Enter the mismatch score: ";
$mismatchscore=<STDIN>;
chomp $mismatchscore;
print "Enter the gap score: ";
$gapscore=<STDIN>;
chomp $gapscore;
# declare a two dimensional array and initialize to spaces
# array must contain one extra row and one extra column
@matrix = ();
for($i = 0; $i < $len1; $i++){
for($j = 0; $j < $len2; $j++){
$matrix[$i][$j] = ' ';
}
}
# initialize 1st row and 1st column of matrix
$matrix[0][0] = 0;
for ($i = 1; $i < $len1; $i ++){
$matrix[$i][0] = $matrix[$i-1][0] + $gapscore;
}
for ($i = 1; $i < $len2; $i ++){
$matrix[0][$i] = $matrix[0][$i-1] + $gapscore;
}
# STEP 1:
# Fill in rest of matrix using the following rules:
# determine three possible values for matrix[x][y]
# value 1 = add gap score to matrix[x][y-1]
# value 2 = add gap score to matrix[x-1][y]
# value 3 = add match score or mismatch score to
# matrix[x-1][y-1] depending on nucleotide
# match for position x of $seq1 and position y
# of seq2
# place the largest of the three values in matrix[x][y]
#
# Best alignment score appears in matrix[$len1][$len2].
for($x = 1; $x < $len1; $x++){
for($y = 1; $y < $len2; $y++){
$val1 = $matrix[$x][$y-1] + $gapscore;
$val2 = $matrix[$x-1][$y] + $gapscore;
if (substr($seq1, $x, 1) eq substr($seq2, $y, 1)){
$val3 = $matrix[$x-1][$y-1] + $matchscore;
}
else{
$val3 = $matrix[$x-1][$y-1] + $mismatchscore;
}
if (($val1 >= $val2) && ($val1 >= $val3)){
$matrix[$x][$y] = $val1;
}
elsif (($val2 >= $val1) && ($val2 >= $val3)){
$matrix[$x][$y] = $val2;
}
else{
$matrix[$x][$y] = $val3;
}
}
}
# Display scoring matrix
print "MATRIX:\n";
for($x = 0; $x < $len1; $x++){
for($y = 0; $y < $len2; $y++){
print "$matrix[$x][$y] ";
}
print "\n";
}
print "\n";
# STEP 2:
# Begin at matrix[$len1][$len2] and find a path to
# matrix[0][0].
# Build string to hold path pattern by concatenating either
# 'H' (current cell produced by cell horizontally to left),
# 'D' (current cell produced by cell on diagonal),
# 'V' (current cell produced by cell vertically above)
# ***This is were I need help I need this code to be recursive, so I can find more then one path***
$pathrow = $len1-1;
$pathcol = $len2-1;
while (($pathrow != 0) || ($pathcol != 0)){
if ($pathrow == 0){
# must be from cell to left
$path = $path . 'H';
$pathcol = $pathcol - 1;
}
elsif ($pathcol == 0){
# must be from cell above
$path = $path . 'V';
$pathrow = $pathrow - 1;
}
# could be from any direction
elsif (($matrix[$pathrow][$pathcol-1] + $gapscore) == $matrix[$pathrow][$pathcol]){
# from left
$path = $path . 'H';
$pathcol = $pathcol - 1;
}
elsif (($matrix[$pathrow-1][$pathcol] + $gapscore) == $matrix[$pathrow][$pathcol]){
# from above
$path = $path . 'V';
$pathrow = $pathrow - 1;
}
else{
# must be from diagonal
$path = $path . 'D';
$pathrow = $pathrow - 1;
$pathcol = $pathcol - 1;
}
}
print "Path is $path\n";
# STEP 3:
# Determine alignment pattern by reading path string
# created in step 2.
# Create two string variables ($alignseq1 and $alignseq2) to hold
# the sequences for alignment.
# Reading backwards from $seq1, $seq2 and path string,
# if string character is 'D', then
# concatenate to front of $alignseq1, last char in $seq1
# and to the front of $alignseq2, last char in $seq2
# if string character is 'V', then
# concatenate to front of $alignseq1, last char in $seq1
# and to the front of $alignseq2, the gap char
# if string character is 'H', then
# concatenate to front of $alignseq1 the gap char
# and to the front of $alignseq2, last char in $seq2
# Continue process until path string has been traversed.
# Result appears in string $alignseq1 and $seq2
***#I need this code to be recursive as well.***
$seq1loc = $len1-1;
$seq2loc = $len2-1;
$pathloc = 0;
print length($path);
while ($pathloc < length($path)){
if (substr($path, $pathloc, 1) eq 'D'){
$alignseq1 = substr($seq1, $seq1loc, 1) . $alignseq1;
$alignseq2 = substr($seq2, $seq2loc, 1) . $alignseq2;
$seq1loc--;
$seq2loc--;
}
elsif (substr($path, $pathloc, 1) eq 'V'){
$alignseq1 = substr($seq1, $seq1loc, 1) . $alignseq1;
$alignseq2 = '-' . $alignseq2;
$seq1loc--;
}
else{ # must be an H
$alignseq1 = '-' . $alignseq1;
$alignseq2 = substr($seq2, $seq2loc, 1) . $alignseq2;
$seq2loc--;
}
$pathloc++;
}
print "\nAligned Sequences:\n";
print "$alignseq2 \n";
print "$alignseq1 \n";
# statement may be needed to hold output screen
print "Press any key to exit program";
$x = <STDIN>;
Run Code Online (Sandbox Code Playgroud)
如果有人想知道这是一个needleman-wunsch算法.这里的任何帮助都会得到很大的帮助.
我无法提供答案,因为我不确切地知道你要做什么,但我可以提供一些一般的建议.
开始将代码组织到执行狭义定义任务的离散子例程中.此外,实现中央算法的子程序不应该面向从键盘接收输入并产生输出到屏幕; 相反,他们应该接收输入作为参数并返回他们的结果.如果需要用户输入或屏幕输出,那些任务应该在单独的子程序中,而不是与主算法混合.
该路径的第一个(和部分)步骤是将整个程序带入,将其包含在子例程定义中,然后使用所需的参数调用子例程.除了打印其主要成果,子程序应该回到他们-具体而言,一提到@matrix与该值一起$path,$alignseq1,$alignseq2.
sub NW_algo {
my ($seq1, $seq2, $matchscore, $mismatchscore, $gapscore) = @_;
# The rest of your code here, but with all print
# statements and <STDIN> inputs commented out.
return \@matrix, $path, $alignseq1, $alignseq2;
}
my(@return_values) = NW_algo('CGCA', 'CACGTAT', 1, 0, -1);
Print_matrix($return_values[0]);
sub Print_matrix {
for my $m ( @{$_[0]} ){
print join(' ', @$m), "\n";
}
}
Run Code Online (Sandbox Code Playgroud)
此时,您将拥有一个可以被其他代码调用的算法,从而可以更轻松地测试和调试您的程序.例如,您可以定义各种输入数据集并NW_algo()在每个集上运行.只有这样才能考虑递归或其他技术.
由于Needleman-Wunsch是一种动态编程算法,因此大部分工作已经在您计算DP矩阵时完成.一旦你有了DP矩阵,你应该回溯矩阵以找到最佳对齐.问题有点像出租车几何,除了你可以对角移动.基本上,当你需要通过矩阵回溯时,不是在向上,向左或对角线之间进行选择,而是通过进行三次递归调用来完成所有三个,并且每个调用自己为上,左或对角线中的每一个调用自己,直到你到达了起点.每个递归链跟踪的路径将绘制每个对齐.
编辑:所以基本上你需要把步骤2放在一个子程序(占据位置和到目前为止的路径),所以它可以一遍又一遍地调用自己.当然,在定义过程之后,您需要调用它来实际启动跟踪过程:
sub tracePaths {
$x = shift;
$y = shift;
$pathSoFar = shift; # assuming you're storing your path as a string
#
# ... do some work ...
#
tracePaths($x - 1, $y, $pathSoFar . something);
tracePaths($x, $y - 1, $pathSoFar . somethingelse);
tracePaths($x - 1, $y - 1, $pathSoFar . somethingelselse);
#
#
#
if(reached the end) return $pathSoFar;
}
# launch the recursion
tracePaths(beginningx, beginningy, "");
Run Code Online (Sandbox Code Playgroud)