使用snp位置修改Sequence并在同一文件中输出

use*_*605 5 perl awk sed

我有两个文件,一个有位置信息,另一个是序列信息.现在我需要读取位置并在位置上取snps并用序列中的snp信息替换该位置基数并将其写入snp信息文件中...例如

Snp文件包含

10 A C A/C
Run Code Online (Sandbox Code Playgroud)

序列文件包含

ATCGAACTCTACATTAC
Run Code Online (Sandbox Code Playgroud)

这里第10个元素是T所以我将用[A/C]替换T,所以最终的输出应该是

10 A C A/C ATCGAACTC[A/C]ACATTAC
Run Code Online (Sandbox Code Playgroud)

示例文件是

Snp文件

SNP Ref Alt
10  A   C
19  G   C
30  C   T
42  A   G
Run Code Online (Sandbox Code Playgroud)

序列 :

序列1 CTAGAATCAAAGCAAGAATACACTCTTTTTTTTGGAAAAGAATATCTCATGTTTGCTCTT

最终输出:

SNP Ref Alt Output
10  A   C   CTAGAATCA[A/C]AGCAAGAANACACTCTTTTTTTTGGAAAAGAATATCTCATGTTTGCTCTT
19  G   C   CTAGAATCANAGCAAGAA[C/G]ACACTCTTTTTTTTGGAAAAGAATATCTCATGTTTGCTCTT
30  C   T   CTAGAATCAAAGCAAGAATACACTCTTTT[T/C]TTTGGAAAAGAATATCTCATGTTTGCTCTT
42  A   G   CTAGAATCAAAGCAAGAATACACTCTTTTTTTTGGAAAAGA[A/G]TATCTCATGTTTGCTCTT
Run Code Online (Sandbox Code Playgroud)

在从Ref和Alt列替换这里的snps时,我们需要考虑{A,T,C,G}的顺序,就像[Ref/Alt]一样,第一个基数应该是A或T或C,然后是订购.

另一件事是如果我们采取snp位置,并且如果有10个碱基差异的任何snps,我们需要用"N"替换该snp位置.在上面的例子中,前两个位置差异为9,我们用'N'替换另一个元素.

我已编写代码,按顺序打印位置,并用snp位置替换序列但无法读取附近位置并替换为N.

我的方法可能是完全错误的,因为我是编码的初学者.我认为通过使用哈希,我们可能很容易实现这一点,但我不太熟悉哈希..帮助请一些建议...我不必坚持只有perl ,

my $input_file = $ARGV[0];
my $snp_file = $ARGV[1];
my $output_file = $ARGV[2];

%sequence_hash = ();

open SNP, $snp_file || die $!;
$indel_count = 0;
$snp_count = 0;
$total_count = 0;

#### hashes and array
@id_array = ();

while ($line = <SNP>) {

    next if $line =~ /^No/;
    $line =~ s/\r|\n//g;


   # if ($line =~ /\tINDEL/) {

    #    $indel_count++;
     #   $snp_type = "INDEL";

    #} else {
     #   $snp_count++;
      #  $snp_type = "SNP";
    #}

    @parts = split(/\t/,$line);

    $id = $parts[0];
    $pos = $parts[1];
    #$ref_base = $parts[3];
    @temp_ref = split(",",$parts[2]);
    $ref_base = $temp_ref[0];
    @alt = split(",",$parts[3]);
    $alt_base = $alt[0];
    $snpformat = '';

    if ($ref_base eq "A" || $alt_base eq "A")
    {

        if ($ref_base eq "A"){
            $snpformat = "[".join("/",$ref_base,$alt_base)."]";}
            #$snpformat = $ref_base/$alt_base;}
            #print "refbase is A $ref_base \t $alt_base \t $snpformat \n"; }

            else 
            {$snpformat = "[".join("/",$alt_base,$ref_base)."]";}
            #print "Altbase is A $ref_base \t $alt_base \t $snpformat \n";}
    }

    elsif ($ref_base eq "T" || $alt_base eq "T")
    {

        if ($ref_base eq "T"){
            $snpformat = "[".join("/",$ref_base,$alt_base)."]";}
            #$snpformat = $ref_base/$alt_base;}
            #print "refbase is A $ref_base \t $alt_base \t $snpformat \n"; }

            else 
            {$snpformat = "[".join("/",$alt_base,$ref_base)."]";}
            #print "Altbase is A $ref_base \t $alt_base \t $snpformat \n";}
    }

    elsif ($ref_base eq "C" || $alt_base eq "C")
    {

        if ($ref_base eq "C"){
            $snpformat = "[".join("/",$ref_base,$alt_base)."]";}
            #$snpformat = $ref_base/$alt_base;}
            #print "refbase is A $ref_base \t $alt_base \t $snpformat \n"; }

            else 
            {$snpformat = "[".join("/",$alt_base,$ref_base)."]";}
            #print "Altbase is A $ref_base \t $alt_base \t $snpformat \n";}
    }


    else 
    {$snpformat = "-/-" ;}
    print " $id \t $pos \t $ref_base \t $alt_base \t $snpformat \n  ";
}

open SEQ, $input_file ||die $!;

$header = '';
$sequence = '';
$num_sequences = 0;

while ($line = <SEQ>) {

    $line =~ s/\r|\n//g;
    next if $line =~ //;

    if ($line =~ /^>(.+)/) {
        if ($header eq '') {

            $header = $1;
            $sequence = '';
            next;
        } else {

            $sequence_len = length($sequence);

            $sequence_hash{$header} = $sequence;
            push (@headers,$header);
            #print $header."\t".$sequence_len."\n";
            #print $sequence."\n";
            $num_sequences++;

            $header = $1;
            $sequence = '';

        }


    } else {

        $sequence .= $line;

    }

}
$sequence_len = length($sequence);
$sequence_hash{$header} = $sequence;
push (@headers,$header);
#print $header."\t".$sequence_len."\n";

$num_sequences++;

close (SEQ);

$pos = '4';
substr($sequence,$pos,1) = "[A/G]";
print $sequence."\n";   
print "$pos \n";
Run Code Online (Sandbox Code Playgroud)

Bet*_*eta 0

我不是 Perl 专家,但我认为这样做可以:

#!/usr/bin/perl

open(SEQ, "seq");
my $seq = <SEQ>;
$seq =~ s/.* //;

print "SNP Ref Alt Output\n";
open(SNP, "snp");
<SNP>;# header line
while(<SNP>)
{
    my($line) = $_;
    chomp($line);
    my ($loc, $ref, $alt) = split(/ +/, $line);
    my $outString = $seq;
    substr($outString, $loc-1, 1, "[$ref/$alt]");
    print $loc."  ".$ref."   ".$alt."   ".$outString."\n";
}
Run Code Online (Sandbox Code Playgroud)