我有两个文件,一个有位置信息,另一个是序列信息.现在我需要读取位置并在位置上取snps并用序列中的snp信息替换该位置基数并将其写入snp信息文件中...例如
Snp文件包含
10 A C A/C
Run Code Online (Sandbox Code Playgroud)
序列文件包含
ATCGAACTCTACATTAC
Run Code Online (Sandbox Code Playgroud)
这里第10个元素是T所以我将用[A/C]替换T,所以最终的输出应该是
10 A C A/C ATCGAACTC[A/C]ACATTAC
Run Code Online (Sandbox Code Playgroud)
示例文件是
Snp文件
SNP Ref Alt
10 A C
19 G C
30 C T
42 A G
Run Code Online (Sandbox Code Playgroud)
序列 :
序列1 CTAGAATCAAAGCAAGAATACACTCTTTTTTTTGGAAAAGAATATCTCATGTTTGCTCTT
最终输出:
SNP Ref Alt Output
10 A C CTAGAATCA[A/C]AGCAAGAANACACTCTTTTTTTTGGAAAAGAATATCTCATGTTTGCTCTT
19 G C CTAGAATCANAGCAAGAA[C/G]ACACTCTTTTTTTTGGAAAAGAATATCTCATGTTTGCTCTT
30 C T CTAGAATCAAAGCAAGAATACACTCTTTT[T/C]TTTGGAAAAGAATATCTCATGTTTGCTCTT
42 A G CTAGAATCAAAGCAAGAATACACTCTTTTTTTTGGAAAAGA[A/G]TATCTCATGTTTGCTCTT
Run Code Online (Sandbox Code Playgroud)
在从Ref和Alt列替换这里的snps时,我们需要考虑{A,T,C,G}的顺序,就像[Ref/Alt]一样,第一个基数应该是A或T或C,然后是订购.
另一件事是如果我们采取snp位置,并且如果有10个碱基差异的任何snps,我们需要用"N"替换该snp位置.在上面的例子中,前两个位置差异为9,我们用'N'替换另一个元素.
我已编写代码,按顺序打印位置,并用snp位置替换序列但无法读取附近位置并替换为N.
我的方法可能是完全错误的,因为我是编码的初学者.我认为通过使用哈希,我们可能很容易实现这一点,但我不太熟悉哈希..帮助请一些建议...我不必坚持只有perl ,
my $input_file = $ARGV[0];
my $snp_file = $ARGV[1];
my $output_file = $ARGV[2];
%sequence_hash = ();
open SNP, $snp_file || die $!;
$indel_count = 0;
$snp_count = 0;
$total_count = 0;
#### hashes and array
@id_array = ();
while ($line = <SNP>) {
next if $line =~ /^No/;
$line =~ s/\r|\n//g;
# if ($line =~ /\tINDEL/) {
# $indel_count++;
# $snp_type = "INDEL";
#} else {
# $snp_count++;
# $snp_type = "SNP";
#}
@parts = split(/\t/,$line);
$id = $parts[0];
$pos = $parts[1];
#$ref_base = $parts[3];
@temp_ref = split(",",$parts[2]);
$ref_base = $temp_ref[0];
@alt = split(",",$parts[3]);
$alt_base = $alt[0];
$snpformat = '';
if ($ref_base eq "A" || $alt_base eq "A")
{
if ($ref_base eq "A"){
$snpformat = "[".join("/",$ref_base,$alt_base)."]";}
#$snpformat = $ref_base/$alt_base;}
#print "refbase is A $ref_base \t $alt_base \t $snpformat \n"; }
else
{$snpformat = "[".join("/",$alt_base,$ref_base)."]";}
#print "Altbase is A $ref_base \t $alt_base \t $snpformat \n";}
}
elsif ($ref_base eq "T" || $alt_base eq "T")
{
if ($ref_base eq "T"){
$snpformat = "[".join("/",$ref_base,$alt_base)."]";}
#$snpformat = $ref_base/$alt_base;}
#print "refbase is A $ref_base \t $alt_base \t $snpformat \n"; }
else
{$snpformat = "[".join("/",$alt_base,$ref_base)."]";}
#print "Altbase is A $ref_base \t $alt_base \t $snpformat \n";}
}
elsif ($ref_base eq "C" || $alt_base eq "C")
{
if ($ref_base eq "C"){
$snpformat = "[".join("/",$ref_base,$alt_base)."]";}
#$snpformat = $ref_base/$alt_base;}
#print "refbase is A $ref_base \t $alt_base \t $snpformat \n"; }
else
{$snpformat = "[".join("/",$alt_base,$ref_base)."]";}
#print "Altbase is A $ref_base \t $alt_base \t $snpformat \n";}
}
else
{$snpformat = "-/-" ;}
print " $id \t $pos \t $ref_base \t $alt_base \t $snpformat \n ";
}
open SEQ, $input_file ||die $!;
$header = '';
$sequence = '';
$num_sequences = 0;
while ($line = <SEQ>) {
$line =~ s/\r|\n//g;
next if $line =~ //;
if ($line =~ /^>(.+)/) {
if ($header eq '') {
$header = $1;
$sequence = '';
next;
} else {
$sequence_len = length($sequence);
$sequence_hash{$header} = $sequence;
push (@headers,$header);
#print $header."\t".$sequence_len."\n";
#print $sequence."\n";
$num_sequences++;
$header = $1;
$sequence = '';
}
} else {
$sequence .= $line;
}
}
$sequence_len = length($sequence);
$sequence_hash{$header} = $sequence;
push (@headers,$header);
#print $header."\t".$sequence_len."\n";
$num_sequences++;
close (SEQ);
$pos = '4';
substr($sequence,$pos,1) = "[A/G]";
print $sequence."\n";
print "$pos \n";
Run Code Online (Sandbox Code Playgroud)
我不是 Perl 专家,但我认为这样做可以:
#!/usr/bin/perl
open(SEQ, "seq");
my $seq = <SEQ>;
$seq =~ s/.* //;
print "SNP Ref Alt Output\n";
open(SNP, "snp");
<SNP>;# header line
while(<SNP>)
{
my($line) = $_;
chomp($line);
my ($loc, $ref, $alt) = split(/ +/, $line);
my $outString = $seq;
substr($outString, $loc-1, 1, "[$ref/$alt]");
print $loc." ".$ref." ".$alt." ".$outString."\n";
}
Run Code Online (Sandbox Code Playgroud)