文件 1(仅包含选定的 ID):
AAX50250
AAX50251
AAX50252
AAX50253
AAX50254
AAX50257
AAX50258
Run Code Online (Sandbox Code Playgroud)
文件 2(包含带有 ID 的序列)。这是一个 FASTA 文件,其中以 as 开头的行>是序列 ID,后续行是序列本身。
> AAX50250.1 cds:annotated chromosome:ASM1212v1:Chromosome:1:1770:1 gene:CTA_0001.1 gene_biotype:protein_coding transcript_biotype:protein_coding description:hypothetical protein
ATGAGCATCAGGGGAGTAGGAGGCAACGGGAATAGTCGAATCCCTTCTCATAATGGGGAT
GGATCGAATCGCAGAAGTCAAAATACGAAGAATAAAGTTGAAGATCGAGTTCGTTCTCTA
TATTCATCTCGTAGTAACGAAAATAGAGAATCTCCTTATGCAGTAGTAGACGTCAGCTCT
> AAX50251.1 cds:annotated chromosome:ASM1212v1:Chromosome:1915:2187:-1 gene:CTA_0002.1 gene_biotype:protein_coding transcript_biotype:protein_coding description:hypothetical membrane associated protein
ATGCTTTGTAAAGTTTGTAGAGGATTATCTTCTCTTATTGTTGTTCTCGGAGCTATAAAC
ACTGGAATTTTAGGAGTAACAGGGTATAAGGTAAACCTACTTACTCACCTGCTTGGTGAA
GGAACCATGTGGACACAAGCAGCTTATGTAGTAACGGGAATCGCTGGGGTTATGGTCTGC
> AAX50252.1 cds:annotated chromosome:ASM1212v1:Chromosome:2388:2690:1 gene:CTA_0003.1 gene_biotype:protein_coding transcript_biotype:protein_coding gene_symbol:gatC description:glutamyl-tRNA(Gln) amidotransferase subunit C
ATGACAGAGTCATATGTAAACAAAGAAGAAATCATCTCTTTAGCAAAGAATGCTGCATTG
GAGTTGGAAGATGCCCACGTGGAAGAGTTCGTAACATCTATGAATGACGTCATTGCTTTA
ATGCAGGAAGTAATCGCGATAGATATTTCGGATATCATTCTTGAAGCTACAGTGCATCAT
TTCGTTGGTCCAGAGGATCTTAGAGAAGACATGGTGACTTCGGATTTTACTCAAGAAGAA
TAG
> AAX50262.1 cds:annotated chromosome:ASM1212v1:Chromosome:13318:13932:-1 gene:CTA_0013.1 gene_biotype:protein_coding transcript_biotype:protein_coding gene_symbol:ybbP description:hypothetical membrane spanning protein
ATGTTCGTAGGTATAACGTATTACACCACACCTCTGTTGGAGATAGCTTTAATTTGGGTG
GTCCTTAATTATTTGCTAAAGTTTTTCTGGGGAACAGGCGCCATGGACCTCGTCTTTGGC
TTGTTGTCTTTTCTTTGCCTATTTGTTCTAGCAGAAAAACTTCATCTCCCCGTTATTCGC
AATTTGATGCTTCATGTAGTGAATATTGCGGCTATCGTGGTATTTATTATCTTCCAACCA
GAAATTCGCCTTGCTCTCTCTAGGATACGCTTGTGTAGAGAGAAATTTGTCATCAATATG
Run Code Online (Sandbox Code Playgroud)
所以现在我必须将包含所选 ID 的文件 1 与具有 n 个序列的文件 2 进行比较。我们需要输出与文件 1 中的 ID 关联的序列。
例如,AAX50250文件 1 中的 ID应返回:
>AAX50250.1 cds:annotated chromosome:ASM1212v1:Chromosome:1:1770:1 gene:CTA_0001.1 gene_biotype:protein_coding transcript_biotype:protein_coding description:hypothetical protein
ATGAGCATCAGGGGAGTAGGAGGCAACGGGAATAGTCGAATCCCTTCTCATAATGGGGAT
GGATCGAATCGCAGAAGTCAAAATACGAAGAATAAAGTTGAAGATCGAGTTCGTTCTCTA
TATTCATCTCGTAGTAACGAAAATAGAGAATCTCCTTATGCAGTAGTAGACGTCAGCTCT
ATGATCGAGAGCACCCCAACGAGTGGAGAGACGACAAGAGCTTCGCGTGGAGTATTCAGT
CGTTTCCAAAGAGGTTTAGGACGAGTAGCTGACAAAGTAAGACGAGCTGTTCAGCGTGCG
TGGAGTTCAGTCTCTATAAGAAGATCGTCTGCAACAAGAGCCGCAGAATCCAGATCAAGT
Run Code Online (Sandbox Code Playgroud)
我怎样才能做到这一点?
您可以编写一个脚本来直接执行此操作,但我建议您改用下面的两个小脚本。我花了很多年时间处理这类数据,它们非常非常有用。如果您打算进行大量 FASTA 文件操作,我建议您使用它们。
FastaToTbl:
#!/usr/bin/awk -f
{
if (substr($1,1,1)==">")
if (NR>1)
printf "\n%s\t", substr($0,2,length($0)-1)
else
printf "%s\t", substr($0,2,length($0)-1)
else
printf "%s", $0
}END{printf "\n"}
Run Code Online (Sandbox Code Playgroud)TblToFasta:
#! /usr/bin/awk -f
{
sequence=$NF
ls = length(sequence)
is = 1
fld = 1
while (fld < NF)
{
if (fld == 1){printf ">"}
printf "%s " , $fld
if (fld == NF-1)
{
printf "\n"
}
fld = fld+1
}
while (is <= ls)
{
printf "%s\n", substr(sequence,is,60)
is=is+60
}
}
Run Code Online (Sandbox Code Playgroud)将上面的脚本保存在您的$HOME/bin目录中(如果它不存在,请创建它然后注销并重新登录以将其添加到您的$PATH. 然后使用chmod a+x ~/bin/TblToFasta ~/bin/FastaToTbl.
在FastaToTbl将FASTA序列文件转换为TBL格式(序列ID,然后是制表符和该序列中,所有在一条线)。一旦它们采用 tbl 格式并且您在同一行上拥有 ID 和序列,您就可以使用它grep来搜索您的 ID。然后,您将输出通过TblToFastatoi 转换回 FASTA:
$ FastaToTbl file2 | grep -wFf file1 | TblToFasta
>AAX50250.1 cds:annotated chromosome:ASM1212v1:Chromosome:1:1770:1 gene:CTA_0001.1 gene_biotype:protein_coding transcript_biotype:protein_coding description:hypothetical protein
ATGAGCATCAGGGGAGTAGGAGGCAACGGGAATAGTCGAATCCCTTCTCATAATGGGGAT
GGATCGAATCGCAGAAGTCAAAATACGAAGAATAAAGTTGAAGATCGAGTTCGTTCTCTA
TATTCATCTCGTAGTAACGAAAATAGAGAATCTCCTTATGCAGTAGTAGACGTCAGCTCT
>AAX50251.1 cds:annotated chromosome:ASM1212v1:Chromosome:1915:2187:-1 gene:CTA_0002.1 gene_biotype:protein_coding transcript_biotype:protein_coding description:hypothetical membrane associated protein
ATGCTTTGTAAAGTTTGTAGAGGATTATCTTCTCTTATTGTTGTTCTCGGAGCTATAAAC
ACTGGAATTTTAGGAGTAACAGGGTATAAGGTAAACCTACTTACTCACCTGCTTGGTGAA
GGAACCATGTGGACACAAGCAGCTTATGTAGTAACGGGAATCGCTGGGGTTATGGTCTGC
>AAX50252.1 cds:annotated chromosome:ASM1212v1:Chromosome:2388:2690:1 gene:CTA_0003.1 gene_biotype:protein_coding transcript_biotype:protein_coding gene_symbol:gatC description:glutamyl-tRNA(Gln) amidotransferase subunit C
ATGACAGAGTCATATGTAAACAAAGAAGAAATCATCTCTTTAGCAAAGAATGCTGCATTG
GAGTTGGAAGATGCCCACGTGGAAGAGTTCGTAACATCTATGAATGACGTCATTGCTTTA
ATGCAGGAAGTAATCGCGATAGATATTTCGGATATCATTCTTGAAGCTACAGTGCATCAT
TTCGTTGGTCCAGAGGATCTTAGAGAAGACATGGTGACTTCGGATTTTACTCAAGAAGAA
TAG
Run Code Online (Sandbox Code Playgroud)
使用的grep选项是:
-F, --fixed-strings
Interpret PATTERN as a list of fixed strings (instead of regular
expressions), separated by newlines, any of which is to be matched.
-f FILE, --file=FILE
Obtain patterns from FILE, one per line. If this option is used
multiple times or is combined with the -e (--regexp) option, search
for all patterns given. The empty file contains zero patterns, and
therefore matches nothing.
-w, --word-regexp
Select only those lines containing matches that form whole words.
The test is that the matching substring must either be at the
beginning of the line, or preceded by a non-word constituent
character. Similarly, it must be either at the end of the line or
followed by a non-word constituent character. Word-constituent
characters are letters, digits, and the underscore.
Run Code Online (Sandbox Code Playgroud)
所以,-f让我们给它一个文件来读取搜索 [patterns from,-F告诉它把模式视为字符串而不是正则表达式(否则,如果你的 ID 包含.他们经常做的,那将被视为“匹配任何字符") 并-w确保我们只匹配整个 ID,因此foo不会匹配foobar.
或者,这是一个快速而肮脏的 perl one liner 来做你想做的事:
perl -e 'open(F,"File1.txt");while(<F>){/(\S+)/; $k{$1}++}; while(<>){if(/>\s*(\S+?)(\.| )/){if($k{$1}){$k=1}else{$k=0}; } print if $k==1;}' File2.fa
Run Code Online (Sandbox Code Playgroud)
或者,稍微不那么脏:
perl -e 'open(F,"File1.txt");
while(<F>){
/(\S+)/;
$k{$1}++
}
while(<>){
if(/>\s*(\S+?)(\.| )/){
if($k{$1}){$k=1}
else{$k=0}
}
print if $k==1
}' File2.fa
Run Code Online (Sandbox Code Playgroud)