mac*_*age 2 python parsing bioinformatics fasta biopython
我是 Biopython 的新手(以及一般编码),我正在尝试编写一种方法,将一系列 DNA 序列(超过 80 个)翻译成蛋白质序列,在一个单独的 FASTA 文件中。我还想在正确的阅读框中找到序列。
这是我到目前为止所拥有的:
from Bio import SeqIO
from Bio.SeqRecord import SeqRecord
for record in SeqIO.parse("dnaseq.fasta", "fasta"):
protein_id = record.id
protein1 = record.seq.translate(to_stop=True)
protein2 = record.seq[1:].translate(to_stop=True)
protein3 = record.seq[2:].translate(to_stop=True)
if len(protein1) > len(protein2) and len(protein1) > len(protein3):
protein = protein1
elif len(protein2) > len(protein1) and len(protein2) > len(protein3):
protein = protein2
else:
protein = protein3
def prot_record(record):
return SeqRecord(seq = protein, \
id = ">" + protein_id, \
description = "translated sequence")
records = map(prot_record, SeqIO.parse("dnaseq.fasta", "fasta"))
SeqIO.write(records, "AAseq.fasta", "fasta")
Run Code Online (Sandbox Code Playgroud)
我当前代码的问题是,虽然它似乎可以工作,但它只给出输入文件的最后一个序列。谁能帮我弄清楚如何编写所有序列?
谢谢!
正如其他人所提到的,您的代码在尝试写入结果之前会遍历整个输入。我想建议如何使用流媒体方法来做到这一点:
from Bio import SeqIO
from Bio.SeqRecord import SeqRecord
with open("AAseq.fasta", 'w') as aa_fa:
for dna_record in SeqIO.parse("dnaseq.fasta", 'fasta'):
# use both fwd and rev sequences
dna_seqs = [dna_record.seq, dna_record.seq.reverse_complement()]
# generate all translation frames
aa_seqs = (s[i:].translate(to_stop=True) for i in range(3) for s in dna_seqs)
# select the longest one
max_aa = max(aa_seqs, key=len)
# write new record
aa_record = SeqRecord(max_aa, id=dna_record.id, description="translated sequence")
SeqIO.write(aa_record, aa_fa, 'fasta')
Run Code Online (Sandbox Code Playgroud)
这里的主要改进是:
if...elif...else
通过使用max
键来避免结构。