I'm learning Python and want to parse a FASTA file without using BioPython. My txt file looks like this:
>22567
CGTGTCCAGGTCTATCTCGGAAATTTGCCGTCGTTGCATTACTGTCCAGCTCCATGCCCA
ACATTTGGCATCGGAGAATGACTCCGCGTGATAAAGTCAGAATAGGCATTGAGACTCAGG
GTGGTACCTATTA
>34454
AAAACTGTGCAGCCGGTAACAGGCCGCGATGCTGTACTATATGTGTTTGGTACATATCCG
ATTCAGGTATGTCAGGGAGCCAGCACCGGAGGATCCAGAAGTAAGTCGGGTTGACTACTC
CTAGCCTCGTTTCACCATCCGCCGGATAACTCTCCCTTCCATCATCAACTCCTCCCTTTC
GTGTCCAATGGGGCGGCGTGTCTAAGCACTGCCATATAGCTACCGAAAGGCGGCGACCCC
TCGGA
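I want to parse this so that the header of each sequence, >22567 and >34454, is saved to a headers list (this part works), and the sequence following each header is read into a sequences list.
I'd like the output to look like: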
headers = ['>22567','>34454']
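sequences = ['CGTGTCCAGGTCTATCTCGGAAATT...', 'AAAACTTTGTGAAAA....']
The problem is with reading the sequence part: I can't figure out how to concatenate the lines into a single sequence string before appending it to the list. Instead, every line is appended to the sequences list on its own.
The code I have so far is: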
#!/usr/bin/python
import re
dna = []
sequences = []
def read_fasta(filename):
    global seq, header, dna, sequences
    #open the file
    with open(filename) as file:
        seq = ''
        #forloop through the lines
        for line in file:
            header = re.search(r'^>\w+', line)
            #if line contains the header '>' then append it to the dna list
            if header:
                line = line.rstrip("\n")
                dna.append(line)
            # in the else statement is where I have problems, what I would like is
            #else:
                #the proceeding lines before the next '>' is the sequence for each header,
                #concatenate these lines into one string and append to the sequences list
            else:
                seq = line.replace('\n', '')
                sequences.append(seq)
filename = 'gc.txt'
read_fasta(filename)
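Note: I had this solution in one of my projects, so I'm pasting it here directly. The solution is not mine, though; it belongs to the original poster. Please upvote his/her answer. Thanks to @donkeykong for finding the original post.
Use a list to accumulate lines until you reach a new ID. Then join those lines together and store the result along with the ID in a dictionary. The following function takes an open file and yields each (id, sequence) pair.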
def read_fasta(fp):
    name, seq = None, []
    for line in fp:
        line = line.rstrip()
        if line.startswith(">"):
            if name: yield (name, ''.join(seq))
            name, seq = line, []
        else:
            seq.append(line)
    if name: yield (name, ''.join(seq))

with open('ex.fasta') as fp:
    for name, seq in read_fasta(fp):
        print(name, seq)
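Output (as printed under Python 2, where print(name, seq) shows a tuple; under Python 3 the two values are printed separated by a space, without parentheses or quotes):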
('>22567', 'CGTGTCCAGGTCTATCTCGGAAATTTGCCGTCGTTGCATTACTGTCCAGCTCCATGCCCAACATTTGGCATCGGAGAATGACTCCGCGTGATAAAGTCAGAATAGGCATTGAGACTCAGGGTGGTACCTATTA')
('>34454', 'AAAACTGTGCAGCCGGTAACAGGCCGCGATGCTGTACTATATGTGTTTGGTACATATCCGATTCAGGTATGTCAGGGAGCCAGCACCGGAGGATCCAGAAGTAAGTCGGGTTGACTACTCCTAGCCTCGTTTCACCATCCGCCGGATAACTCTCCCTTCCATCATCAACTCCTCCCTTTCGTGTCCAATGGGGCGGCGTGTCTAAGCACTGCCATATAGCTACCGAAAGGCGGCGACCCCTCGGA')
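To get the two lists the question asks for, the pairs from the generator can simply be collected and split. The following is a minimal sketch rather than a definitive implementation: it repeats read_fasta so it runs on its own, and the io.StringIO sample with the ids >id1 and >id2 is made-up data standing in for the real file. The names records and by_id are just illustrative.

```python
import io


def read_fasta(fp):
    """Yield (header, sequence) pairs from an open FASTA file object."""
    name, seq = None, []
    for line in fp:
        line = line.rstrip()
        if line.startswith(">"):
            # A new header closes the previous record, if there was one
            if name:
                yield (name, "".join(seq))
            name, seq = line, []
        else:
            seq.append(line)
    # Emit the last record, which has no header after it
    if name:
        yield (name, "".join(seq))


# Made-up sample data (an assumption for illustration, not the real file)
sample = io.StringIO(">id1\nACGT\nTTGA\n>id2\nGGCC\nA\n")

records = list(read_fasta(sample))
headers = [name for name, _ in records]
sequences = [seq for _, seq in records]

# Optional: look sequences up by header, as the text above suggests
by_id = dict(records)

print(headers)
print(sequences)
```

With a real file, replace the StringIO object with the handle from open('gc.txt'). Note that dict(records) keeps only the last sequence if two records share the same header.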
This is an answer from SO. I'll try to find it and give the original poster credit.