Learning to parse a FASTA file with Python

DJF*_*DJF 3 python

I am learning Python and want to parse a FASTA file without using BioPython. My txt file looks like this:

>22567
CGTGTCCAGGTCTATCTCGGAAATTTGCCGTCGTTGCATTACTGTCCAGCTCCATGCCCA
ACATTTGGCATCGGAGAATGACTCCGCGTGATAAAGTCAGAATAGGCATTGAGACTCAGG
GTGGTACCTATTA
>34454
AAAACTGTGCAGCCGGTAACAGGCCGCGATGCTGTACTATATGTGTTTGGTACATATCCG
ATTCAGGTATGTCAGGGAGCCAGCACCGGAGGATCCAGAAGTAAGTCGGGTTGACTACTC
CTAGCCTCGTTTCACCATCCGCCGGATAACTCTCCCTTCCATCATCAACTCCTCCCTTTC
GTGTCCAATGGGGCGGCGTGTCTAAGCACTGCCATATAGCTACCGAAAGGCGGCGACCCC
TCGGA

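I want to parse this so that the header of each sequence, >22567 and >34454, is saved to a headers list (this part works), and the sequence lines that follow each header are read into a sequences list.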

I would like the output to look like this:

headers =  ['>22567','>34454']
sequences = ['CGTGTCCAGGTCTATCTCGGAAATT...', 'AAAACTTTGTGAAAA....']

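The problem is in reading the sequence part: I can't figure out how to concatenate the lines of one record into a single sequence string before appending it to the list. Instead, every individual line ends up appended to the sequences list.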

The code I have so far is:

#!/usr/bin/python 

import re 

dna = []
sequences = []


def read_fasta(filename):
    global seq, header, dna, sequences 

#open the file  
    with open(filename) as file:    
        seq = ''        
        #forloop through the lines
        for line in file: 
            header = re.search(r'^>\w+', line)
            #if line contains the header '>' then append it to the dna list 
            if header:
                line = line.rstrip("\n")
                dna.append(line)            
            # in the else statement is where I have problems, what I would like is
            #else: 
                #the proceeding lines before the next '>' is the sequence for each header,
                #concatenate these lines into one string and append to the sequences list 
            else:               
                seq = line.replace('\n', '')  
                sequences.append(seq)      

filename = 'gc.txt'

read_fasta(filename)

let*_*tsc 6

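Note: I use this solution in one of my projects, so I am pasting it here directly. The solution is not mine, however; it belongs to the poster of the original answer. Please upvote his/her answer. Thanks to @donkeykong for finding the original post.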

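Use a list to accumulate lines until you reach a new ID. Then join those lines together and store the result with the ID, for example in a dictionary. The following function takes an open file and yields each (id, sequence) pair.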

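def read_fasta(fp):
    # name holds the current header line; seq collects its sequence lines
    name, seq = None, []
    for line in fp:
        line = line.rstrip()
        if line.startswith(">"):
            # A new header: emit the previous record, if there was one
            if name:
                yield (name, ''.join(seq))
            name, seq = line, []
        else:
            seq.append(line)
    # Emit the last record once the file is exhausted
    if name:
        yield (name, ''.join(seq))

with open('ex.fasta') as fp:
    for name, seq in read_fasta(fp):
        print(name, seq)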

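Output (shown as Python 2 prints it; under Python 3, print(name, seq) prints the two values separated by a space, without the parentheses and quotes):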

('>22567', 'CGTGTCCAGGTCTATCTCGGAAATTTGCCGTCGTTGCATTACTGTCCAGCTCCATGCCCAACATTTGGCATCGGAGAATGACTCCGCGTGATAAAGTCAGAATAGGCATTGAGACTCAGGGTGGTACCTATTA')
('>34454', 'AAAACTGTGCAGCCGGTAACAGGCCGCGATGCTGTACTATATGTGTTTGGTACATATCCGATTCAGGTATGTCAGGGAGCCAGCACCGGAGGATCCAGAAGTAAGTCGGGTTGACTACTCCTAGCCTCGTTTCACCATCCGCCGGATAACTCTCCCTTCCATCATCAACTCCTCCCTTTCGTGTCCAATGGGGCGGCGTGTCTAAGCACTGCCATATAGCTACCGAAAGGCGGCGACCCCTCGGA')
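To get the two lists the question asks for, the pairs from the generator can simply be appended to headers and sequences. The sketch below is only an illustration under a few assumptions: it repeats the generator so that it runs on its own, and it reads from a made-up in-memory sample (the ids id1 and id2 and their sequences are hypothetical) instead of the real gc.txt. The records dictionary at the end is one possible way to store each sequence by its id.

```python
import io


def read_fasta(fp):
    """Yield (header, sequence) pairs from an open FASTA file-like object."""
    name, seq = None, []
    for line in fp:
        line = line.rstrip()
        if line.startswith(">"):
            # New header: emit the record collected so far, if any
            if name is not None:
                yield name, "".join(seq)
            name, seq = line, []
        else:
            seq.append(line)
    # Emit the final record
    if name is not None:
        yield name, "".join(seq)


# Hypothetical sample data standing in for a real file such as 'gc.txt'
sample = io.StringIO(">id1\nACGT\nTTGA\n>id2\nGGCC\nA\n")

headers, sequences = [], []
for name, seq in read_fasta(sample):
    headers.append(name)
    sequences.append(seq)

print(headers)    # ['>id1', '>id2']
print(sequences)  # ['ACGTTTGA', 'GGCCA']

# Optional: look sequences up by header
records = dict(zip(headers, sequences))
```

With a real file, replace the io.StringIO object with the handle from with open(filename) as fp. Because the function is a generator, only one record is held in memory at a time until you decide to collect them.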

This comes from an SO answer. I will try to find it and give credit to the original poster.

  • The original is [here](/sf/ask/535848001/) (3 upvotes)