小编joh*_*ohn的帖子

从FASTA文本文件在python中创建一个列表

我有像这个小例子的文本文件:

>ENST00000491024.1|ENSG00000187583.6|OTTHUMG00000040756.4|OTTHUMT00000097942.2|PLEKHN1-003|PLEKHN1|176
SLESSPDAPDHTSETSHSPLYADPYTPPATSHRRVTDVRGLEEFLSAMQSARGPTPSSPLPSVPVSVPASDPRSCSSGPAGPYLLSKKGALQSRAAQRHRGSAKDGGPQPPDAPQLVSSAREGSPEPWLPLTDGRSPRRSRDPGYDHLWDETLSSSHQKCPQLGGPEASGGLVQWI
>ENST00000433179.2|ENSG00000187642.5|OTTHUMG00000040757.3|-|C1orf170-201|C1orf170|696
MPTQDGQLRRPARPPGPRAWMEPRGGGSSQFSSCPGPASSGDQMQRLLQGPAPRPPGEPPGSPKSPGHSTGSQRPPDSPGAPPRSPSRKKRRAVGAKGGGHTGASASAQTGSPLLPAASPETAKLMAKAGQEELGPGPAGAPEPGPRSPVQEDRPGPGLGLSTPVPVTEQGTDQIRTPRRAKLHTVSTTVWEALPDVSRAKSDMAVSTPASEPQPDRDMAVSTPASEPQSDRDMAVSTPASEPQPDTDMAVSTPASEPQPDRDMAVSIPASKPQSDTAVSTPASEPQSSVALSTPISKPQLDTDVAVSTPASKHGLDVALPTAGPVAKLEVASSPPVSEAVPRMTESSGLVSTPVPRADAAGLAWPPTRRAGPDVVEMEAVVSEPSAGAPGCCSGAPALGLTQVPRKKKVRFSVAGPSPNKPGSGQASARPSAPQTATGAHGGPGAWEAVAVGPRPHQPRILKHLPRPPPSAVTRVGPGSSFAVTLPEAYEFFFCDTIEENEEAEAAAAGQDPAGVQWPDMCEFFFPDVGAQRSRRRGSPEPLPRADPVPAPIPGDPVPISIPEVYEHFFFGEDRLEGVLGPAVPLPLQALEPPRSASEGAGPGTPLKPAVVERLHLALRRAGELRGPVPSFAFSQNDMCLVFVAFATWAVRTSDPHTPDAWKTALLANVGTISAIRYFRRQVGQGRRSHSPSPSS
>ENST00000341290.2|ENSG00000187642.5|OTTHUMG00000040757.3|OTTHUMT00000097943.2|C1orf170-001|C1orf170|676
MEPRGGGSSQFSSCPGPASSGDQMQRLLQGPAPRPPGEPPGSPKSPGHSTGSQRPPDSPGAPPRSPSRKKRRAVGAKGGGHTGASASAQTGSPLLPAASPETAKLMAKAGQEELGPGPAGAPEPGPRSPVQEDRPGPGLGLSTPVPVTEQGTDQIRTPRRAKLHTVSTTVWEALPDVSRAKSDMAVSTPASEPQPDRDMAVSTPASEPQSDRDMAVSTPASEPQPDTDMAVSTPASEPQPDRDMAVSIPASKPQSDTAVSTPASEPQSSVALSTPISKPQLDTDVAVSTPASKHGLDVALPTAGPVAKLEVASSPPVSEAVPRMTESSGLVSTPVPRADAAGLAWPPTRRAGPDVVEMEAVVSEPSAGAPGCCSGAPALGLTQVPRKKKVRFSVAGPSPNKPGSGQASARPSAPQTATGAHGGPGAWEAVAVGPRPHQPRILKHLPRPPPSAVTRVGPGSSFAVTLPEAYEFFFCDTIEENEEAEAAAAGQDPAGVQWPDMCEFFFPDVGAQRSRRRGSPEPLPRADPVPAPIPGDPVPISIPEVYEHFFFGEDRLEGVLGPAVPLPLQALEPPRSASEGAGPGTPLKPAVVERLHLALRRAGELRGPVPSFAFSQNDMCLVFVAFATWAVRTSDPHTPDAWKTALLANVGTISAIRYFRRQVGQGRRSHSPSPSS
>ENST00000428771.2|ENSG00000188290.6|OTTHUMG00000040758.2|OTTHUMT00000097945.2|HES4-002|HES4|247
MAADTPGKPSASPMAGAPASASRTPDKPRSAAEHRKVGSRPGVRGATGGREGRGTQPVPDPQSSKPVMEKRRRARINESLAQLKTLILDALRKESSRHSKLEKADILEMTVRHLRSLRRVQVTAALSADPAVLGKYRAGFHECLAEVNRFLAGCEGVPADVRSRLLGHLAACLRQLGPSRRPASLSPAAPAEAPAPEVYAGRPLLPSLGGPFPLLAPPLLPGLTRALPAAPRAGPQGPGGPWRPWLR
Run Code Online (Sandbox Code Playgroud)

该文件被分成不同的组.每组有2个部分.第1部分开始,">"并且该部分中的元素被分割,"|"并且之后的线是第2部分.我试图从我的文件中创建一个Python列表,其中包含每个组ID部分的第6个元素.以下是小例子的预期输出:

list = ["PLEKHN1", "C1orf170", "C1orf170", "HES4"]
Run Code Online (Sandbox Code Playgroud)

我试图先导入一个字典,然后使用以下方法创建一个像预期输出的列表:

from itertools import groupby
with open('infile.txt') as f:
    groups = groupby(f, key=lambda x: not x.startswith(">"))
    d = {}
    for k,v in groups:
        if not k:
            key, val = list(v)[0].rstrip(), "".join(map(str.rstrip,next(groups)[1],""))
            d[key] = val

k = d.keys()
res = [el[5:] for s in k for el in s.split('|')]
Run Code Online (Sandbox Code Playgroud)

但它不会返回我想要的东西.你知道怎么解决吗?

python bioinformatics fasta biopython

4
推荐指数
1
解决办法
146
查看次数

标签 统计

bioinformatics ×1

biopython ×1

fasta ×1

python ×1