How to randomly extract FASTA sequences using Python?

Xio*_*g89 2 python extract bioinformatics extraction fasta

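I have the following sequences in FASTA format, each with a sequence header and its nucleotides. How can I extract sequences at random? For example, I want to randomly select 2 sequences out of the total. The tools available extract by percentage rather than by number of sequences. Can anyone help me?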

A.fasta

>chr1:1310706-1310726
GACGGTTTCCGGTTAGTGGAA
>chr1:901959-901979
GAGGGCTTTCTGGAGAAGGAG
>chr1:983001-983021
GTCCGCTTGCGGGACCTGGGG
>chr1:984333-984353
CTGGAATTCCGGGCGCTGGAG
>chr1:1154147-1154167
GAGATCGTCCGGGACCTGGGT

Expected output

>chr1:1154147-1154167
GAGATCGTCCGGGACCTGGGT
>chr1:901959-901979
GAGGGCTTTCTGGAGAAGGAG

Pad*_*ham 7

If you are working with FASTA files, use BioPython. To get n sequences, use random.sample:

from Bio import SeqIO
from random import sample
with open("foo.fasta") as f:
    seqs = SeqIO.parse(f,"fasta")
    print(sample(list(seqs), 2))

Output:

[SeqRecord(seq=Seq('GAGATCGTCCGGGACCTGGGT', SingleLetterAlphabet()), id='chr1:1154147-1154167', name='chr1:1154147-1154167', description='chr1:1154147-1154167', dbxrefs=[]), SeqRecord(seq=Seq('GTCCGCTTGCGGGACCTGGGG', SingleLetterAlphabet()), id='chr1:983001-983021', name='chr1:983001-983021', description='chr1:983001-983021', dbxrefs=[])]

If necessary, you can extract the strings:

print([(seq.name, str(seq.seq)) for seq in sample(list(seqs), 2)])
[('chr1:1310706-1310726', 'GACGGTTTCCGGTTAGTGGAA'), ('chr1:983001-983021', 'GTCCGCTTGCGGGACCTGGGG')]

If the lines always come in pairs and you skip any metadata at the top, you can zip:

from random import sample

with open("foo.fasta") as f:
    print(sample(list(zip(f, f)), 2))

This gives you the line pairs as tuples:

[('>chr1:983001-983021\n', 'GTCCGCTTGCGGGACCTGGGG\n'), ('>chr1:984333-984353\n', 'CTGGAATTCCGGGCGCTGGAG\n')]
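
The zip approach relies on every record being exactly one header line followed by one sequence line. If the sequences might be wrapped over several lines, a small standard-library parser can join them before sampling. The following is only a sketch under that assumption: `read_fasta` and `sample_fasta` are hypothetical helper names (not part of BioPython or the standard library), and the demo data is a shortened variant of the question's file with the first sequence deliberately wrapped.

```python
import io
import random


def read_fasta(handle):
    """Yield (header, sequence) tuples, joining wrapped sequence lines."""
    header, chunks = None, []
    for line in handle:
        line = line.strip()
        if not line:
            continue  # ignore blank lines
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)  # lines before the first header are skipped
    if header is not None:
        yield header, "".join(chunks)


def sample_fasta(handle, n, seed=None):
    """Return n records chosen at random without replacement."""
    rng = random.Random(seed)  # pass a seed for a reproducible selection
    return rng.sample(list(read_fasta(handle)), n)


# Demo on in-memory data; the first sequence is wrapped over two lines.
demo = io.StringIO(
    ">chr1:1310706-1310726\nGACGGTTTCC\nGGTTAGTGGAA\n"
    ">chr1:901959-901979\nGAGGGCTTTCTGGAGAAGGAG\n"
    ">chr1:983001-983021\nGTCCGCTTGCGGGACCTGGGG\n"
)
for name, seq in sample_fasta(demo, 2, seed=0):
    print(">{}\n{}".format(name, seq))
```

With a real file, the `io.StringIO` object would be replaced by `open("foo.fasta")`. As with random.sample in general, asking for more records than the file contains raises a ValueError.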

To get the lines ready to write:

from Bio import SeqIO
from random import sample
with open("foo.fasta") as f:
    seqs = SeqIO.parse(f, "fasta")
    samps = ((seq.name, seq.seq) for seq in  sample(list(seqs),2))
    for samp in samps:
        print(">{}\n{}".format(*samp))

Output:

>chr1:1310706-1310726
GACGGTTTCCGGTTAGTGGAA
>chr1:983001-983021
GTCCGCTTGCGGGACCTGGGG
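
All of the snippets above call list(...) before sampling, so every record is held in memory at once. For a very large file, one alternative is reservoir sampling (Algorithm R), which keeps only n records while streaming through the input. This is a different technique from the random.sample calls above, shown here as a minimal sketch: `reservoir_sample` is a hypothetical helper name, and the demo assumes two-line records, as in the zip example.

```python
import io
import random


def reservoir_sample(records, k, rng=random):
    """Pick k items uniformly at random from an iterable of unknown length."""
    reservoir = []
    for i, record in enumerate(records):
        if i < k:
            reservoir.append(record)  # fill the reservoir first
        else:
            j = rng.randint(0, i)  # inclusive on both ends
            if j < k:
                reservoir[j] = record  # replace with probability k / (i + 1)
    return reservoir


# Demo on the question's data, read as (header, sequence) line pairs.
f = io.StringIO(
    ">chr1:1310706-1310726\nGACGGTTTCCGGTTAGTGGAA\n"
    ">chr1:901959-901979\nGAGGGCTTTCTGGAGAAGGAG\n"
    ">chr1:983001-983021\nGTCCGCTTGCGGGACCTGGGG\n"
    ">chr1:984333-984353\nCTGGAATTCCGGGCGCTGGAG\n"
    ">chr1:1154147-1154167\nGAGATCGTCCGGGACCTGGGT\n"
)
for header, seq in reservoir_sample(zip(f, f), 2):
    print(header + seq, end="")
```

Unlike random.sample, this sketch does not raise an error when the input holds fewer than k records; it simply returns everything it saw. The same function should also accept the iterator returned by SeqIO.parse, since it only needs something iterable.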