Fastest way to add Ns to variable-length sequences so they are all 150 bp

use*_*519 5 bioinformatics dna-sequence biopython

Suppose I have a FASTA file containing 3 sequences:

ATTTTTGGA
AT
A

I would like my sequence data to look like this:

ATTTTTGGA
ATNNNNNNN
ANNNNNNNN

Is there a program or script that can do this in a reasonable amount of time? I have many thousands of sequences. Thanks!

I was messing around and tried the following. The output file ends up empty, but this is what I have so far.

import sys
from Bio import SeqIO
from Bio.Seq import Seq
in_file = open(sys.argv[1],'r')
sequences = SeqIO.parse(in_file, "fasta")
output_in_file = open("test.fasta", "w")
for record in sequences:
    n = 150
    record.seq = record.seq + ("N" * n)
    seq = seq[:n]
output_in_file.close()
in_file.close()

Jos*_* M. 4

Improving on your code:

import sys
from Bio import SeqIO
from Bio.Seq import Seq
with open(sys.argv[1], "r") as in_file:
    sequences = list(SeqIO.parse(in_file, "fasta"))
    n = max(map(len, sequences))   #find max len in sequences
    for record in sequences:
        record.seq = record.seq + ("N" * (n-len(record)))
    SeqIO.write(sequences, "test.fasta", "fasta")

This gives you, in test.fasta:

>id_1
ATTTTTGGA
>id_2
ATNNNNNNN
>id_3
ANNNNNNNN

For "all equal to 150 bp":

import sys
from Bio import SeqIO
from Bio.Seq import Seq
with open(sys.argv[1], "r") as in_file:
    sequences = list(SeqIO.parse(in_file, "fasta"))
    n = 150
    for record in sequences:
        record.seq = record.seq + ("N" * (n-len(record)))
    SeqIO.write(sequences, "test.fasta", "fasta")

This gives you:

>id_1
ATTTTTGGANNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
NNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
>id_2
ATNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
NNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
>id_3
ANNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
NNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
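
Note that the 150 bp version does not shorten anything longer than 150 bp: `"N" * (n - len(record))` is an empty string when the difference is negative, so those records are written unchanged. Below is a minimal sketch, assuming over-long sequences should be cut to 150 bp (as the `seq[:n]` slice in the question suggests). `pad_or_trim` is a hypothetical helper name, not part of Biopython, and it works on plain strings:

```python
def pad_or_trim(seq, n=150, fill="N"):
    """Return seq right-padded with `fill` to length n, or cut to n if longer."""
    # str.ljust pads on the right and returns longer strings unchanged,
    # so the slice afterwards takes care of sequences longer than n.
    return seq.ljust(n, fill)[:n]


if __name__ == "__main__":
    # The three sequences from the question, padded to 9 bp for readability.
    for s in ["ATTTTTGGA", "AT", "A"]:
        print(pad_or_trim(s, n=9))
```

Inside the loop above it could be applied as `record.seq = Seq(pad_or_trim(str(record.seq)))`, which is also where the `Seq` import would actually get used.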