如何使用Python使用fasta之类的标头将字符串序列格式化为预定长度

sca*_*der 2 python string bioinformatics python-3.x

我有一个名为的文本文件input.txt,看起来像这样。

A C H E C Q D S S C H H C R Q K L E D T S C H L E D V G K M
N T Y H C G E G I N N G P N A S C K F M L P C V V A E F E N H T
E T D W R C K L E A E H C D C K D A A V N H H F Y S L C K D V T E E W
Run Code Online (Sandbox Code Playgroud)

请注意,上面的输入有 3 行氨基酸序列。

我有一个名为 input.txt 的文本文件,如下所示。

A C H E C Q D S S C H H C R Q K L E D T S C H L E D V G K M
N T Y H C G E G I N N G P N A S C K F M L P C V V A E F E N H T
E T D W R C K L E A E H C D C K D A A V N H H F Y S L C K D V T E E W
Run Code Online (Sandbox Code Playgroud)

请注意,上面的输入有 3 行氨基酸序列。

我想将其转换为下面的格式。

<|endoftext|>
ACHECQDSSCHHCRQKLEDTSCHLEDVGKM
<|endoftext|>
NTYHCGEGINNGPNASCKFMLPCVVAEFEN
HT
<|endoftext|>
ETDWRCKLEAEHCDCKDAAVNHHFYSLCKD
VTEEW
Run Code Online (Sandbox Code Playgroud)

氨基酸序列的每个开头都应以此字符串“<|endoftext|>”开头,并且每行新行不应超过 30 个氨基酸。

我有这段代码,但它不起作用:

def process_amino_acids(file_name):
    with open(file_name, "r") as file:
        data = file.read().replace("\n", "").replace(" ", "")

    output = "<|endoftext|>"
    for i, amino_acid in enumerate(data):
        if i % 30 == 0 and i != 0:
            output += "\n"
        output += amino_acid
    return output


def main():
    input_file = "data/input.txt"
    processed_amino_acids = process_amino_acids(input_file)

    with open("data/output.txt", "w") as output_file:
        output_file.write(processed_amino_acids)

    print("Formatted amino acid sequences are written to output.txt")


if __name__ == "__main__":
    main()
Run Code Online (Sandbox Code Playgroud)

它给出的输出是这样的:

<|endoftext|>ACHECQDSSCHHCRQKLEDTSCHLEDVGKM
NTYHCGEGINNGPNASCKFMLPCVVAEFEN
HTETDWRCKLEAEHCDCKDAAVNHHFYSLC
KDVTEEW
Run Code Online (Sandbox Code Playgroud)

我怎样才能用Python正确地做到这一点?

moz*_*way 5

Python 附带电池,使用textwrap.wrap

from textwrap import wrap

with (open('input.txt') as f,
      open('output.txt', 'w') as out
     ):
    for line in f:
        line = line.replace(' ', '')
        out.write('<|endoftext|>\n')
        out.write('\n'.join(wrap(line, width=30)))
        out.write('\n')
Run Code Online (Sandbox Code Playgroud)

输出文件:

<|endoftext|>
ACHECQDSSCHHCRQKLEDTSCHLEDVGKM
<|endoftext|>
NTYHCGEGINNGPNASCKFMLPCVVAEFEN
HT
<|endoftext|>
ETDWRCKLEAEHCDCKDAAVNHHFYSLCKD
VTEEW
Run Code Online (Sandbox Code Playgroud)