改进DNA比对去除的代码设计

Mic*_*udl 7 python algorithm optimization dna-sequence

这是一个关于更有效的代码设计的问题:

假设三个对齐的DNA序列(seq1,seq2和seq3;它们是每个串)代表两个基因(gene1和gene2).相对于比对的DNA序列,已知这些基因的起始和终止位置.

# Input
align = {"seq1":"ATGCATGC", # In seq1, gene1 and gene2 are of equal length
         "seq2":"AT----GC",
         "seq3":"A--CA--C"}
annos = {"seq1":{"gene1":[0,3], "gene2":[4,7]},
         "seq2":{"gene1":[0,3], "gene2":[4,7]},
         "seq3":{"gene1":[0,3], "gene2":[4,7]}}
Run Code Online (Sandbox Code Playgroud)

我希望从比对中去除间隙(即破折号)并保持基因起始和终止位置的相对关联.

# Desired output
align = {"seq1":"ATGCATGC", 
         "seq2":"ATGC",
         "seq3":"ACAC"}
annos = {"seq1":{"gene1":[0,3], "gene2":[4,7]},
         "seq2":{"gene1":[0,1], "gene2":[2,3]},
         "seq3":{"gene1":[0,1], "gene2":[2,3]}}
Run Code Online (Sandbox Code Playgroud)

获得所需的输出并不像看起来那么简单.下面我为这个问题写了一些(行编号)伪代码,但肯定有更优雅的设计.

1  measure length of any aligned gene  # take any seq, since all seqs aligned
2  list_lengths = list of gene lengths  # order is important
3  for seq in alignment
4      outseq = ""
5      for each num in range(0, length(seq))  # weird for-loop is intentional
6          if seq[num] == "-"
7              current_gene = gene whose start/stop positions include num
8              subtract 1 from length of current_gene
9              subtract 1 from lengths of all genes following current_gene in list_lengths
10         else
11             append seq[num] to outseq
12     append outseq to new variable
13     convert gene lengths into start/stop positions and append ordered to new variable
Run Code Online (Sandbox Code Playgroud)

任何人都可以给我一些更短,更直接的代码设计的建议/示例吗?

Kev*_*vin 3

此答案处理annos从评论到 cdlanes 答案的更新字典。该答案使annos字典中的 的索引为 [2,1] 不正确seq2 gene2gene如果序列包含该区域中的所有间隙,我提出的解决方案将从字典中删除该条目。另请注意,如果一个基因在最后仅包含一个字母align,那么anno[geneX]开始和停止的索引将相同 - >请参阅seq3 gene1您的评论annos

align = {"seq1":"ATGCATGC",
         "seq2":"AT----GC",
         "seq3":"A--CA--C"}

annos = {"seq1":{"gene1":[0,3], "gene2":[4,7]},
         "seq2":{"gene1":[0,3], "gene2":[4,7]},
         "seq3":{"gene1":[0,3], "gene2":[4,7]}}


annos3 = {"seq1":{"gene1":[0,2], "gene2":[3,4], "gene3":[5,7]}, 
          "seq2":{"gene1":[0,2], "gene2":[3,4], "gene3":[5,7]}, 
          "seq3":{"gene1":[0,2], "gene2":[3,4], "gene3":[5,7]}}

import re
for name,anno in annos.items():
    # indices of gaps removed usinig re
    removed = [(m.start(0)) for m in re.finditer(r'-', align[name])]

    # removes gaps from align dictionary
    align[name] = re.sub(r'-', '', align[name])

    build_dna = ''
    for gene,inds in anno.items():

        start_ind = len(build_dna)+1

        #generator to sum the num '-' removed from gene
        num_gaps = sum(1 for i in removed if i >= inds[0] and i <= inds[1])

        # build the de-gapped string
        build_dna+= align[name][inds[0]:inds[1]+1].replace("-", "")

        end_ind = len(build_dna)

        if num_gaps == len(align[name][inds[0]:inds[1]+1]): #gene is all gaps
            del annos[name][gene] #remove the gene entry
            continue
        #update the values in the annos dictionary
        annos[name][gene][0] = start_ind-1
        annos[name][gene][1] = end_ind-1
Run Code Online (Sandbox Code Playgroud)

结果:

In [3]: annos
Out[3]:  {'seq1': {'gene1': [0, 3], 'gene2': [4, 7]},
          'seq2': {'gene1': [0, 1], 'gene2': [2, 3]},
          'seq3': {'gene1': [0, 1], 'gene2': [2, 3]}}
Run Code Online (Sandbox Code Playgroud)

上述3 个基因的结果annos。只需替换annos变量:

In [5]: annos3
Out[5]:  {'seq1': {'gene1': [0, 2], 'gene2': [3, 4], 'gene3': [5, 7]},
          'seq2': {'gene1': [0, 1], 'gene3': [2, 3]},
          'seq3': {'gene1': [0, 0], 'gene2': [1, 2], 'gene3': [3, 3]}}
Run Code Online (Sandbox Code Playgroud)