Rosalind:重叠图

use*_*351 1 python string graph-theory

我在罗莎琳德上遇到了一个问题,我认为我已经正确解决了,但我被告知我的答案不正确。该问题可以在这里找到:http://rosalind.info/problems/grph/

\n\n

它是基本图论,更具体地说,它涉及返回重叠 DNA 字符串的邻接列表。

\n\n

“对于一个字符串集合和一个正整数k,字符串的重叠图是一个有向图Ok,其中每个字符串由一个节点表示,当有长度时,字符串s与字符串t通过有向边连接s 的 k 后缀与 t 的长度 k 前缀匹配,只要 s\xe2\x89\xa0t;我们要求 s\xe2\x89\xa0t 来防止重叠图中出现有向循环(尽管可能存在有向循环)。

\n\n

给定:FASTA 格式的 DNA 字符串集合,总长度最多为 10 kbp。

\n\n

返回: O3对应的邻接表。您可以按任何顺序返回边。”

\n\n

所以,如果你有:

\n\n
\n

Rosalind_0498\n AAATAAA

\n\n

Rosalind_2391\n AAATTTT

\n\n

Rosalind_2323\n TTTTCCC

\n\n

Rosalind_0442\n AAATCCC

\n\n

Rosalind_5013\n GGGTGGG

\n
\n\n

您必须返回:

\n\n

罗莎琳德_0498 罗莎琳德_2391

\n\n

罗莎琳德_0498 罗莎琳德_0442

\n\n

罗莎琳德_2391 罗莎琳德_2323

\n\n

解析包含 DNA 字符串的 FASTA 文件后,我的 python 代码如下:

\n\n
        listTitle = []\n        listContent = []\n\n    #SPLIT is the parsed list of DNA strings\n\n    #here i create two new lists, one (listTitle) containing the four numbers identifying a particular string, and the second (listContent) containing the actual strings (\'>Rosalind_\' has been removed, because it is what I split the file with)\n\n        while i < len(SPLIT):\n            curr = SPLIT[i]\n            title = curr[0:4:1]\n            listTitle.append(title)\n            content = curr[4::1]\n            listContent.append(content)\n            i+=1\n\n        start = []\n        end = []\n\n        #now I create two new lists, one containing the first three chars of the string and the second containing the last three chars, a particular string\'s index will be the same in both lists, as well as in the title list\n\n        for item in listContent:\n            start.append(item[0:3:1])\n            end.append(item[len(item)-3:len(item):1])\n\n        list = []\n\n   #then I iterate through both lists, checking if the suffix and prefix are equal, but not originating from the same string, and append their titles to a last list\n\n        p=0\n        while p<len(end):\n            iterator=0\n            while iterator<len(start):\n                if p!=iterator:\n                    if end[p] == start[iterator]:\n                        one=listTitle[p]\n                        two=listTitle[iterator]\n                        list.append(one)\n                        list.append(two)\n                iterator+=1\n            p+=1\n\n#finally I print the list in the format that they require for the answer\n\n        listInc=0\n\n        while listInc < len(list):\n                print "Rosalind_"+list[listInc]+\' \'+"Rosalind_"+list[listInc+1]\n                listInc+=2\n
Run Code Online (Sandbox Code Playgroud)\n\n

我哪里错了?抱歉,代码有点乏味,我很少接受过 python 培训

\n

jme*_*jme 5

我不确定你的代码有什么问题,但这里有一种可能被认为更“Pythonic”的方法。

我假设您已将数据读入字典中,将名称映射到 DNA 字符串:

{'Rosalind_0442': 'AAATCCC',
 'Rosalind_0498': 'AAATAAA',
 'Rosalind_2323': 'TTTTCCC',
 'Rosalind_2391': 'AAATTTT',
 'Rosalind_5013': 'GGGTGGG'}
Run Code Online (Sandbox Code Playgroud)

我们定义一个简单的函数来检查字符串的后缀是否s1与字符串的前缀k匹配:ks2

def is_k_overlap(s1, s2, k):
    return s1[-k:] == s2[:k]
Run Code Online (Sandbox Code Playgroud)

然后我们查看 DNA 序列的所有组合以找到匹配的组合。这可以通过以下方式轻松实现itertools.combinations

import itertools
def k_edges(data, k):
    edges = []
    for u,v in itertools.combinations(data, 2):
        u_dna, v_dna = data[u], data[v]

        if is_k_overlap(u_dna, v_dna, k):
            edges.append((u,v))

        if is_k_overlap(v_dna, u_dna, k):
            edges.append((v,u))

    return edges
Run Code Online (Sandbox Code Playgroud)

例如,根据上面的数据我们得到:

>>> k_edges(data, 3)
[('Rosalind_2391', 'Rosalind_2323'),
 ('Rosalind_0498', 'Rosalind_2391'),
 ('Rosalind_0498', 'Rosalind_0442')]
Run Code Online (Sandbox Code Playgroud)