Rosalind：重叠图

Question

Rosalind：重叠图

我在罗莎琳德上遇到了一个问题，我认为我已经正确解决了，但我被告知我的答案不正确。该问题可以在这里找到：http://rosalind.info/problems/grph/

\n\n

它是基本图论，更具体地说，它涉及返回重叠 DNA 字符串的邻接列表。

\n\n

“对于一个字符串集合和一个正整数k，字符串的重叠图是一个有向图Ok，其中每个字符串由一个节点表示，当有长度时，字符串s与字符串t通过有向边连接s 的 k 后缀与 t 的长度 k 前缀匹配，只要 s\xe2\x89\xa0t；我们要求 s\xe2\x89\xa0t 来防止重叠图中出现有向循环（尽管可能存在有向循环）。

\n\n

给定：FASTA 格式的 DNA 字符串集合，总长度最多为 10 kbp。

\n\n

返回： O3对应的邻接表。您可以按任何顺序返回边。”

\n\n

所以，如果你有：

\n\n

\n
Rosalind_0498\n AAATAAA
\n\n
Rosalind_2391\n AAATTTT
\n\n
Rosalind_2323\n TTTTCCC
\n\n
Rosalind_0442\n AAATCCC
\n\n
Rosalind_5013\n GGGTGGG
\n

\n\n

您必须返回：

\n\n

罗莎琳德_0498 罗莎琳德_2391

\n\n

罗莎琳德_0498 罗莎琳德_0442

\n\n

罗莎琳德_2391 罗莎琳德_2323

\n\n

解析包含 DNA 字符串的 FASTA 文件后，我的 python 代码如下：

\n\n

        listTitle = []\n        listContent = []\n\n    #SPLIT is the parsed list of DNA strings\n\n    #here i create two new lists, one (listTitle) containing the four numbers identifying a particular string, and the second (listContent) containing the actual strings (\'>Rosalind_\' has been removed, because it is what I split the file with)\n\n        while i < len(SPLIT):\n            curr = SPLIT[i]\n            title = curr[0:4:1]\n            listTitle.append(title)\n            content = curr[4::1]\n            listContent.append(content)\n            i+=1\n\n        start = []\n        end = []\n\n        #now I create two new lists, one containing the first three chars of the string and the second containing the last three chars, a particular string\'s index will be the same in both lists, as well as in the title list\n\n        for item in listContent:\n            start.append(item[0:3:1])\n            end.append(item[len(item)-3:len(item):1])\n\n        list = []\n\n   #then I iterate through both lists, checking if the suffix and prefix are equal, but not originating from the same string, and append their titles to a last list\n\n        p=0\n        while p<len(end):\n            iterator=0\n            while iterator<len(start):\n                if p!=iterator:\n                    if end[p] == start[iterator]:\n                        one=listTitle[p]\n                        two=listTitle[iterator]\n                        list.append(one)\n                        list.append(two)\n                iterator+=1\n            p+=1\n\n#finally I print the list in the format that they require for the answer\n\n        listInc=0\n\n        while listInc < len(list):\n                print "Rosalind_"+list[listInc]+\' \'+"Rosalind_"+list[listInc+1]\n                listInc+=2\n

Run Code Online (Sandbox Code Playgroud)\n\n

我哪里错了？抱歉，代码有点乏味，我很少接受过 python 培训

\n

Answer 1

jme*_*jme 5

我不确定你的代码有什么问题，但这里有一种可能被认为更“Pythonic”的方法。

我假设您已将数据读入字典中，将名称映射到 DNA 字符串：

{'Rosalind_0442': 'AAATCCC',
 'Rosalind_0498': 'AAATAAA',
 'Rosalind_2323': 'TTTTCCC',
 'Rosalind_2391': 'AAATTTT',
 'Rosalind_5013': 'GGGTGGG'}

Run Code Online (Sandbox Code Playgroud)

我们定义一个简单的函数来检查字符串的后缀是否s1与字符串的前缀k匹配：ks2

def is_k_overlap(s1, s2, k):
    return s1[-k:] == s2[:k]

Run Code Online (Sandbox Code Playgroud)

然后我们查看 DNA 序列的所有组合以找到匹配的组合。这可以通过以下方式轻松实现itertools.combinations：

import itertools
def k_edges(data, k):
    edges = []
    for u,v in itertools.combinations(data, 2):
        u_dna, v_dna = data[u], data[v]

        if is_k_overlap(u_dna, v_dna, k):
            edges.append((u,v))

        if is_k_overlap(v_dna, u_dna, k):
            edges.append((v,u))

    return edges

Run Code Online (Sandbox Code Playgroud)

例如，根据上面的数据我们得到：

>>> k_edges(data, 3)
[('Rosalind_2391', 'Rosalind_2323'),
 ('Rosalind_0498', 'Rosalind_2391'),
 ('Rosalind_0498', 'Rosalind_0442')]

Run Code Online (Sandbox Code Playgroud)

归档时间：	10 年，11 月前
查看次数：	4220 次
最近记录：	10 年，11 月前