use*_*351 1 python string graph-theory
我在罗莎琳德上遇到了一个问题,我认为我已经正确解决了,但我被告知我的答案不正确。该问题可以在这里找到:http://rosalind.info/problems/grph/
\n\n它是基本图论,更具体地说,它涉及返回重叠 DNA 字符串的邻接列表。
\n\n“对于一个字符串集合和一个正整数k,字符串的重叠图是一个有向图Ok,其中每个字符串由一个节点表示,当有长度时,字符串s与字符串t通过有向边连接s 的 k 后缀与 t 的长度 k 前缀匹配,只要 s\xe2\x89\xa0t;我们要求 s\xe2\x89\xa0t 来防止重叠图中出现有向循环(尽管可能存在有向循环)。
\n\n给定:FASTA 格式的 DNA 字符串集合,总长度最多为 10 kbp。
\n\n返回: O3对应的邻接表。您可以按任何顺序返回边。”
\n\n所以,如果你有:
\n\n\n\n\nRosalind_0498\n AAATAAA
\n\nRosalind_2391\n AAATTTT
\n\nRosalind_2323\n TTTTCCC
\n\nRosalind_0442\n AAATCCC
\n\nRosalind_5013\n GGGTGGG
\n
您必须返回:
\n\n罗莎琳德_0498 罗莎琳德_2391
\n\n罗莎琳德_0498 罗莎琳德_0442
\n\n罗莎琳德_2391 罗莎琳德_2323
\n\n解析包含 DNA 字符串的 FASTA 文件后,我的 python 代码如下:
\n\n listTitle = []\n listContent = []\n\n #SPLIT is the parsed list of DNA strings\n\n #here i create two new lists, one (listTitle) containing the four numbers identifying a particular string, and the second (listContent) containing the actual strings (\'>Rosalind_\' has been removed, because it is what I split the file with)\n\n while i < len(SPLIT):\n curr = SPLIT[i]\n title = curr[0:4:1]\n listTitle.append(title)\n content = curr[4::1]\n listContent.append(content)\n i+=1\n\n start = []\n end = []\n\n #now I create two new lists, one containing the first three chars of the string and the second containing the last three chars, a particular string\'s index will be the same in both lists, as well as in the title list\n\n for item in listContent:\n start.append(item[0:3:1])\n end.append(item[len(item)-3:len(item):1])\n\n list = []\n\n #then I iterate through both lists, checking if the suffix and prefix are equal, but not originating from the same string, and append their titles to a last list\n\n p=0\n while p<len(end):\n iterator=0\n while iterator<len(start):\n if p!=iterator:\n if end[p] == start[iterator]:\n one=listTitle[p]\n two=listTitle[iterator]\n list.append(one)\n list.append(two)\n iterator+=1\n p+=1\n\n#finally I print the list in the format that they require for the answer\n\n listInc=0\n\n while listInc < len(list):\n print "Rosalind_"+list[listInc]+\' \'+"Rosalind_"+list[listInc+1]\n listInc+=2\nRun Code Online (Sandbox Code Playgroud)\n\n我哪里错了?抱歉,代码有点乏味,我很少接受过 python 培训
\n我不确定你的代码有什么问题,但这里有一种可能被认为更“Pythonic”的方法。
我假设您已将数据读入字典中,将名称映射到 DNA 字符串:
{'Rosalind_0442': 'AAATCCC',
'Rosalind_0498': 'AAATAAA',
'Rosalind_2323': 'TTTTCCC',
'Rosalind_2391': 'AAATTTT',
'Rosalind_5013': 'GGGTGGG'}
Run Code Online (Sandbox Code Playgroud)
我们定义一个简单的函数来检查字符串的后缀是否s1与字符串的前缀k匹配:ks2
def is_k_overlap(s1, s2, k):
return s1[-k:] == s2[:k]
Run Code Online (Sandbox Code Playgroud)
然后我们查看 DNA 序列的所有组合以找到匹配的组合。这可以通过以下方式轻松实现itertools.combinations:
import itertools
def k_edges(data, k):
edges = []
for u,v in itertools.combinations(data, 2):
u_dna, v_dna = data[u], data[v]
if is_k_overlap(u_dna, v_dna, k):
edges.append((u,v))
if is_k_overlap(v_dna, u_dna, k):
edges.append((v,u))
return edges
Run Code Online (Sandbox Code Playgroud)
例如,根据上面的数据我们得到:
>>> k_edges(data, 3)
[('Rosalind_2391', 'Rosalind_2323'),
('Rosalind_0498', 'Rosalind_2391'),
('Rosalind_0498', 'Rosalind_0442')]
Run Code Online (Sandbox Code Playgroud)
| 归档时间: |
|
| 查看次数: |
4220 次 |
| 最近记录: |