use*_*435 10 python algorithm optimization
下面我的问题是NP-complete,但是,我试图找到至少一个稍微快一点的字符串搜索功能或模块,这可能有助于减少与现在相比的一些计算时间.任何建议,将不胜感激.
连接(尽可能长的)超弦是:
AGGAGTCCGCGTGAGGGAGGTGTAGTGTAGTGG
下面的代码在16m中产生最短的超弦:
CCGTAGGTGGAGT
import itertools as it
def main():
seqs = ['AGG', 'AGT', 'CCG', 'CGT', 'GAG', 'GGA', 'GGT', 'GTA', 'GTG', 'TAG', 'TGG']
seq_perms = [''.join(perm) for perm in it.permutations(seqs)]
for i in range(0, len(''.join(seqs))):
seq_perms = [''.join(perm)[:i] for perm in it.permutations(seqs)]
for perm in seq_perms:
if all(perm.find(seq) != -1 for seq in seqs) == True:
print 'Shortest superstring containing all strings:\n{}'.format(perm)
return
if __name__ == '__main__':
main()
Run Code Online (Sandbox Code Playgroud)
在我的系统上以较短的时间完成的任何重构都将被标记为已解决.
@InbarRose代码在我的机器上需要19秒.到目前为止最快.
Alf*_*lfe 48
我应用了Dijkstra算法(宽度搜索)并有一个解决方案,在不到一秒的时间内给出了这个任务的答案.我在内存使用方面对它进行了一些优化,但我认为对于算法来说,这是一种比其他答案更好的方法.除非我们内存不足,否则这应该是更好的解决方案.
from collections import defaultdict
def dijkSuperstring(originalSeqs):
paths = defaultdict(set)
paths[0] = { '' }
while paths:
minLength = min(paths.keys())
while paths[minLength]:
candidate = paths[minLength].pop()
seqAdded = False
for seq in originalSeqs:
if seq in candidate:
continue
seqAdded = True
for i in reversed(range(len(seq)+1)):
if candidate.endswith(seq[:i]):
newCandidate = candidate + seq[i:]
paths[len(newCandidate)].add(newCandidate)
if not seqAdded: # nothing added, so all present?
return candidate
del paths[minLength]
print dijkSuperstring(
[ 'AGG', 'AGT', 'CCG', 'CGT', 'GAG', 'GGA', 'GGT', 'GTA', 'GTG', 'TAG', 'TGG' ])
Run Code Online (Sandbox Code Playgroud)
我也尝试使用随机序列作为输入:
seqs = [ ''.join(random.choice('GATC')
for i in range(3))
for j in range(11) ]
print dijkSuperstring(deqs)
Run Code Online (Sandbox Code Playgroud)
我很快发现解决时间很大程度上取决于结果的大小(!)而不是输入的大小(所以它是不可预测的).这并不太令人惊讶,但它使得比较不同算法有点困难,因为其他算法不一定也具有此属性.特别地,来自OP的序列集似乎构成相对轻量级的问题.其他一组11个3个字符的序列更难解决.
所以我做了一些统计测量; 我解决了1000组8个序列.这是我为3和4个字符的序列做的.然后我将持续时间分为100组(从0到最高持续时间等间隔)并计算每组中有多少人.为了使图表平滑,我总是使用三个相邻组的总和.
下面的图表显示了两个这样的实验,使用我的算法的早期(非优化)版本执行(但曲线的形状与现在相同); 我做了两次,至少知道图中的一个奇怪的沟是否有原因,或仅仅是纯粹的机会.
我有兴趣看到其他算法的相同类型输入的类似图表.这可能很有趣,因为我的算法显然存在内存问题.解决11个3个字符的序列由于内存耗尽而使我的机器多次停顿,因此即使它较慢,使用另一种算法也是有意义的.


这应该做到这一点.
import itertools as it
SEQUENCES = ['AGG', 'AGT', 'CCG', 'CGT', 'GAG', 'GGA', 'GGT', 'GTA', 'GTG', 'TAG', 'TGG']
LONGEST_SUPERSTRING = ''.join(SEQUENCES)
def find_shortest_superstring():
current_shortest = LONGEST_SUPERSTRING
trim = len(current_shortest)-1
seen_prefixes = set()
for perm in it.permutations(SEQUENCES):
candidate_string = ''.join(perm)[:trim]
if candidate_string in seen_prefixes:
continue
seen_prefixes.add(candidate_string)
while is_superstring(candidate_string):
current_shortest = candidate_string
candidate_string = candidate_string[:-1]
trim = len(current_shortest)-1
return current_shortest
def is_superstring(s):
return all(seq in s for seq in SEQUENCES)
def main():
print 'Searching for shortest superstring containing all strings.'
ss = find_shortest_superstring()
print 'Found shortest superstring containing all strings:\n{}'.format(ss)
if __name__ == '__main__':
main()
Run Code Online (Sandbox Code Playgroud)
代码运行大约需要15秒,并产生以下输出:
Searching for shortest superstring containing all strings.
Found shortest superstring containing all strings:
CCGTAGGTGGAGT
Run Code Online (Sandbox Code Playgroud)
小智 2
只是回溯,但始终首先检查最重叠的部分。得到一个好的候选答案后,当字符串中当前路径结果的长度大于或等于该候选答案时,我们不需要进一步使用该路径。
\n\n在我的 Jupyter 笔记本上进行了测试。它似乎比这里的其他两个答案快得多(11/18/2018)
\n\ndef shortestSuperstring(A):\n """\n :type A: List[str]\n :rtype: str\n """\n\n if len(A)==1:\n return A[0]\n dic={}\n for i in xrange(len(A)):\n for j in xrange(len(A)):\n if i!=j:\n ol=0\n for k in xrange(1,min(len(A[i]),len(A[j]))):\n if A[j][:k]==A[i][-k:]:\n ol=k\n dic[(i,j)]=ol\n if max(dic.values())==0:\n return "".join(A)\n else:\n ret="".join(A)\n l=len(ret)\n stack=[]\n for i,wd in enumerate(A):\n tmp=set(range(len(A)))\n tmp.remove(i)\n stack.append((wd,i,tmp))\n while stack:\n ans,cur,remain=stack.pop()\n if len(ans)<l:\n if not remain:\n ret=ans\n l=len(ret)\n else:\n tmp=[[dic[cur,idx],idx] for idx in remain] # [#overlap,idx]\n tmp.sort()\n for ol,idx in tmp:\n nans=ans+A[idx][ol:]\n nremain=set(remain)\n nremain.remove(idx)\n stack.append((nans,idx,nremain))\n return ret\nRun Code Online (Sandbox Code Playgroud)\n\n问题中的测试用例为L
\n\n1.93 s \xc2\xb1 160 ms per loop (mean \xc2\xb1 std. dev. of 7 runs, 1 loop each)\nRun Code Online (Sandbox Code Playgroud)\n\n运行并给出答案:
\n\n\'CCGTGGTAGGAGT\'\nRun Code Online (Sandbox Code Playgroud)\n\n其他一些测试用例(较长的字符串并且开始击败其他两种方法,大约 1~5 秒):
\n\n ****************************************************************************************************\n\n\n case: \n\n [\'mftpvodataplkewcouz\', \'krrgsoxpsnmzlhprsl\', \'qhbfymytxzbmqma\', \'hunjgeaolcuznhpodi\', \'kewcouzbwlftz\', \'xzbmqmahunjgeaolcu\', \'zlhprslqurnqbhsjr\', \'rrgsoxpsnmzlhprslqur\', \'diqukrrgsoxpsnmz\', \'sjrxzavamftpvoda\']\n\n\n ****************************************************************************************************\n\n\n ans: qhbfymytxzbmqmahunjgeaolcuznhpodiqukrrgsoxpsnmzlhprslqurnqbhsjrxzavamftpvodataplkewcouzbwlftz\n\n\n ****************************************************************************************************\n\n\n case: \n\n [\'cedefifgstkyxfcuajfa\', \'ooncedefifgstkyxfcua\', \'assqjfwarvjcjedqtoz\', \'fcuajfassqjfwarvjc\', \'fwarvjcjedqtozctcd\', \'zppedxfumcfsngp\', \'kyxfcuajfassqjfwa\', \'fumcfsngphjyfhhwkqa\', \'fassqjfwarvjcjedq\', \'ppedxfumcfsngphjyf\', \'dqtozctcdk\']\n\n\n ****************************************************************************************************\n\n\n ans: zppedxfumcfsngphjyfhhwkqaooncedefifgstkyxfcuajfassqjfwarvjcjedqtozctcdk\n\n\n ****************************************************************************************************\n\n\n case: \n\n [\'ekpijtseahvmprvefkgn\', \'yyevvcmeekpijtseahvm\', \'vsfcyyevvcmeekp\', \'xwmkoqhxvrovlmmvsfcy\', \'cmeekpijtseahvmpr\', \'oqhxvrovlmmvsfcyy\', \'zpuemtclxbxwsypfxevx\', \'clxbxwsypfxevxw\', \'fkgnjgdvfygnlckyiju\', \'xevxwmkoqhxvrovlmm\']\n\n\n ****************************************************************************************************\n\n\n ans: zpuemtclxbxwsypfxevxwmkoqhxvrovlmmvsfcyyevvcmeekpijtseahvmprvefkgnjgdvfygnlckyiju\n\n\n ****************************************************************************************************\n\n\n case: \n\n [\'ppgortnmsy\', \'czmysoeeyugbiylso\', \'nbfzpppvhbjydtx\', \'rnzynedhoiunkpon\', \'ornzynedhoiunkpo\', \'ylsomoktkyfgljcf\', \'jtvkrornzynedhoiunk\', \'hvhhihwdffmxnczmyso\', \'ktkyfgljcfbkqcpp\', \'nzynedhoiunkponbfz\', \'nedhoiunkponbfzpppvh\']\n\n\n ****************************************************************************************************\n\n\n ans: hvhhihwdffmxnczmysoeeyugbiylsomoktkyfgljcfbkqcppgortnmsyjtvkrornzynedhoiunkponbfzpppvhbjydtx\n\n\n ****************************************************************************************************\n\n\n case: \n\n [\'amefulhsdgvjvoab\', \'giqxpqszaitzfzvtalx\', \'cyqeolfgkihssycmiodg\', \'glhhcfuprwazet\', \'cmiodgiqxpqszaitzf\', \'lhsdgvjvoabdviglhhcf\', \'ssycmiodgiqxpqsza\', \'bxtdqnamefulhsdg\', \'namefulhsdgvjvo\', \'ihssycmiodgiqxp\', \'itzfzvtalxfybxtdqn\']\n\n\n ****************************************************************************************************\n\n\n ans: cyqeolfgkihssycmiodgiqxpqszaitzfzvtalxfybxtdqnamefulhsdgvjvoabdviglhhcfuprwazet\n\n\n ****************************************************************************************************\n\n\n case: \n\n [\'yobbobwqymlordokxka\', \'jllfoebgbsrguls\', \'rgulsnatnpuuwiyba\', \'ordokxkamymamofefr\', \'wqymlordokxkamy\', \'fycxifzsjllfoebgbsrg\', \'lordokxkamymamofe\', \'kxkamymamofefrmfycx\', \'frmfycxifzsjllf\', \'srgulsnatnpuuwiy\']\n\n\n ****************************************************************************************************\n\n\n ans: yobbobwqymlordokxkamymamofefrmfycxifzsjllfoebgbsrgulsnatnpuuwiyba\n\n\n ****************************************************************************************************\n\n\n case: \n\n [\'jnbbbbsczcscxawcze\', \'bsczcscxawczeumyyr\', \'lyofvbhvjmquhkgz\', \'quhkgzyzdwtjnbbb\', \'kgzyzdwtjnbbbbsczc\', \'uouxnfplptpkgnronf\', \'pqgyfqglyofvbhv\', \'kgnronftgswvpqgy\', \'marvhdxtbmkcpnli\', \'qgyfqglyofvbhvjmquhk\', \'xtbmkcpnliz\']\n\n\n ****************************************************************************************************\n\n\n ans: marvhdxtbmkcpnlizuouxnfplptpkgnronftgswvpqgyfqglyofvbhvjmquhkgzyzdwtjnbbbbsczcscxawczeumyyr\n\n\n ****************************************************************************************************\n\n\n case: \n\n [\'qrwpawefqzfjsan\', \'jsanzdukfkdlmyox\', \'neaxnkedjxbpgsyq\', \'nqjvzryhfjdsxmwolwo\', \'hfjdsxmwolwomeeewvi\', \'lmyoxbpvkneaxnkedjxb\', \'qbhpqrwpawefqzfjsa\', \'pawefqzfjsanzdukfk\', \'bqbhpqrwpawefqzfj\', \'dlmyoxbpvkneaxnk\', \'xnkedjxbpgsyqovvh\']\n\n\n ****************************************************************************************************\n\n\n ans: bqbhpqrwpawefqzfjsanzdukfkdlmyoxbpvkneaxnkedjxbpgsyqovvhnqjvzryhfjdsxmwolwomeeewvi\n\n\n ****************************************************************************************************\n\n\n case: \n\n [\'vgrikrnwezryimj\', \'umwgwvzpsfpmctzt\', \'pjourlpgeemdjor\', \'urlpgeemdjorpzbkbz\', \'jorpzbkbzcqyewih\', \'xuwkzvoczozhhvf\', \'ihbumoogibirbsvch\', \'nwezryimjivvpjourlp\', \'kzvoczozhhvfwgeplv\', \'ezryimjivvpjourlpgee\', \'zhhvfwgeplvqngglu\', \'rikrnwezryimjivvp\']\n\n\n ****************************************************************************************************\n\n\n ans: xuwkzvoczozhhvfwgeplvqngglumwgwvzpsfpmctztvgrikrnwezryimjivvpjourlpgeemdjorpzbkbzcqyewihbumoogibirbsvch\n\n\n ****************************************************************************************************\n\n\n case: \n\n [\'nbsgonqmpreelpbr\', \'hnysjajtiguehrokus\', \'udgzbzmevnkzzba\', \'axtbmcpbmoubyoscn\', \'vqnbsgonqmpreel\', \'xvqnbsgonqmpree\', \'ajtiguehrokustktudgz\', \'brgkgihuetpqrhhbhn\', \'dgzbzmevnkzzbaxtbmcp\', \'ehrokustktudgzbzmevn\', \'uetpqrhhbhnysjaj\', \'vnkzzbaxtbmcpbmo\']\n\n\n ****************************************************************************************************\n\n\n ans: xvqnbsgonqmpreelpbrgkgihuetpqrhhbhnysjajtiguehrokustktudgzbzmevnkzzbaxtbmcpbmoubyoscn\n\n\n ****************************************************************************************************\n\n\n case: \n\n [\'orugbsuuxowmhjh\', \'zjyxzmpduthlsioor\', \'qtxocgehmhfqnstl\', \'tlrlcnnrsyryfrywuebq\', \'hozjyxzmpduthlsio\', \'hjhdmnqtxocgehm\', \'mjhzwdudlnbfkjawqacf\', \'hfqnstlrlcnnrsyryfry\', \'yfrywuebqhvwewzmq\', \'zzieemjhzwdudlnbfkj\', \'nnrsyryfrywuebqhvw\', \'acfgaihbhozjyxzmpdut\']\n\n\n ****************************************************************************************************\n\n\n ans: zzieemjhzwdudlnbfkjawqacfgaihbhozjyxzmpduthlsioorugbsuuxowmhjhdmnqtxocgehmhfqnstlrlcnnrsyryfrywuebqhvwewzmq\n\n\n ****************************************************************************************************\n\n\n case: \n\n [\'phuutlgczfspygaljkv\', \'fspygaljkvahvuii\', \'csywjodtnkynkjckq\', \'poyykqyrhbvcwvjl\', \'xijupvzzwphuutlg\', \'aljkvahvuiivqbqrw\', \'vahvuiivqbqrwryd\', \'wjodtnkynkjckqurgu\', \'ecdmbshotqbxjqgbou\', \'hvuiivqbqrwrydgnr\', \'ivqbqrwrydgnrubcsywj\', \'wphuutlgczfspyga\']\n\n\n ****************************************************************************************************\n\n\n ans: ecdmbshotqbxjqgbouxijupvzzwphuutlgczfspygaljkvahvuiivqbqrwrydgnrubcsywjodtnkynkjckqurgupoyykqyrhbvcwvjl\nRun Code Online (Sandbox Code Playgroud)\n\n另请参阅动态规划方法:\n https://leetcode.com/problems/find-the-shortest-superstring/solution/
\n