更有效的最短超弦搜索算法

use*_*435 10 python algorithm optimization

下面我的问题是NP-complete,但是,我试图找到至少一个稍微快一点的字符串搜索功能或模块,这可能有助于减少与现在相比的一些计算时间.任何建议,将不胜感激.

连接(尽可能长的)超弦是:

AGGAGTCCGCGTGAGGGAGGTGTAGTGTAGTGG

下面的代码在16m中产生最短的超弦:

CCGTAGGTGGAGT

import itertools as it

def main():
    seqs = ['AGG', 'AGT', 'CCG', 'CGT', 'GAG', 'GGA', 'GGT', 'GTA', 'GTG', 'TAG', 'TGG']
    seq_perms = [''.join(perm) for perm in it.permutations(seqs)]
    for i in range(0, len(''.join(seqs))):
        seq_perms = [''.join(perm)[:i] for perm in it.permutations(seqs)]   
        for perm in seq_perms:   
            if all(perm.find(seq) != -1 for seq in seqs) == True:
                print 'Shortest superstring containing all strings:\n{}'.format(perm)
                return


if __name__ == '__main__':
    main()
Run Code Online (Sandbox Code Playgroud)

在我的系统上以较短的时间完成的任何重构都将被标记为已解决.

@InbarRose代码在我的机器上需要19秒.到目前为止最快.

Alf*_*lfe 48

我应用了Dijkstra算法(宽度搜索)并有一个解决方案,在不到一秒的时间内给出了这个任务的答案.我在内存使用方面对它进行了一些优化,但我认为对于算法来说,这是一种比其他答案更好的方法.除非我们内存不足,否则这应该是更好的解决方案.

from collections import defaultdict

def dijkSuperstring(originalSeqs):
  paths = defaultdict(set)
  paths[0] =  { '' }
  while paths:
    minLength = min(paths.keys())
    while paths[minLength]:
      candidate = paths[minLength].pop()
      seqAdded = False
      for seq in originalSeqs:
        if seq in candidate:
          continue
        seqAdded = True
        for i in reversed(range(len(seq)+1)):
          if candidate.endswith(seq[:i]):
            newCandidate = candidate + seq[i:]
            paths[len(newCandidate)].add(newCandidate)
      if not seqAdded:  # nothing added, so all present?
        return candidate
    del paths[minLength]

print dijkSuperstring(
  [ 'AGG', 'AGT', 'CCG', 'CGT', 'GAG', 'GGA', 'GGT', 'GTA', 'GTG', 'TAG', 'TGG' ])
Run Code Online (Sandbox Code Playgroud)

我也尝试使用随机序列作为输入:

seqs = [ ''.join(random.choice('GATC')
  for i in range(3))
    for j in range(11) ]
print dijkSuperstring(deqs)
Run Code Online (Sandbox Code Playgroud)

我很快发现解决时间很大程度上取决于结果的大小(!)而不是输入的大小(所以它是不可预测的).这并不太令人惊讶,但它使得比较不同算法有点困难,因为其他算法不一定也具有此属性.特别地,来自OP的序列集似乎构成相对轻量级的问题.其他一组11个3个字符的序列更难解决.

所以我做了一些统计测量; 我解决了1000组8个序列.这是我为3和4个字符的序列做的.然后我将持续时间分为100组(从0到最高持续时间等间隔)并计算每组中有多少人.为了使图表平滑,我总是使用三个相邻组的总和.

下面的图表显示了两个这样的实验,使用我的算法的早期(非优化)版本执行(但曲线的形状与现在相同); 我做了两次,至少知道图中的一个奇怪的沟是否有原因,或仅仅是纯粹的机会.

我有兴趣看到其他算法的相同类型输入的类似图表.这可能很有趣,因为我的算法显然存在内存问题.解决11个3个字符的序列由于内存耗尽而使我的机器多次停顿,因此即使它较慢,使用另一种算法也是有意义的.

8个3个字符的序列

8个3个字符的序列

8个4个字符的序列

8个4个字符的序列

  • 我尝试了Alfe的版本和Inbar的版本.他们都找到了相同的解决方案,分别是1s和29s.艾尔,我向你倾诉. (3认同)

Inb*_*ose 8

这应该做到这一点.

import itertools as it

SEQUENCES = ['AGG', 'AGT', 'CCG', 'CGT', 'GAG', 'GGA', 'GGT', 'GTA', 'GTG', 'TAG', 'TGG']
LONGEST_SUPERSTRING = ''.join(SEQUENCES)

def find_shortest_superstring():
    current_shortest = LONGEST_SUPERSTRING
    trim = len(current_shortest)-1
    seen_prefixes = set()
    for perm in it.permutations(SEQUENCES):
        candidate_string = ''.join(perm)[:trim]
        if candidate_string in seen_prefixes:
            continue
        seen_prefixes.add(candidate_string)
        while is_superstring(candidate_string):
            current_shortest = candidate_string
            candidate_string = candidate_string[:-1]
            trim = len(current_shortest)-1
    return current_shortest

def is_superstring(s):
    return all(seq in s for seq in SEQUENCES)

def main():
    print 'Searching for shortest superstring containing all strings.'
    ss = find_shortest_superstring()
    print 'Found shortest superstring containing all strings:\n{}'.format(ss)

if __name__ == '__main__':
    main()
Run Code Online (Sandbox Code Playgroud)

代码运行大约需要15秒,并产生以下输出:

Searching for shortest superstring containing all strings.
Found shortest superstring containing all strings:
CCGTAGGTGGAGT
Run Code Online (Sandbox Code Playgroud)


小智 2

只是回溯,但始终首先检查最重叠的部分。得到一个好的候选答案后,当字符串中当前路径结果的长度大于或等于该候选答案时,我们不需要进一步使用该路径。

\n\n

在我的 Jupyter 笔记本上进行了测试。它似乎比这里的其他两个答案快得多(11/18/2018)

\n\n
def shortestSuperstring(A):\n    """\n    :type A: List[str]\n    :rtype: str\n    """\n\n    if len(A)==1:\n        return A[0]\n    dic={}\n    for i in xrange(len(A)):\n        for j in xrange(len(A)):\n            if i!=j:\n                ol=0\n                for k in xrange(1,min(len(A[i]),len(A[j]))):\n                    if A[j][:k]==A[i][-k:]:\n                        ol=k\n                dic[(i,j)]=ol\n    if max(dic.values())==0:\n        return "".join(A)\n    else:\n        ret="".join(A)\n        l=len(ret)\n        stack=[]\n        for i,wd in enumerate(A):\n            tmp=set(range(len(A)))\n            tmp.remove(i)\n            stack.append((wd,i,tmp))\n        while stack:\n            ans,cur,remain=stack.pop()\n            if len(ans)<l:\n                if not remain:\n                    ret=ans\n                    l=len(ret)\n                else:\n                    tmp=[[dic[cur,idx],idx] for idx in remain] # [#overlap,idx]\n                    tmp.sort()\n                    for ol,idx in tmp:\n                        nans=ans+A[idx][ol:]\n                        nremain=set(remain)\n                        nremain.remove(idx)\n                        stack.append((nans,idx,nremain))\n        return ret\n
Run Code Online (Sandbox Code Playgroud)\n\n

问题中的测试用例为L

\n\n
1.93 s \xc2\xb1 160 ms per loop (mean \xc2\xb1 std. dev. of 7 runs, 1 loop each)\n
Run Code Online (Sandbox Code Playgroud)\n\n

运行并给出答案:

\n\n
\'CCGTGGTAGGAGT\'\n
Run Code Online (Sandbox Code Playgroud)\n\n

其他一些测试用例(较长的字符串并且开始击败其他两种方法,大约 1~5 秒):

\n\n
    ****************************************************************************************************\n\n\n    case: \n\n     [\'mftpvodataplkewcouz\', \'krrgsoxpsnmzlhprsl\', \'qhbfymytxzbmqma\', \'hunjgeaolcuznhpodi\', \'kewcouzbwlftz\', \'xzbmqmahunjgeaolcu\', \'zlhprslqurnqbhsjr\', \'rrgsoxpsnmzlhprslqur\', \'diqukrrgsoxpsnmz\', \'sjrxzavamftpvoda\']\n\n\n    ****************************************************************************************************\n\n\n    ans:  qhbfymytxzbmqmahunjgeaolcuznhpodiqukrrgsoxpsnmzlhprslqurnqbhsjrxzavamftpvodataplkewcouzbwlftz\n\n\n    ****************************************************************************************************\n\n\n    case: \n\n     [\'cedefifgstkyxfcuajfa\', \'ooncedefifgstkyxfcua\', \'assqjfwarvjcjedqtoz\', \'fcuajfassqjfwarvjc\', \'fwarvjcjedqtozctcd\', \'zppedxfumcfsngp\', \'kyxfcuajfassqjfwa\', \'fumcfsngphjyfhhwkqa\', \'fassqjfwarvjcjedq\', \'ppedxfumcfsngphjyf\', \'dqtozctcdk\']\n\n\n    ****************************************************************************************************\n\n\n    ans:  zppedxfumcfsngphjyfhhwkqaooncedefifgstkyxfcuajfassqjfwarvjcjedqtozctcdk\n\n\n    ****************************************************************************************************\n\n\n    case: \n\n     [\'ekpijtseahvmprvefkgn\', \'yyevvcmeekpijtseahvm\', \'vsfcyyevvcmeekp\', \'xwmkoqhxvrovlmmvsfcy\', \'cmeekpijtseahvmpr\', \'oqhxvrovlmmvsfcyy\', \'zpuemtclxbxwsypfxevx\', \'clxbxwsypfxevxw\', \'fkgnjgdvfygnlckyiju\', \'xevxwmkoqhxvrovlmm\']\n\n\n    ****************************************************************************************************\n\n\n    ans:  zpuemtclxbxwsypfxevxwmkoqhxvrovlmmvsfcyyevvcmeekpijtseahvmprvefkgnjgdvfygnlckyiju\n\n\n    ****************************************************************************************************\n\n\n    case: \n\n     [\'ppgortnmsy\', \'czmysoeeyugbiylso\', \'nbfzpppvhbjydtx\', \'rnzynedhoiunkpon\', \'ornzynedhoiunkpo\', \'ylsomoktkyfgljcf\', \'jtvkrornzynedhoiunk\', \'hvhhihwdffmxnczmyso\', \'ktkyfgljcfbkqcpp\', \'nzynedhoiunkponbfz\', \'nedhoiunkponbfzpppvh\']\n\n\n    ****************************************************************************************************\n\n\n    ans:  hvhhihwdffmxnczmysoeeyugbiylsomoktkyfgljcfbkqcppgortnmsyjtvkrornzynedhoiunkponbfzpppvhbjydtx\n\n\n    ****************************************************************************************************\n\n\n    case: \n\n     [\'amefulhsdgvjvoab\', \'giqxpqszaitzfzvtalx\', \'cyqeolfgkihssycmiodg\', \'glhhcfuprwazet\', \'cmiodgiqxpqszaitzf\', \'lhsdgvjvoabdviglhhcf\', \'ssycmiodgiqxpqsza\', \'bxtdqnamefulhsdg\', \'namefulhsdgvjvo\', \'ihssycmiodgiqxp\', \'itzfzvtalxfybxtdqn\']\n\n\n    ****************************************************************************************************\n\n\n    ans:  cyqeolfgkihssycmiodgiqxpqszaitzfzvtalxfybxtdqnamefulhsdgvjvoabdviglhhcfuprwazet\n\n\n    ****************************************************************************************************\n\n\n    case: \n\n     [\'yobbobwqymlordokxka\', \'jllfoebgbsrguls\', \'rgulsnatnpuuwiyba\', \'ordokxkamymamofefr\', \'wqymlordokxkamy\', \'fycxifzsjllfoebgbsrg\', \'lordokxkamymamofe\', \'kxkamymamofefrmfycx\', \'frmfycxifzsjllf\', \'srgulsnatnpuuwiy\']\n\n\n    ****************************************************************************************************\n\n\n    ans:  yobbobwqymlordokxkamymamofefrmfycxifzsjllfoebgbsrgulsnatnpuuwiyba\n\n\n    ****************************************************************************************************\n\n\n    case: \n\n     [\'jnbbbbsczcscxawcze\', \'bsczcscxawczeumyyr\', \'lyofvbhvjmquhkgz\', \'quhkgzyzdwtjnbbb\', \'kgzyzdwtjnbbbbsczc\', \'uouxnfplptpkgnronf\', \'pqgyfqglyofvbhv\', \'kgnronftgswvpqgy\', \'marvhdxtbmkcpnli\', \'qgyfqglyofvbhvjmquhk\', \'xtbmkcpnliz\']\n\n\n    ****************************************************************************************************\n\n\n    ans:  marvhdxtbmkcpnlizuouxnfplptpkgnronftgswvpqgyfqglyofvbhvjmquhkgzyzdwtjnbbbbsczcscxawczeumyyr\n\n\n    ****************************************************************************************************\n\n\n    case: \n\n     [\'qrwpawefqzfjsan\', \'jsanzdukfkdlmyox\', \'neaxnkedjxbpgsyq\', \'nqjvzryhfjdsxmwolwo\', \'hfjdsxmwolwomeeewvi\', \'lmyoxbpvkneaxnkedjxb\', \'qbhpqrwpawefqzfjsa\', \'pawefqzfjsanzdukfk\', \'bqbhpqrwpawefqzfj\', \'dlmyoxbpvkneaxnk\', \'xnkedjxbpgsyqovvh\']\n\n\n    ****************************************************************************************************\n\n\n    ans:  bqbhpqrwpawefqzfjsanzdukfkdlmyoxbpvkneaxnkedjxbpgsyqovvhnqjvzryhfjdsxmwolwomeeewvi\n\n\n    ****************************************************************************************************\n\n\n    case: \n\n     [\'vgrikrnwezryimj\', \'umwgwvzpsfpmctzt\', \'pjourlpgeemdjor\', \'urlpgeemdjorpzbkbz\', \'jorpzbkbzcqyewih\', \'xuwkzvoczozhhvf\', \'ihbumoogibirbsvch\', \'nwezryimjivvpjourlp\', \'kzvoczozhhvfwgeplv\', \'ezryimjivvpjourlpgee\', \'zhhvfwgeplvqngglu\', \'rikrnwezryimjivvp\']\n\n\n    ****************************************************************************************************\n\n\n    ans:  xuwkzvoczozhhvfwgeplvqngglumwgwvzpsfpmctztvgrikrnwezryimjivvpjourlpgeemdjorpzbkbzcqyewihbumoogibirbsvch\n\n\n    ****************************************************************************************************\n\n\n    case: \n\n     [\'nbsgonqmpreelpbr\', \'hnysjajtiguehrokus\', \'udgzbzmevnkzzba\', \'axtbmcpbmoubyoscn\', \'vqnbsgonqmpreel\', \'xvqnbsgonqmpree\', \'ajtiguehrokustktudgz\', \'brgkgihuetpqrhhbhn\', \'dgzbzmevnkzzbaxtbmcp\', \'ehrokustktudgzbzmevn\', \'uetpqrhhbhnysjaj\', \'vnkzzbaxtbmcpbmo\']\n\n\n    ****************************************************************************************************\n\n\n    ans:  xvqnbsgonqmpreelpbrgkgihuetpqrhhbhnysjajtiguehrokustktudgzbzmevnkzzbaxtbmcpbmoubyoscn\n\n\n    ****************************************************************************************************\n\n\n    case: \n\n     [\'orugbsuuxowmhjh\', \'zjyxzmpduthlsioor\', \'qtxocgehmhfqnstl\', \'tlrlcnnrsyryfrywuebq\', \'hozjyxzmpduthlsio\', \'hjhdmnqtxocgehm\', \'mjhzwdudlnbfkjawqacf\', \'hfqnstlrlcnnrsyryfry\', \'yfrywuebqhvwewzmq\', \'zzieemjhzwdudlnbfkj\', \'nnrsyryfrywuebqhvw\', \'acfgaihbhozjyxzmpdut\']\n\n\n    ****************************************************************************************************\n\n\n    ans:  zzieemjhzwdudlnbfkjawqacfgaihbhozjyxzmpduthlsioorugbsuuxowmhjhdmnqtxocgehmhfqnstlrlcnnrsyryfrywuebqhvwewzmq\n\n\n    ****************************************************************************************************\n\n\n    case: \n\n     [\'phuutlgczfspygaljkv\', \'fspygaljkvahvuii\', \'csywjodtnkynkjckq\', \'poyykqyrhbvcwvjl\', \'xijupvzzwphuutlg\', \'aljkvahvuiivqbqrw\', \'vahvuiivqbqrwryd\', \'wjodtnkynkjckqurgu\', \'ecdmbshotqbxjqgbou\', \'hvuiivqbqrwrydgnr\', \'ivqbqrwrydgnrubcsywj\', \'wphuutlgczfspyga\']\n\n\n    ****************************************************************************************************\n\n\n    ans:  ecdmbshotqbxjqgbouxijupvzzwphuutlgczfspygaljkvahvuiivqbqrwrydgnrubcsywjodtnkynkjckqurgupoyykqyrhbvcwvjl\n
Run Code Online (Sandbox Code Playgroud)\n\n

另请参阅动态规划方法:\n https://leetcode.com/problems/find-the-shortest-superstring/solution/

\n