如何找到2个序列之间的重叠,并将其返回

ann*_*nne 11 python algorithm

我是Python新手,已经花了很多时间解决这个问题,希望有人可以帮助我.我需要找到2个序列之间的重叠.重叠位于第一个序列的左端,第二个序列的右端.我希望函数找到重叠,并返回它.

我的序列是:

s1 = "CGATTCCAGGCTCCCCACGGGGTACCCATAACTTGACAGTAGATCTC"
s2 = "GGCTCCCCACGGGGTACCCATAACTTGACAGTAGATCTCGTCCAGACCCCTAGC"
Run Code Online (Sandbox Code Playgroud)

我的功能应该命名

def getOverlap(left, right)
Run Code Online (Sandbox Code Playgroud)

随着s1作为左的顺序,以及s2是正确的.

结果应该是

‘GGCTCCCCACGGGGTACCCATAACTTGACAGTAGATCTC’
Run Code Online (Sandbox Code Playgroud)

任何帮助表示赞赏

mgi*_*son 12

你可以使用difflib.SequenceMatcher:

d = difflib.SequenceMatcher(None,s1,s2)
>>> match = max(d.get_matching_blocks(),key=lambda x:x[2])
>>> match
Match(a=8, b=0, size=39)
>>> i,j,k = match
>>> d.a[i:i+k]
'GGCTCCCCACGGGGTACCCATAACTTGACAGTAGATCTC'
>>> d.a[i:i+k] == d.b[j:j+k]
True
>>> d.a == s1
True
>>> d.b == s2
True
Run Code Online (Sandbox Code Playgroud)


Nic*_*las 11

看看difflib图书馆,更确切地说是find_longest_match():

import difflib

def get_overlap(s1, s2):
    s = difflib.SequenceMatcher(None, s1, s2)
    pos_a, pos_b, size = s.find_longest_match(0, len(s1), 0, len(s2)) 
    return s1[pos_a:pos_a+size]

s1 = "CGATTCCAGGCTCCCCACGGGGTACCCATAACTTGACAGTAGATCTC"
s2 = "GGCTCCCCACGGGGTACCCATAACTTGACAGTAGATCTCGTCCAGACCCCTAGC"

print(get_overlap(s1, s2)) # GGCTCCCCACGGGGTACCCATAACTTGACAGTAGATCTC
Run Code Online (Sandbox Code Playgroud)


mun*_*unk 5

最长公共子串

def LongestCommonSubstring(S1, S2):
  M = [[0]*(1+len(S2)) for i in xrange(1+len(S1))]
  longest, x_longest = 0, 0
  for x in xrange(1,1+len(S1)):
    for y in xrange(1,1+len(S2)):
        if S1[x-1] == S2[y-1]:
            M[x][y] = M[x-1][y-1] + 1
            if M[x][y]>longest:
                longest = M[x][y]
                x_longest  = x
        else:
            M[x][y] = 0
  return S1[x_longest-longest: x_longest]
Run Code Online (Sandbox Code Playgroud)


use*_*ser 5

克努特莫里斯普拉特算法是寻找内另一个字符串一个很好的方法(因为我看到了DNA,我猜你会喜欢这个扩展到...亿?).

# Knuth-Morris-Pratt string matching
# David Eppstein, UC Irvine, 1 Mar 2002

from __future__ import generators

def KnuthMorrisPratt(text, pattern):

    '''Yields all starting positions of copies of the pattern in the text.
Calling conventions are similar to string.find, but its arguments can be
lists or iterators, not just strings, it returns all matches, not just
the first one, and it does not need the whole text in memory at once.
Whenever it yields, it will have read the text exactly up to and including
the match that caused the yield.'''

    # allow indexing into pattern and protect against change during yield
    pattern = list(pattern)

    # build table of shift amounts
    shifts = [1] * (len(pattern) + 1)
    shift = 1
    for pos in range(len(pattern)):
        while shift <= pos and pattern[pos] != pattern[pos-shift]:
            shift += shifts[pos-shift]
        shifts[pos+1] = shift

    # do the actual search
    startPos = 0
    matchLen = 0
    for c in text:
        while matchLen == len(pattern) or \
              matchLen >= 0 and pattern[matchLen] != c:
            startPos += shifts[matchLen]
            matchLen -= shifts[matchLen]
        matchLen += 1
        if matchLen == len(pattern):
            yield startPos
Run Code Online (Sandbox Code Playgroud)

我获得KMP python代码的链接(以及内置,由于运行时常量,这对于小问题会更快).

对于前沿性能,使用字符串的前缀表和哈希窗口作为基本4整数(在生物学中,您将其称为k-mers或oligos).; )

祝好运!

编辑:还有一个很好的技巧,你可以对第一个字符串中的每个前缀(总共n)和第二个字符串中的每个前缀(n个总和)进行排序.如果它们共享最大的公共子序列,则它们必须在排序列表中相邻,因此从排序列表中最接近的另一个字符串中查找该元素,然后获取完全匹配的最长前缀.:)