查找由两个字符串共享的所有n字长子字符串的最大长度

Question

查找由两个字符串共享的所有n字长子字符串的最大长度

duh*_*ime 6 python string algorithm substring pattern-matching

我正在努力生成一个Python脚本,它可以找到由两个字符串共享的所有n字长子字符串的(尽可能长)长度,忽略尾随标点符号.鉴于两个字符串:

"这是一个示例字符串"

"这也是一个示例字符串"

我希望脚本能够识别这些字符串是否有一个共同的2个单词的序列("this is"),然后是一个共同的3个单词的序列("一个样本字符串").这是我目前的做法:

a = "this is a sample string"
b = "this is also a sample string"

aWords = a.split()
bWords = b.split()

#create counters to keep track of position in string
currentA = 0
currentB = 0

#create counter to keep track of longest sequence of matching words
matchStreak = 0

#create a list that contains all of the matchstreaks found
matchStreakList = []

#create binary switch to control the use of while loop
continueWhileLoop = 1

for word in aWords:
    currentA += 1

    if word == bWords[currentB]:
        matchStreak += 1

        #to avoid index errors, check to make sure we can move forward one unit in the b string before doing so
        if currentB + 1 < len(bWords):
            currentB += 1

        #in case we have two identical strings, check to see if we're at the end of string a. If we are, append value of match streak to list of match streaks
        if currentA == len(aWords):
            matchStreakList.append(matchStreak)

    elif word != bWords[currentB]:

        #because the streak is broken, check to see if the streak is >= 1. If it is, append the streak counter to out list of streaks and then reset the counter
        if matchStreak >= 1:
            matchStreakList.append(matchStreak)
        matchStreak = 0

        while word != bWords[currentB]:

            #the two words don't match. If you can move b forward one word, do so, then check for another match
            if currentB + 1 < len(bWords):
                currentB += 1

            #if you have advanced b all the way to the end of string b, then rewind to the beginning of string b and advance a, looking for more matches
            elif currentB + 1 == len(bWords):
                currentB = 0
                break

        if word == bWords[currentB]:
            matchStreak += 1

            #now that you have a match, check to see if you can advance b. If you can, do so. Else, rewind b to the beginning
            if currentB + 1 < len(bWords):
                currentB += 1
            elif currentB + 1 == len(bWords):

                #we're at the end of string b. If we are also at the end of string a, check to see if the value of matchStreak >= 1. If so, add matchStreak to matchStreakList
                if currentA == len(aWords):
                    matchStreakList.append(matchStreak)
                currentB = 0
                break

print matchStreakList

Run Code Online (Sandbox Code Playgroud)

这个脚本正确地输出了公共字长子串(2,3)的(最大)长度,并且到目前为止对所有测试都这样做了.我的问题是:是否有一对两个字符串,上面的方法不起作用？更重要的是:是否存在可用于查找两个字符串共享的所有n字长子串的最大长度的现有Python库或众所周知的方法？

[这个问题不同于最常见的子串问题,这只是我正在寻找的特殊情况(因为我想找到所有常见的子串,而不仅仅是最长的子串).这篇SO帖子提出了诸如1)聚类分析,2)编辑距离例程和3)最长公共序列算法等方法可能是合适的方法,但我没有找到任何有效的解决方案,我的问题可能稍微容易一些在链接中提到因为我正在处理由空格限定的单词.

编辑:

我正在开始对这个问题给予赏金.如果它会帮助其他人,我想澄清一些快速点.首先,下面由@DhruvPathak提出的有用答案并未找到由两个字符串共享的所有最大长度的n字长子字符串.例如,假设我们分析的两个字符串是:

"当他们第一次出生的时候,他们都是白色的一张一尘不染的纸,但是他们要被潦草地写在每根鹅毛笔上"

和

"当你第一次出生的时候,你们都是白色的,一张可爱的,一尘不染的纸;但是你会被每只鹅的羽毛笔潦草地涂上"

在这种情况下,最大长n字长子串的列表(忽略尾随标点符号)是:

all
are
white a sheet of
spotless paper when
first are born but
are to be scrawled
and blotted by every

Run Code Online (Sandbox Code Playgroud)

使用以下例程:

#import required packages
import difflib

#define function we'll use to identify matches
def matches(first_string,second_string):
    s = difflib.SequenceMatcher(None, first_string,second_string)
    match = [first_string[i:i+n] for i, j, n in s.get_matching_blocks() if n > 0]
    return match

a = "They all are white a sheet of spotless paper when they first are born but they are to be scrawled upon and blotted by every goose quill"
b = "You are all white, a sheet of lovely, spotless paper, when you first are born; but you are to be scrawled and blotted by every goose's quill"

a = a.replace(",", "").replace(":","").replace("!","").replace("'","").replace(";","").lower()
b = b.replace(",", "").replace(":","").replace("!","").replace("'","").replace(";","").lower()

print matches(a,b)

Run Code Online (Sandbox Code Playgroud)

一个输出:

['e', ' all', ' white a sheet of', ' spotless paper when ', 'y', ' first are born but ', 'y', ' are to be scrawled', ' and blotted by every goose', ' quill']

Run Code Online (Sandbox Code Playgroud)

首先,我不确定如何从这个列表中选择只包含整个单词的子串.在第二位,该列表不包括"是",这是所需的最大长公共n字长子串之一.有没有一种方法可以找到这两个字符串共享的所有最长n字长子串("你都是......"和"它们都是......")？

Answer 1

Tim*_*ers 5

这里仍然含糊不清,我不想花时间争论它们.但我认为无论如何我都可以添加一些有用的东西;-)

我编写了Python difflib.SequenceMatcher,并花了很多时间找到预期的快速方法来找到最长的公共子串.从理论上讲,应该使用"后缀树"或相关的"后缀数组"来增加"最长公共前缀数组"(引号中的短语是搜索术语,如果你想谷歌更多).那些可以在最坏情况线性时间内解决问题.但是,正如有时的情况下,最坏情况下的线性时间的算法是极度复杂和微妙,并受到大的恒定因素-他们仍然可以还清巨大的,如果给定的语料库是要被搜索很多次,但是这不是Python的典型案例difflib,它看起来也不像你的情况.

总之,我在这里的贡献是重写SequenceMatcher的find_longest_match()方法返回所有的(本地),它沿途发现的最大匹配.笔记:

我将使用to_words()Raymond Hettinger给你的功能,但没有转换为小写.转换为小写会导致输出与您所说的不完全相同.
然而,正如我在评论中已经指出的那样,这确实会输出"quill",这不在您想要的输出列表中.我不知道它为什么不是,因为"quill"确实出现在两个输入中.

这是代码:

import re
def to_words(text):
    'Break text into a list of words without punctuation'
    return re.findall(r"[a-zA-Z']+", text)

def match(a, b):
    # Make b the longer list.
    if len(a) > len(b):
        a, b = b, a
    # Map each word of b to a list of indices it occupies.
    b2j = {}
    for j, word in enumerate(b):
        b2j.setdefault(word, []).append(j)
    j2len = {}
    nothing = []
    unique = set() # set of all results
    def local_max_at_j(j):
        # maximum match ends with b[j], with length j2len[j]
        length = j2len[j]
        unique.add(" ".join(b[j-length+1: j+1]))
    # during an iteration of the loop, j2len[j] = length of longest
    # match ending with b[j] and the previous word in a
    for word in a:
        # look at all instances of word in b
        j2lenget = j2len.get
        newj2len = {}
        for j in b2j.get(word, nothing):
            newj2len[j] = j2lenget(j-1, 0) + 1
        # which indices have not been extended?  those are
        # (local) maximums
        for j in j2len:
            if j+1 not in newj2len:
                local_max_at_j(j)
        j2len = newj2len
    # and we may also have local maximums ending at the last word
    for j in j2len:
        local_max_at_j(j)
    return unique

Run Code Online (Sandbox Code Playgroud)

然后:

a = "They all are white a sheet of spotless paper " \
    "when they first are born but they are to be " \
    "scrawled upon and blotted by every goose quill"
b = "You are all white, a sheet of lovely, spotless " \
    "paper, when you first are born; but you are to " \
    "be scrawled and blotted by every goose's quill"

print match(to_words(a), to_words(b))

Run Code Online (Sandbox Code Playgroud)

显示:

set(['all',
     'and blotted by every',
     'first are born but',
     'are to be scrawled',
     'are',
     'spotless paper when',
     'white a sheet of',
     'quill'])

Run Code Online (Sandbox Code Playgroud)

编辑 - 它是如何工作的

许多序列匹配和对齐算法最好被理解为在二维矩阵上工作,具有用于计算矩阵条目并随后解释条目含义的规则.

对于输入序列,a并b绘制M具有len(a)行和len(b)列的矩阵.在这个应用程序中,我们希望M[i, j]包含以a[i]和结尾的最长公共连续子序列的长度b[j],并且计算规则非常简单:

M[i, j] = 0如果a[i] != b[j].
M[i, j] = M[i-1, j-1] + 1if a[i] == b[j](我们假设一个越界矩阵引用静默返回0).

在这种情况下,解释也很容易:有一个局部最大的非空匹配结束于a[i]和b[j],长度M[i, j],当且仅当它M[i, j]是非零但是M[i+1, j+1]0或超出界限时.

您可以使用这些规则编写非常简单和紧凑的代码,并使用两个循环来M正确计算此问题.缺点是代码将采用(最佳,平均和最差情况)O(len(a) * len(b))时间和空间.

虽然起初可能令人困惑,但我发布的代码正是如上所述.连接是模糊的,因为对于预期的情况,代码在几个方面进行了大量优化:

而不是做一次计算M,然后另一次解释结果,计算和解释的传递在一次传递中交错a.
因此,不需要存储整个矩阵.而是仅同时存在当前行(newj2len)和前一行(j2len).
并且因为此问题中的矩阵通常大多为零,所以这里的行通过dict将列索引映射到非零值来稀疏地表示.零条目是"免费的",因为它们永远不会被明确存储.
当处理一排,没有必要遍历每个列:预先计算的b2j字典告诉我们,正是当前行中有趣的列索引(那些符合当前这列word从a).
最后,部分是偶然的,所有先前的优化都以这样的方式合谋,即从来不需要知道当前行的索引,因此我们也不必费心计算.

编辑 - 污垢简单版本

这是直接实现2D矩阵的代码,没有尝试进行优化(除此之外,Counter通常可以避免显式存储0个条目).它非常简单,简单,容易:

def match(a, b):
    from collections import Counter
    M = Counter()
    for i in range(len(a)):
        for j in range(len(b)):
            if a[i] == b[j]:
                M[i, j] = M[i-1, j-1] + 1
    unique = set()
    for i in range(len(a)):
        for j in range(len(b)):
            if M[i, j] and not M[i+1, j+1]:
                length = M[i, j]
                unique.add(" ".join(a[i+1-length: i+1]))
    return unique

Run Code Online (Sandbox Code Playgroud)

当然;-)返回与match()我最初发布的优化结果相同的结果.

编辑 - 另一个没有字典

只是为了好玩:-)如果你有矩阵模型,这个代码将很容易遵循.关于这个特殊问题的一个值得注意的事情是矩阵单元的值仅取决于沿着对角线到单元格西北方向的值.所以它"足够好"只是为了遍历所有主要对角线,从西边界和北边界的所有细胞向东南方向前进.这样,无论输入的长度如何,都只需要很小的恒定空间:

def match(a, b):
    from itertools import chain
    m, n = len(a), len(b)
    unique = set()
    for i, j in chain(((i, 0) for i in xrange(m)),
                      ((0, j) for j in xrange(1, n))):
        k = 0
        while i < m and j < n:
            if a[i] == b[j]:
                k += 1
            elif k:
                unique.add(" ".join(a[i-k: i]))
                k = 0
            i += 1
            j += 1
        if k:
            unique.add(" ".join(a[i-k: i]))
    return unique

Run Code Online (Sandbox Code Playgroud)

归档时间：	12 年，1 月前
查看次数：	2166 次
最近记录：	12 年前