找到Python最长重复字符串的有效方法(From Programming Pearls)

Han*_*Sun 11 c python suffix-tree suffix-array programming-pearls

摘自编程珍珠第15.2节

可在此处查看C代码:http://www.cs.bell-labs.com/cm/cs/pearls/longdup.c

当我使用suffix-array在Python中实现它时:

example = open("iliad10.txt").read()
def comlen(p, q):
    i = 0
    for x in zip(p, q):
        if x[0] == x[1]:
            i += 1
        else:
            break
    return i

suffix_list = []
example_len = len(example)
idx = list(range(example_len))
idx.sort(cmp = lambda a, b: cmp(example[a:], example[b:]))  #VERY VERY SLOW

max_len = -1
for i in range(example_len - 1):
    this_len = comlen(example[idx[i]:], example[idx[i+1]:])
    print this_len
    if this_len > max_len:
        max_len = this_len
        maxi = i
Run Code Online (Sandbox Code Playgroud)

我发现这idx.sort一步很慢.我认为它很慢,因为Python需要通过值而不是指针传递子串(如上面的C代码).

测试文件可以从这里下载

C代码只需0.3秒即可完成.

time cat iliad10.txt |./longdup 
On this the rest of the Achaeans with one voice were for
respecting the priest and taking the ransom that he offered; but
not so Agamemnon, who spoke fiercely to him and sent him roughly
away. 

real    0m0.328s
user    0m0.291s
sys 0m0.006s
Run Code Online (Sandbox Code Playgroud)

但是对于Python代码,它永远不会在我的计算机上结束(我等了10分钟并将其杀死)

有没有人有想法如何使代码有效?(例如,不到10秒)

hyn*_*cer 13

我的解决方案基于Suffix阵列.它由最长公共前缀前缀加倍构成.最坏情况的复杂度是O(n(log n)^ 2).我的笔记本电脑上的任务"iliad.mb.txt"需要4秒钟.该代码是很好的记录里面的功能和.后一种功能很短,可以很容易地修改,例如用于搜索10个最长的非重叠子串. 如果重复的字符串超过10000个字符,则此Python代码比问题中的原始C代码(此处复制)更快.suffix_arraylongest_common_substring

from itertools import groupby
from operator import itemgetter

def longest_common_substring(text):
    """Get the longest common substrings and their positions.
    >>> longest_common_substring('banana')
    {'ana': [1, 3]}
    >>> text = "not so Agamemnon, who spoke fiercely to "
    >>> sorted(longest_common_substring(text).items())
    [(' s', [3, 21]), ('no', [0, 13]), ('o ', [5, 20, 38])]

    This function can be easy modified for any criteria, e.g. for searching ten
    longest non overlapping repeated substrings.
    """
    sa, rsa, lcp = suffix_array(text)
    maxlen = max(lcp)
    result = {}
    for i in range(1, len(text)):
        if lcp[i] == maxlen:
            j1, j2, h = sa[i - 1], sa[i], lcp[i]
            assert text[j1:j1 + h] == text[j2:j2 + h]
            substring = text[j1:j1 + h]
            if not substring in result:
                result[substring] = [j1]
            result[substring].append(j2)
    return dict((k, sorted(v)) for k, v in result.items())

def suffix_array(text, _step=16):
    """Analyze all common strings in the text.

    Short substrings of the length _step a are first pre-sorted. The are the 
    results repeatedly merged so that the garanteed number of compared
    characters bytes is doubled in every iteration until all substrings are
    sorted exactly.

    Arguments:
        text:  The text to be analyzed.
        _step: Is only for optimization and testing. It is the optimal length
               of substrings used for initial pre-sorting. The bigger value is
               faster if there is enough memory. Memory requirements are
               approximately (estimate for 32 bit Python 3.3):
                   len(text) * (29 + (_size + 20 if _size > 2 else 0)) + 1MB

    Return value:      (tuple)
      (sa, rsa, lcp)
        sa:  Suffix array                  for i in range(1, size):
               assert text[sa[i-1]:] < text[sa[i]:]
        rsa: Reverse suffix array          for i in range(size):
               assert rsa[sa[i]] == i
        lcp: Longest common prefix         for i in range(1, size):
               assert text[sa[i-1]:sa[i-1]+lcp[i]] == text[sa[i]:sa[i]+lcp[i]]
               if sa[i-1] + lcp[i] < len(text):
                   assert text[sa[i-1] + lcp[i]] < text[sa[i] + lcp[i]]
    >>> suffix_array(text='banana')
    ([5, 3, 1, 0, 4, 2], [3, 2, 5, 1, 4, 0], [0, 1, 3, 0, 0, 2])

    Explanation: 'a' < 'ana' < 'anana' < 'banana' < 'na' < 'nana'
    The Longest Common String is 'ana': lcp[2] == 3 == len('ana')
    It is between  tx[sa[1]:] == 'ana' < 'anana' == tx[sa[2]:]
    """
    tx = text
    size = len(tx)
    step = min(max(_step, 1), len(tx))
    sa = list(range(len(tx)))
    sa.sort(key=lambda i: tx[i:i + step])
    grpstart = size * [False] + [True]  # a boolean map for iteration speedup.
    # It helps to skip yet resolved values. The last value True is a sentinel.
    rsa = size * [None]
    stgrp, igrp = '', 0
    for i, pos in enumerate(sa):
        st = tx[pos:pos + step]
        if st != stgrp:
            grpstart[igrp] = (igrp < i - 1)
            stgrp = st
            igrp = i
        rsa[pos] = igrp
        sa[i] = pos
    grpstart[igrp] = (igrp < size - 1 or size == 0)
    while grpstart.index(True) < size:
        # assert step <= size
        nextgr = grpstart.index(True)
        while nextgr < size:
            igrp = nextgr
            nextgr = grpstart.index(True, igrp + 1)
            glist = []
            for ig in range(igrp, nextgr):
                pos = sa[ig]
                if rsa[pos] != igrp:
                    break
                newgr = rsa[pos + step] if pos + step < size else -1
                glist.append((newgr, pos))
            glist.sort()
            for ig, g in groupby(glist, key=itemgetter(0)):
                g = [x[1] for x in g]
                sa[igrp:igrp + len(g)] = g
                grpstart[igrp] = (len(g) > 1)
                for pos in g:
                    rsa[pos] = igrp
                igrp += len(g)
        step *= 2
    del grpstart
    # create LCP array
    lcp = size * [None]
    h = 0
    for i in range(size):
        if rsa[i] > 0:
            j = sa[rsa[i] - 1]
            while i != size - h and j != size - h and tx[i + h] == tx[j + h]:
                h += 1
            lcp[rsa[i]] = h
            if h > 0:
                h -= 1
    if size > 0:
        lcp[0] = 0
    return sa, rsa, lcp
Run Code Online (Sandbox Code Playgroud)

我更喜欢这个解决方案而不是更复杂的O(n log n),因为Python有一个非常快速的列表排序(list.sort),可能比该文章的方法中必要的线性时间操作更快,应该是非常的O(n)随机字符串的特殊推定和小字母表(典型的DNA基因组分析).我在Gog 2011中读到,我的算法的最坏情况O(n log n)实际上比许多O(n)算法更快,不能使用CPU内存缓存.

如果文本包含8 kB长的重复字符串,则基于grow_chains的另一个答案中的代码比问题中的原始示例慢19倍.长篇文章不是典型的古典文学,但它们经常出现在例如"独立"的学校作品集中.该计划不应该冻结它.

我编写了一个示例,并使用相同的Python 2.7,3.3 - 3.6 代码进行测试.


jfs*_*jfs 5

将算法翻译成Python:

from itertools import imap, izip, starmap, tee
from os.path   import commonprefix

def pairwise(iterable): # itertools recipe
    a, b = tee(iterable)
    next(b, None)
    return izip(a, b)

def longest_duplicate_small(data):
    suffixes = sorted(data[i:] for i in xrange(len(data))) # O(n*n) in memory
    return max(imap(commonprefix, pairwise(suffixes)), key=len)
Run Code Online (Sandbox Code Playgroud)

buffer()允许在不复制的情况下获取子字符串:

def longest_duplicate_buffer(data):
    n = len(data)
    sa = sorted(xrange(n), key=lambda i: buffer(data, i)) # suffix array
    def lcp_item(i, j):  # find longest common prefix array item
        start = i
        while i < n and data[i] == data[i + j - start]:
            i += 1
        return i - start, start
    size, start = max(starmap(lcp_item, pairwise(sa)), key=lambda x: x[0])
    return data[start:start + size]
Run Code Online (Sandbox Code Playgroud)

在我的机器上需要 5 秒才能完成iliad.mb.txt

原则上,使用由lcp 数组增强的后缀数组可以在 O(n) 时间和 O(n) 内存中找到重复项。


注意:*_memoryview()已被*_buffer()版本弃用

内存效率更高的版本(与longest_duplicate_small()相比):

def cmp_memoryview(a, b):
    for x, y in izip(a, b):
        if x < y:
            return -1
        elif x > y:
            return 1
    return cmp(len(a), len(b))

def common_prefix_memoryview((a, b)):
    for i, (x, y) in enumerate(izip(a, b)):
        if x != y:
            return a[:i]
    return a if len(a) < len(b) else b

def longest_duplicate(data):
    mv = memoryview(data)
    suffixes = sorted((mv[i:] for i in xrange(len(mv))), cmp=cmp_memoryview)
    result = max(imap(common_prefix_memoryview, pairwise(suffixes)), key=len)
    return result.tobytes()
Run Code Online (Sandbox Code Playgroud)

在我的机器上需要 17 秒iliad.mb.txt。结果是:

对此,其余的亚该亚人同声表示尊重
祭司并接受了他所提供的赎金;但阿伽门农却并非如此,
那人对他恶狠狠地说话,粗鲁地把他打发走。

我必须定义自定义函数来比较memoryview对象,因为memoryview比较要么在 Python 3 中引发异常,要么在 Python 2 中产生错误结果:

On this the rest of the Achaeans with one voice were for respecting
the priest and taking the ransom that he offered; but not so Agamemnon,
who spoke fiercely to him and sent him roughly away. 

相关问题:

查找最长的重复字符串及其在给定字符串中的重复次数

在大量字符串中查找长重复子串


bea*_*mit 4

主要问题似乎是 python 通过复制进行切片: /sf/answers/400544791/

您必须使用内存视图来获取引用而不是副本。当我这样做时,程序idx.sort非常快)。

我相信只要做一点工作,你就能让剩下的工作顺利进行。

编辑:

上述更改不能作为直接替代,因为其cmp工作方式与strcmp. 例如,尝试以下 C 代码:

#include <stdio.h>
#include <string.h>

int main() {
    char* test1 = "ovided by The Internet Classics Archive";
    char* test2 = "rovided by The Internet Classics Archive.";
    printf("%d\n", strcmp(test1, test2));
}
Run Code Online (Sandbox Code Playgroud)

并将结果与​​此 python 进行比较:

test1 = "ovided by The Internet Classics Archive";
test2 = "rovided by The Internet Classics Archive."
print(cmp(test1, test2))
Run Code Online (Sandbox Code Playgroud)

C 代码-3在我的机器上打印,而 python 版本则打印-1。看起来示例C代码正在滥用返回值(毕竟strcmp它被使用了)。qsort我找不到任何关于何时strcmp返回除 以外的内容的文档[-1, 0, 1],但在原始代码中添加 aprintf显示pstrcmp了许多超出该范围的值(3、-31、5 是前 3 个值)。

为了确保这-3不是一些错误代码,如果我们反转 test1 和 test2,我们将得到3.

编辑:

上面是一些有趣的琐事,但在影响任一代码块方面实际上并不正确。当我关闭笔记本电脑并离开 Wi-Fi 区域时,我意识到了这一点……真的应该在点击之前仔细检查所有内容Save

FWIW,cmp最肯定适用于memoryview对象(-1按预期打印):

print(cmp(memoryview(test1), memoryview(test2)))
Run Code Online (Sandbox Code Playgroud)

我不确定为什么代码没有按预期工作。在我的机器上打印出的列表看起来不符合预期。我会研究这个问题并尝试找到更好的解决方案,而不是抓住救命稻草。

  • Bentley 的代码*不*滥用 `strcmp`。它只是用它来比较“qsort”中的字符串,而“qsort”除了返回值的*符号*之外,从不依赖任何东西。 (3认同)
  • 内存视图比较不起作用。请参阅[我的答案](http://stackoverflow.com/a/13574862/4279)中的示例 (2认同)

归档时间:

查看次数:

4811 次

最近记录:

7 年,8 月 前