Python Difflib的SequenceMatcher找不到最长的公共子字符串

Question

Python Difflib的SequenceMatcher找不到最长的公共子字符串

Luk*_*rth 4 python difflib longest-substring

我想用来difflib.SequenceMatcher从两个字符串中提取最长的公共子字符串。我不确定是否发现错误或对的文档有误解find_longest_match。这一点令我感到困惑：

换句话说，在所有最大匹配块中，返回最早在a中开始的一个，在所有最大匹配块中最早在a中开始的，返回最早在b中的一个。

（https://docs.python.org/3.5/library/difflib.html#difflib.SequenceMatcher.find_longest_match）

比较字符串X this is a test和this is a test X，子字符串X实际上是一个最大的块：它不能扩展（即，它是包含最大的）。此外，它是文本A中的第一个此类最大块。但是，它肯定不是最长的公共子字符串。我强烈怀疑这不是find_longest_match应该找到的。

实际上，在此示例中，find_longest_match确实找到了最长的公共子字符串：

>>> l = len("X this is a test")
>>> matcher = difflib.SequenceMatcher(None, "X this is a test", "this is a test X")
>>> matcher.find_longest_match(0, l, 0, l)
Match(a=2, b=0, size=14)

Run Code Online (Sandbox Code Playgroud)

但是，对于其他字符串，我似乎可以挑起上面描述的“查找第一个最大块”（对长字符串，很抱歉，如果我将它们缩短，示例会以某种方式中断）：

>>> s1 = "e-like graph visualization using a spanning tree-driven layout technique with constraints specified by layers and the ordering of groups of nodes within layers. We propose a new method of how the orde"
>>> s2 = "itree graph visualization using a spanning tree-driven layout technique with constraints speci ed by layers and the ordering of groups of nodes within layers. We propose a new method of how the drivin"
>>> matcher = difflib.SequenceMatcher(None, s1, s2)
>>> matcher.find_longest_match(1, 149, 5, 149)
Match(a=1, b=47, size=1)

Run Code Online (Sandbox Code Playgroud)

在这种情况下，它将第一个-in 匹配s1[1]到-in s2[47]，仅此而已。最长的公共子字符串可能是以graph visualization using…

我是否发现错误，或者描述这种行为的文档是否还有其他可能的解释？

我在Ubuntu上使用Python 3.5.2。

Answer 1

Luk*_*rth 6

好吧，我知道了。如果有人遇到相同的问题：SequenceMatcher的autojunk参数会做一些奇怪的事情：

试探法计算每个单个项目出现在序列中的次数。如果某项的重复项（在第一个项之后）占序列的1％以上，并且该序列的长度至少为200，则该项目被标记为“受欢迎”，并且出于序列匹配的目的被视为垃圾。

据我所知，匹配器永远不会找到包含“垃圾”的任何匹配项。不确定为什么有用，但是默认情况下启用了它。这也解释了为什么当我将字符串缩短时，上面的示例会中断。但是，它确实大大加快了LCS搜索的速度。

因此，结论是：您可能想要传递autojunk=False构造函数。

然而，文档说“为 isjunk 传递 None 相当于传递 lambda x: False；换句话说，没有元素被忽略。”。他们确实应该提到元素*将*被自动垃圾算法忽略，尽管为第一个参数传递 None 。 (2认同)

归档时间：	7 年，5 月前
查看次数：	452 次
最近记录：	7 年，5 月前