use*_*453 12 python string comparison sequencematcher
SequenceMatcher根据参数的顺序,我对两个不同的答案感到有些困惑.为什么会这样?
SequenceMatcher不可交换:
>>> from difflib import SequenceMatcher
>>> SequenceMatcher(None, "Ebojfm Mzpm", "Ebfo ef Mfpo").ratio()
0.6086956521739131
>>> SequenceMatcher(None, "Ebfo ef Mfpo", "Ebojfm Mzpm").ratio()
0.5217391304347826
Run Code Online (Sandbox Code Playgroud)
SequenceMatcher.ratio内部用于SequenceMatcher.get_matching_blocks计算比率,我将逐步引导您看一下如何发生:
返回描述匹配子序列的三元组列表。每个三元组的形式都
(i, j, n)意味着a[i:i+n] == b[j:j+n]。三元在单调递增i和j。最后一个三元组是一个虚拟对象,并具有值
(len(a), len(b), 0)。它是的唯一三元组n == 0。If (i, j, n)和(i', j', n')是列表中的相邻三元组,第二个不是列表中的最后一个三元组,则为i+n != i'或j+n != j';换句话说,相邻的三元组始终描述不相邻的等价块。
ratio内部使用SequenceMatcher.get_matching_blocks的结果,并对所返回的所有匹配序列的大小求和SequenceMatcher.get_matching_blocks。这是来自的确切源代码difflib.py:
matches = sum(triple[-1] for triple in self.get_matching_blocks())
Run Code Online (Sandbox Code Playgroud)
上面的行很关键,因为上面的表达式的结果用于计算比率。我们将很快看到这一点以及它如何影响比率的计算。
m1 = SequenceMatcher(None, "Ebojfm Mzpm", "Ebfo ef Mfpo")
m2 = SequenceMatcher(None, "Ebfo ef Mfpo", "Ebojfm Mzpm")
matches1 = sum(triple[-1] for triple in m1.get_matching_blocks()) # matches1=7
matches2 = sum(triple[-1] for triple in m2.get_matching_blocks()) #matches=6
Run Code Online (Sandbox Code Playgroud)
如您所见,我们有7和6。这些只是所返回的匹配子序列的总和get_matching_blocks。为什么这么重要?这就是为什么以以下方式计算比率(这是从difflib源代码中得出的):
def _calculate_ratio(matches, length):
if length:
return 2.0 * matches / length
return 1.0
Run Code Online (Sandbox Code Playgroud)
length是len(a) + len(b)哪里a,第一个序列b是第二个序列。
好的,足够多的谈话,我们需要采取行动:
>>> length = len("Ebojfm Mzpm") + len("Ebfo ef Mfpo")
>>> m1.ratio()
0.6086956521739131
>>> (2.0 * matches1 / length) == m1.ratio()
True
Run Code Online (Sandbox Code Playgroud)
同样适用于m2:
>>> 2.0 * matches2 / length
0.5217391304347826
>>> (2.0 * matches2 / length) == m2.ratio()
True
Run Code Online (Sandbox Code Playgroud)
注:并非所有的 SequenceMatcher(None a,b).ratio() == SequenceMatcher(None b,a).ratio()都是False,有时也可以是True:
>>> s1 = SequenceMatcher(None, "abcd", "bcde").ratio()
>>> s2 = SequenceMatcher(None, "bcde", "abcd").ratio()
>>> s1 == s2
True
Run Code Online (Sandbox Code Playgroud)
如果您想知道为什么,这是因为
sum(triple[-1] for triple in self.get_matching_blocks())
Run Code Online (Sandbox Code Playgroud)
对于两个相同的SequenceMatcher(None, "abcd", "bcde")和SequenceMatcher(None, "bcde", "abcd")是3。
My answer does not provide exact details of the observed problem, but contains a general explanation of why such things may happen with loosely defined diffing methods.
Essentially everything boils down to the fact that, in the general case,
more than one common subsequences of the same length can be extracted from a given pair of strings, and
longer common subsequences may appear less natural to a human expert than a shorter one.
Since you are puzzled by this particular case let's analyze common subsequence identification on the following pair of strings:
my stackoverflow mysteriesmysteryTo me, the natural match is "MYSTER", as follows:
my stackoverflow MYSTERies
.................MYSTERy..
Run Code Online (Sandbox Code Playgroud)
However, the longest match fully covers the shorter of the two strings as follows:
MY STackovERflow mYsteries
MY.ST.....ER......Y.......
Run Code Online (Sandbox Code Playgroud)
The drawback of such a match is that it introduces multiple matching sub-blocks whereas the (shorter) natural match is contiguous.
Therefore, diffing algorithms are tweaked so that their outputs are more pleasing to the final user. As a result, they are not 100% mathematically elegant and therefore don't possess properties that you would expect from a purely academic (rather than practical) tool.
The documentation of SequenceMatcher contains a corresponding note:
class difflib.SequenceMatcher这是一个灵活的类,用于比较任何类型的序列对,只要序列元素是可散列的。基本算法早于 Ratcliff 和 Obershelp 在 1980 年代后期以双曲线名称“格式塔模式匹配”发布的算法,并且比它更高级一些。这个想法是找到不包含“垃圾”元素的最长连续匹配子序列(Ratcliff 和 Obershelp 算法不处理垃圾)。然后将相同的想法递归地应用于匹配子序列左侧和右侧的序列片段。 这不会产生最少的编辑序列,但确实会产生对人们来说“看起来正确”的匹配。
| 归档时间: |
|
| 查看次数: |
3971 次 |
| 最近记录: |