Match changes by word, not by character

2 python string diff difflib python-3.x

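I am using difflib's SequenceMatcher to get_opcodes() and then highlight the changes with CSS to create some kind of web diff.

First, I set a min_delta so that I consider two strings different only if 3 or more characters in the whole string differ, one after another (delta means the real, encountered delta, which sums up all the one-character changes):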

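from difflib import SequenceMatcher

# The first positional argument of SequenceMatcher is isjunk, so pass None explicitly.
matcher = SequenceMatcher(None, source_str, diff_str)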
min_delta = 3
delta = 0

for tag, i1, i2, j1, j2 in matcher.get_opcodes():
    if tag == "equal":
        continue  # nothing to capture here
    elif tag == "delete":
        if source_str[i1:i2].isspace():
            continue  # be whitespace-agnostic
        else:
            delta += (i2 - i1)  # delete i2-i1 chars
    elif tag == "replace":
        if source_str[i1:i2].isspace() or diff_str[j1:j2].isspace():
            continue  # be whitespace-agnostic
        else:
            delta += (i2 - i1)  # replace i2-i1 chars
    elif tag == "insert":
        if diff_str[j1:j2].isspace():
            continue  # be whitespace-agnostic
        else:
            delta += (j2 - j1)  # insert j2-j1 chars

return_value = delta > min_delta

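This helps me determine whether two strings really differ. It is not very efficient, but I could not think of anything better.

Then I colorize the differences between the two strings in the same way: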

for tag, i1, i2, j1, j2 in matcher.get_opcodes():
    if tag == "equal":
        # bustling with strings, inserting them in <span>s and colorizing
        pass
    elif tag == "delete":
        # ...
        pass

return_value = old_string, new_string

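The result looks ugly (blue for replaced, green for new, red for deleted, nothing for equal):

[screenshot: example]

So, this happens because SequenceMatcher matches character by character. But I want it to match word by word (probably including the whitespace around the words), or something even nicer to look at, because as you can see in the screenshot, the first book has actually moved to the fourth position.

It seems to me that something could be done with the isjunk and autojunk parameters of SequenceMatcher, but I can't figure out how to write the lambdas for my purpose.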

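Thus, I have two questions:

  1. Is it possible to match by words? Can it be done with get_opcodes() and SequenceMatcher? If not, what could be used instead?

  2. Okay, this is more of a corollary, but nevertheless: if matching by words is possible, then I can get rid of the dirty min_delta hack and return True as soon as at least one word differs, right?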

Ann*_*wan 7

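SequenceMatcher can accept a list of str as input.

You can first split the input into words and then use SequenceMatcher to diff the word lists. Your colored diff will then be by words instead of by characters.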

>>> from difflib import SequenceMatcher
>>> def my_get_opcodes(a, b):
...     s = SequenceMatcher(None, a, b)
...     for tag, i1, i2, j1, j2 in s.get_opcodes():
...         print('{:7}   a[{}:{}] --> b[{}:{}] {!r:>8} --> {!r}'.format(
...             tag, i1, i2, j1, j2, a[i1:i2], b[j1:j2]))
... 

>>> my_get_opcodes("qabxcd", "abycdf")
delete    a[0:1] --> b[0:0]      'q' --> ''
equal     a[1:3] --> b[0:2]     'ab' --> 'ab'
replace   a[3:4] --> b[2:3]      'x' --> 'y'
equal     a[4:6] --> b[3:5]     'cd' --> 'cd'
insert    a[6:6] --> b[5:6]       '' --> 'f'

# This is the bad result you currently have.
>>> my_get_opcodes("one two three\n", "ore tree emu\n")
equal     a[0:1] --> b[0:1]      'o' --> 'o'
replace   a[1:2] --> b[1:2]      'n' --> 'r'
equal     a[2:5] --> b[2:5]    'e t' --> 'e t'
delete    a[5:10] --> b[5:5]  'wo th' --> ''
equal     a[10:13] --> b[5:8]    'ree' --> 'ree'
insert    a[13:13] --> b[8:12]       '' --> ' emu'
equal     a[13:14] --> b[12:13]     '\n' --> '\n'

>>> my_get_opcodes("one two three\n".split(), "ore tree emu\n".split())
replace   a[0:3] --> b[0:3] ['one', 'two', 'three'] --> ['ore', 'tree', 'emu']

# This may be the result you want.
>>> my_get_opcodes("one two emily three ha\n".split(), "ore tree emily emu haha\n".split())
replace   a[0:2] --> b[0:2] ['one', 'two'] --> ['ore', 'tree']
equal     a[2:3] --> b[2:3] ['emily'] --> ['emily']
replace   a[3:5] --> b[3:5] ['three', 'ha'] --> ['emu', 'haha']

# A more complicated example exhibiting all four kinds of opcodes.
>>> my_get_opcodes("one two emily three yo right end\n".split(), "ore tree emily emu haha yo yes right\n".split())
replace   a[0:2] --> b[0:2] ['one', 'two'] --> ['ore', 'tree']
equal     a[2:3] --> b[2:3] ['emily'] --> ['emily']
replace   a[3:4] --> b[3:5] ['three'] --> ['emu', 'haha']
equal     a[4:5] --> b[5:6]   ['yo'] --> ['yo']
insert    a[5:5] --> b[6:7]       [] --> ['yes']
equal     a[5:6] --> b[7:8] ['right'] --> ['right']
delete    a[6:7] --> b[8:8]  ['end'] --> []
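To connect this back to the two questions, here is a minimal sketch of how the word-level opcodes could drive the coloring. The CSS class names and the helper names (wrap, colorize, words_differ) are made up for illustration, since the question does not show its markup, and whitespace is normalized to single spaces, which may or may not be what you want.

```python
from difflib import SequenceMatcher
from html import escape

# Hypothetical CSS class names; the question does not show the real ones.
CSS_CLASS = {
    "replace": "diff-replace",
    "delete": "diff-delete",
    "insert": "diff-insert",
}


def words_differ(old, new):
    """Return True if at least one word differs, ignoring whitespace."""
    return old.split() != new.split()


def wrap(words, tag):
    """Join a run of words and wrap it in a <span> unless it is unchanged."""
    text = escape(" ".join(words))
    if tag == "equal":
        return text
    return '<span class="{}">{}</span>'.format(CSS_CLASS[tag], text)


def colorize(old, new):
    """Return (old_html, new_html) with changed word runs wrapped in spans."""
    a, b = old.split(), new.split()
    # autojunk=False switches off the popularity heuristic for long inputs.
    matcher = SequenceMatcher(None, a, b, autojunk=False)
    old_parts, new_parts = [], []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if i2 > i1:  # skip the empty side of an insert
            old_parts.append(wrap(a[i1:i2], tag))
        if j2 > j1:  # skip the empty side of a delete
            new_parts.append(wrap(b[j1:j2], tag))
    return " ".join(old_parts), " ".join(new_parts)


old_html, new_html = colorize("one two emily three ha\n",
                              "ore tree emily emu haha\n")
print(old_html)
print(new_html)
```

With word lists, the min_delta check from the question seems to reduce to comparing the two lists, which is all words_differ does, so no opcode pass is needed just to decide whether the strings differ.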

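You can also diff by line, by book, or by segment. You only need to prepare a function that preprocesses the whole passage string into the diff chunks you want.

For example:

  • To diff by line - you can probably use splitlines()
  • To diff by book - you can probably implement a function that strips off the 1., 2.
  • To diff by segment - you can throw the data into the API like this: ([book_1, author_1, year_1, book_2, author_2, ...], [book_1, author_1, year_1, book_2, author_2, ...]). Then your coloring will be by segment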
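The first two options could be sketched roughly as below. The function names (by_line, by_book, chunk_opcodes) and the sample book list are hypothetical, and by_book assumes one numbered entry per line, which is only a guess at the layout in the screenshot.

```python
import re
from difflib import SequenceMatcher


def by_line(text):
    """One chunk per line."""
    return text.splitlines()


def by_book(text):
    """One chunk per non-empty line, with a leading '1.', '2.', ... stripped.

    Assumes each book entry sits on its own numbered line.
    """
    chunks = []
    for line in text.splitlines():
        line = re.sub(r"^\s*\d+\.\s*", "", line).strip()
        if line:
            chunks.append(line)
    return chunks


def chunk_opcodes(old, new, chunker):
    """Diff two texts chunk by chunk; return (tag, old_chunks, new_chunks)."""
    a, b = chunker(old), chunker(new)
    matcher = SequenceMatcher(None, a, b, autojunk=False)
    return [(tag, a[i1:i2], b[j1:j2])
            for tag, i1, i2, j1, j2 in matcher.get_opcodes()]


# Made-up sample data: the first book moves to the last position.
old = "1. Book A\n2. Book B\n3. Book C\n"
new = "1. Book B\n2. Book C\n3. Book A\n"
for op in chunk_opcodes(old, new, by_book):
    print(op)
```

Note that SequenceMatcher has no notion of a move: a book that changes position is reported as a delete at the old position plus an insert at the new one, so you would have to pair those up yourself if you want to color it as "moved".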