How to calculate the edit distance between sentences in a text

0 python nlp

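I want to calculate the edit distance between the sentences in a document. I found code that computes the distance at the character level, but I want it at the word level. For example, the character-level output for the two sentences in this code is 6, but I want it to be 1: to change b into a, or a into b, only one word has to be deleted (or inserted).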

    a = "The patient tolerated this ."
    b = "The patient tolerated ."

    def levenshtein_distance(a, b):
        if a == b:
            return 0
        # Make sure a is the longer of the two sequences.
        if len(a) < len(b):
            a, b = b, a
        # If the shorter sequence is empty, every item of a must be deleted.
        if not b:
            return len(a)
        previous_row = range(len(b) + 1)
        for i, column1 in enumerate(a):
            current_row = [i + 1]
            for j, column2 in enumerate(b):
                insertions = previous_row[j + 1] + 1
                deletions = current_row[j] + 1
                substitutions = previous_row[j] + (column1 != column2)
                current_row.append(min(insertions, deletions, substitutions))
            # Advance to the next row only after the inner loop has finished.
            previous_row = current_row
        print(previous_row[-1])
        return previous_row[-1]

    result = levenshtein_distance(a, b)

Daw*_*weo 7

I suggest not reinventing the wheel: you can use pylev (https://pypi.org/project/pylev/). To install it, just run pip install pylev in a console. Then compute the distance over words instead of letters:

    import pylev

    a = "The patient tolerated this ."
    b = "The patient tolerated ."
    a = a.split(" ")
    b = b.split(" ")
    print(pylev.levenshtein(a, b))

Keep in mind that this solution is case-sensitive and assumes that all words are separated by spaces.
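If the case sensitivity or the space-only splitting is a problem, one option is to normalize the sentences before comparing them. The following is only a minimal, dependency-free sketch: the helper names tokenize, sequence_edit_distance and word_edit_distance are made up for illustration, and lowercasing plus splitting on any whitespace is an assumption about what "the same word" should mean in your data. The distance function is the usual row-by-row Levenshtein recurrence applied to lists of words instead of characters.

```python
def tokenize(sentence):
    """Lowercase and split on any run of whitespace (an illustrative choice)."""
    return sentence.lower().split()


def sequence_edit_distance(a, b):
    """Levenshtein distance between two sequences, e.g. two lists of words."""
    # Keep the shorter sequence in b so each row is as short as possible.
    if len(a) < len(b):
        a, b = b, a
    previous_row = list(range(len(b) + 1))
    for i, item_a in enumerate(a):
        current_row = [i + 1]
        for j, item_b in enumerate(b):
            insertion = previous_row[j + 1] + 1
            deletion = current_row[j] + 1
            substitution = previous_row[j] + (item_a != item_b)
            current_row.append(min(insertion, deletion, substitution))
        previous_row = current_row
    return previous_row[-1]


def word_edit_distance(s1, s2):
    """Word-level edit distance after the normalization in tokenize()."""
    return sequence_edit_distance(tokenize(s1), tokenize(s2))


# One word ("this") is missing from the second sentence.
print(word_edit_distance("The patient tolerated this .", "The patient tolerated ."))  # 1
# Differences in case and extra spaces are ignored by tokenize().
print(word_edit_distance("the  Patient tolerated this .", "The patient tolerated this ."))  # 0
```

Punctuation is still treated as part of a token here, so "tolerated." and "tolerated ." would differ. If that matters, a proper tokenizer would be a better fit than str.split().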