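I want to compute the edit distance between sentences in a document. I found code that computes the distance at the character level, but I want it at the word level.
For example, the character-level output here is 6, but I want it to be 1, meaning that to change b into a or a into b, only a single word has to be deleted: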

a = "The patient tolerated this ."
b = "The patient tolerated ."

def levenshtein_distance(a, b):
    if a == b:
        return 0
    if len(a) < len(b):
        a, b = b, a
    if not a:
        return len(b)
    previous_row = range(len(b) + 1)
    for i, column1 in enumerate(a):
        current_row = [i + 1]
        for j, column2 in enumerate(b):
            insertions = previous_row[j + 1] + 1
            deletions = current_row[j] + 1
            substitutions = previous_row[j] + (column1 != column2)
            current_row.append(min(insertions, deletions, substitutions))
        previous_row = current_row
    print(previous_row[-1])
    return previous_row[-1]

result = levenshtein_distance(a, b)

I would suggest not reinventing the wheel; you can use pylev: https://pypi.org/project/pylev/
To install it, run pip install pylev in a console. Then compute the distance over words instead of letters:
import pylev
a = "The patient tolerated this ."
b = "The patient tolerated ."
a = a.split(" ")
b = b.split(" ")
print(pylev.levenshtein(a,b))
Keep in mind that this solution is case-sensitive and assumes that all words are separated by single spaces.
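If you would rather avoid the extra dependency, or want to relax the two caveats above, here is a minimal sketch that reuses the row-by-row algorithm from the question but compares words instead of characters. The function name `word_levenshtein` and the `ignore_case` parameter are my own and purely illustrative; the sketch assumes that lowercasing and splitting on any whitespace is an acceptable tokenization for your sentences.

```python
def word_levenshtein(a, b, ignore_case=True):
    """Word-level Levenshtein distance between two sentences (illustrative helper)."""
    if ignore_case:
        a, b = a.lower(), b.lower()
    # str.split() with no argument splits on any run of whitespace,
    # so repeated spaces or tabs do not produce empty "words".
    words_a, words_b = a.split(), b.split()

    # previous_row[j] is the distance between the words of a seen so far
    # and the first j words of b.
    previous_row = list(range(len(words_b) + 1))
    for i, word_a in enumerate(words_a):
        current_row = [i + 1]
        for j, word_b in enumerate(words_b):
            insertion = current_row[j] + 1
            deletion = previous_row[j + 1] + 1
            substitution = previous_row[j] + (word_a != word_b)
            current_row.append(min(insertion, deletion, substitution))
        previous_row = current_row
    return previous_row[-1]


print(word_levenshtein("The patient tolerated this .", "The patient tolerated ."))  # 1
```

Punctuation still counts as a word only when it is separated by whitespace, as in the example sentences; real text would likely need a proper tokenizer.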