Ary*_*rya 1 python string similarity sentence-similarity
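Is there a function in Python that accepts multiple strings and returns a percentage describing how similar they are? Something like SequenceMatcher, but for more than two strings.
For example, suppose we have the following sentences: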
Hello how are you?
Hi how are you?
hi how are you doing?
Hey how is your day?
I would like to get a percentage based on how similar the sentences are to each other.
Say we have these three sentences:
Hello how are you?
Hello how are you?
Hello how are you?
Then we should get 100% similarity.
But if we have
Hello how are you?
Hello how are you?
hola como estats?
Then we should get a similarity of around 67%.
You can do this with a pandas DataFrame: use itertools.combinations to build every 2-string combination from the list, and difflib.SequenceMatcher to compute the similarity of each pair:
import itertools
from difflib import SequenceMatcher

import pandas as pd


def similarity(a, b):
    # ratio() returns a float in [0, 1]; 1.0 means the strings are identical
    seq = SequenceMatcher(a=a, b=b)
    return seq.ratio()


strings = ['Hello how are you?', 'Hi how are you?', 'hi how are you doing?', 'Hey how is your day?']
# every unordered pair of strings, one pair per row (columns 0 and 1)
combinations = itertools.combinations(strings, 2)
df = pd.DataFrame(list(combinations))
df['similarity'] = df.apply(lambda x: similarity(x[0], x[1]), axis=1)
# average similarity over all pairs
df.similarity.mean()
0.68
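The same pairwise-average idea can be sketched with the standard library only, wrapped in a function that returns a percentage as the question asks. The function name `mean_pairwise_similarity` and the choice to return 100 for fewer than two strings are assumptions made for this illustration; they are not part of the answer above.

```python
from difflib import SequenceMatcher
from itertools import combinations
from statistics import mean


def mean_pairwise_similarity(strings):
    """Average SequenceMatcher ratio over all unordered pairs, as a percentage."""
    pairs = list(combinations(strings, 2))
    if not pairs:
        # Assumption: with fewer than two strings there is nothing to compare,
        # so the input is treated as fully similar.
        return 100.0
    return 100.0 * mean(SequenceMatcher(a=a, b=b).ratio() for a, b in pairs)


identical = ['Hello how are you?'] * 3
print(mean_pairwise_similarity(identical))  # 100.0

mixed = ['Hello how are you?', 'Hello how are you?', 'hola como estats?']
print(mean_pairwise_similarity(mixed))
```

For the mixed list, the result is the mean of one perfect pair and two pairs that compare the English sentence with the Spanish one. Because SequenceMatcher works on characters, those two pairs still score above zero, so the number depends on the character overlap and need not come out at 67%.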