Similarity score of two lists of strings

Tas*_*sos 2 python comparison similarity string-comparison fuzzy-comparison

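I have one list of strings that serves as a query, and hundreds of other lists of strings. I want to compare the query with each of the other lists and extract a similarity score between them.

Example: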

query = ["football", "basketball", "martial arts", "baseball"]

list1 = ["apple", "football", "basketball court"]

list2 = ["ball"]

list3 = ["martial-arts", "baseball", "banana", "food", "doctor"]

What I do now is an exact comparison of the items, and I am not satisfied with the results.

score = 0
for i in query:
   if i in list1:
      score += 1

score_of_list1 = score*100//len(list1)

I found a fuzzy-matching library that could help me, but I was wondering whether you have any other approach to suggest.

Reu*_*ani 5

If you are looking for a way to measure the similarity between strings, this question suggests the Levenshtein distance as one approach.

There is a ready-made solution, and one is also available in the Natural Language Toolkit (NLTK) library.
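For readers who want to see what the Levenshtein distance computes without installing anything, here is a minimal pure-Python sketch. The function names `levenshtein` and `similarity` are illustrative and not taken from any library, and dividing by the longer string's length is just one common way to turn the distance into a 0-to-1 score.

```python
def levenshtein(a, b):
    """Return the edit distance between strings a and b."""
    if len(a) < len(b):
        a, b = b, a  # keep the inner row as short as possible
    previous = list(range(len(b) + 1))  # distances from "" to each prefix of b
    for i, ca in enumerate(a, start=1):
        current = [i]  # distance from a[:i] to the empty string
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(
                previous[j] + 1,         # deletion
                current[j - 1] + 1,      # insertion
                previous[j - 1] + cost,  # substitution (or match)
            ))
        previous = current
    return previous[-1]


def similarity(a, b):
    """Map the distance to a score in [0, 1]; 1.0 means identical strings."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0  # two empty strings are treated as identical
    return 1.0 - levenshtein(a, b) / longest


print(levenshtein("martial arts", "martial-arts"))
print(similarity("football", "ball"))
```

With a normalized score like this, near-duplicates such as "martial arts" and "martial-arts" score close to 1 instead of counting as a miss, which is where the exact comparison in the question falls short.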

A naive integration would be the following (I use random numbers just to get a result; the output is obviously meaningless):

#!/usr/bin/env python
from random import random

query = ["football", "basketball", "martial arts", "baseball"]
lists = [["apple", "football", "basketball court"], ["ball"], ["martial-arts", "baseball", "banana", "food", "doctor"]]

def fake_levenshtein(word1, word2):
    # Placeholder: returns a random number instead of a real string distance.
    return random()

def avg_list(l):
    # Arithmetic mean of the collected pairwise scores.
    return sum(l) / len(l)

for l in lists:
    score = []
    # Score every word in the list against every word in the query.
    for w1 in l:
        for w2 in query:
            score.append(fake_levenshtein(w1, w2))
    print(avg_list(score))
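As a variation on the loop above, the sketch below swaps the random placeholder for a real measure from the standard library, `difflib.SequenceMatcher`. This is not the Levenshtein distance but a different similarity ratio, used here only because it needs no extra install. It also makes one assumption that the code above does not: each query word is scored by its best match in the list, and those best scores are averaged, rather than averaging over all pairs. The name `list_score` is hypothetical.

```python
from difflib import SequenceMatcher

query = ["football", "basketball", "martial arts", "baseball"]
lists = [
    ["apple", "football", "basketball court"],
    ["ball"],
    ["martial-arts", "baseball", "banana", "food", "doctor"],
]


def similarity(word1, word2):
    # Ratio in [0, 1]; 1.0 means the two strings are identical.
    return SequenceMatcher(None, word1, word2).ratio()


def list_score(query, candidate):
    # Assumption: score each query word by its best match in the candidate
    # list, then average those best scores over the query.
    if not query or not candidate:
        return 0.0
    best = [max(similarity(q, w) for w in candidate) for q in query]
    return sum(best) / len(best)


scores = [list_score(query, l) for l in lists]
for l, s in zip(lists, scores):
    print(round(s, 2), l)
```

Taking the best match per query word keeps unrelated entries such as "banana" or "doctor" from dragging a list's score down, which an average over all pairs would do. Whether that is the right aggregation depends on what "similar lists" should mean for your data.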

Good luck.