How to find similar substrings in a large string with a similarity score in Python?

Question

San*_*ath 3 python string nlp distance similarity

I am looking for more than a simple similarity score between two texts: I want a similarity score for a substring within a string. Say:

text1 = 'cat is sleeping on the mat'

text2 = 'The cat is sleeping on the red mat in the living room'

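In the example above, every word of text1 appears in text2, so the similarity should be 100%.

If some words of text1 are missing from text2, the score should be lower.

I am working with a large dataset of paragraphs of varying sizes, so finding a smaller paragraph inside a bigger one with such a similarity score is crucial.

I have only found string similarity measures that compare two whole strings, such as cosine similarity and difflib similarity, but nothing that scores a substring inside another string.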

Answer 1

Dar*_*nus 5

Based on your description, how about:

>>> a = "cat is sleeping on the mat"
>>> b = "the cat is sleeping on the red mat in the living room"
>>> a = a.split(" ")
>>> score = 0.0
>>> for word in a:  # for every word in the smaller string
...     if word in b:  # b is still a str here, so this is a substring test
...         score += 1
...
>>> score / len(a)  # fraction of the words that were found
1.0

If it contains a word that is missing from the larger string, for example:

>>> c = "the cat is not sleeping on the mat"
>>> c = c.split(" ")
>>> score = 0.0
>>> for w in c:
...     if w in b:  # "not" does not occur anywhere in b
...         score += 1
...
>>> score / len(c)
0.875

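Additionally, you can do as @roadrunner suggests: split b and store it as a set with b = set(b.split(" ")). This makes each membership test O(1) on average and brings the overall algorithm to O(n) complexity. It also makes the `in` check match whole words rather than substrings.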
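A minimal sketch of the set-based variant described above. The function name `word_overlap_score`, the whitespace tokenization, and the lowercasing are assumptions made for illustration, not part of the original answer:

```python
def word_overlap_score(small, large):
    """Fraction of the words in `small` that occur as whole words in `large`."""
    small_words = small.lower().split()
    if not small_words:
        return 0.0  # nothing to match, so avoid dividing by zero
    # A set gives O(1) average-time membership tests.
    large_words = set(large.lower().split())
    hits = sum(1 for word in small_words if word in large_words)
    return hits / len(small_words)


b = "The cat is sleeping on the red mat in the living room"
print(word_overlap_score("cat is sleeping on the mat", b))          # 1.0
print(word_overlap_score("the cat is not sleeping on the mat", b))  # 0.875
```

Because the lookup is against a set of words, a fragment such as "at" no longer counts as a match just because it appears inside "cat" or "mat".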

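Edit: You said you have already tried some metrics, such as cosine similarity. However, I suspect you may benefit from looking at Levenshtein distance, which could be useful here as a complement to the solution provided.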
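One possible way to combine Levenshtein distance with the word-overlap score is to count a word as found when it is within a small edit distance of some word in the larger text. The sketch below is pure Python; the names `levenshtein` and `fuzzy_overlap_score` and the `max_distance` parameter are hypothetical, and the tokenization is again a simple whitespace split:

```python
def levenshtein(s, t):
    """Edit distance between s and t (insertions, deletions, substitutions)."""
    previous = list(range(len(t) + 1))
    for i, cs in enumerate(s, start=1):
        current = [i]
        for j, ct in enumerate(t, start=1):
            cost = 0 if cs == ct else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def fuzzy_overlap_score(small, large, max_distance=1):
    """Fraction of words in `small` within `max_distance` edits of a word in `large`."""
    small_words = small.lower().split()
    if not small_words:
        return 0.0
    large_words = set(large.lower().split())
    hits = sum(
        1 for word in small_words
        if any(levenshtein(word, candidate) <= max_distance
               for candidate in large_words)
    )
    return hits / len(small_words)


b = "the cat is sleeping on the red mat in the living room"
print(levenshtein("kitten", "sitting"))                      # 3
print(fuzzy_overlap_score("cat is sleping on the matt", b))  # 1.0
```

This compares every word of the smaller text against every distinct word of the larger one, so it is slower than the exact set lookup. Short words also match each other easily at distance 1 (for example "is" and "in"), so a stricter threshold for short words may be worth considering.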

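Great answer, definitely the simplest approach. I think you need to split `b` here? (2 upvotes)
Yeah, you're right. This gives the OP a good starting point, and he/she can choose what to use and what not to use depending on the texts he/she is working with. (2 upvotes)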
