Calculating the similarity between lists of words

You*_*ani 2 python similarity data-mining text-mining

I want to calculate the similarity between two lists of words. For example:

['email','user','this','email','address','customer']

is similar to this list:

['email','mail','address','netmail']

I would like it to get a higher similarity percentage than another list, for example ['address','ip','network'], even though address appears in that list.

Dir*_*Bit 10

Since you haven't really shown a clear expected output, here is my best shot:

list_A = ['email','user','this','email','address','customer']
list_B = ['email','mail','address','netmail']

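For the two lists above, we find the cosine similarity between each element of one list and every element of the other, i.e. email from list_B against every element of list_A.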

from collections import Counter
from math import sqrt

def word2vec(word):
    # note: despite the name, this builds a character-count vector,
    # not a Word2Vec embedding

    # count the characters in the word
    cw = Counter(word)
    # set of the distinct characters
    sw = set(cw)
    # Euclidean length of the character-count vector
    lw = sqrt(sum(c*c for c in cw.values()))

    # return a tuple: (counts, distinct characters, length)
    return cw, sw, lw

def cosdis(v1, v2):
    # which characters are common to the two words?
    common = v1[1].intersection(v2[1])
    # cosine similarity: dot product divided by the product of the lengths
    return sum(v1[0][ch]*v2[0][ch] for ch in common)/v1[2]/v2[2]


list_A = ['email','user','this','email','address','customer']
list_B = ['email','mail','address','netmail']

threshold = 0.80     # if needed
for key in list_A:
    for word in list_B:
        # cosdis cannot raise IndexError, so no try/except is needed here
        res = cosdis(word2vec(word), word2vec(key))
        print("The cosine similarity between : {} and : {} is: {}".format(word, key, res*100))
        # if res > threshold:
        #     print("Found a word with cosine similarity > 80 : {} with original word: {}".format(word, key))

Output

The cosine similarity between : email and : email is: 100.0
The cosine similarity between : mail and : email is: 89.44271909999159
The cosine similarity between : address and : email is: 26.967994498529684
The cosine similarity between : netmail and : email is: 84.51542547285166
The cosine similarity between : email and : user is: 22.360679774997898
The cosine similarity between : mail and : user is: 0.0
The cosine similarity between : address and : user is: 60.30226891555272
The cosine similarity between : netmail and : user is: 18.89822365046136
The cosine similarity between : email and : this is: 22.360679774997898
The cosine similarity between : mail and : this is: 25.0
The cosine similarity between : address and : this is: 30.15113445777636
The cosine similarity between : netmail and : this is: 37.79644730092272
The cosine similarity between : email and : email is: 100.0
The cosine similarity between : mail and : email is: 89.44271909999159
The cosine similarity between : address and : email is: 26.967994498529684
The cosine similarity between : netmail and : email is: 84.51542547285166
The cosine similarity between : email and : address is: 26.967994498529684
The cosine similarity between : mail and : address is: 15.07556722888818
The cosine similarity between : address and : address is: 100.0
The cosine similarity between : netmail and : address is: 22.79211529192759
The cosine similarity between : email and : customer is: 31.62277660168379
The cosine similarity between : mail and : customer is: 17.677669529663685
The cosine similarity between : address and : customer is: 42.640143271122085
The cosine similarity between : netmail and : customer is: 40.08918628686365
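To see where a figure such as 89.44 comes from, the calculation for mail and email can be written out by hand. This is only an illustrative sketch; the helper name char_cosine is hypothetical and not part of the answer above.

```python
from collections import Counter
from math import sqrt

def char_cosine(a, b):
    """Cosine similarity between the character-count vectors of two words."""
    ca, cb = Counter(a), Counter(b)
    # only characters present in both words contribute to the dot product
    dot = sum(ca[ch] * cb[ch] for ch in ca.keys() & cb.keys())
    norm_a = sqrt(sum(n * n for n in ca.values()))
    norm_b = sqrt(sum(n * n for n in cb.values()))
    return dot / (norm_a * norm_b)

# 'mail'  -> m:1, a:1, i:1, l:1        length sqrt(4) = 2
# 'email' -> e:1, m:1, a:1, i:1, l:1   length sqrt(5)
# the shared characters m, a, i, l give a dot product of 4
print(char_cosine("mail", "email"))    # 4 / (2 * sqrt(5)), about 0.894
print(char_cosine("address", "user"))  # about 0.603
```

Because the vectors count characters rather than meaning, unrelated words that happen to share letters (address and user, at about 60%) can still score fairly high.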

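Note: I have also commented out the threshold part of the code, in case you only want the words whose similarity exceeds a certain threshold, i.e. 80%.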
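The threshold idea could be sketched roughly as follows. The function names char_cosine and pairs_above are made up for this illustration, and de-duplicating list_A first is an assumption about what is wanted (it avoids reporting email twice).

```python
from collections import Counter
from math import sqrt

def char_cosine(a, b):
    """Cosine similarity between the character-count vectors of two words."""
    ca, cb = Counter(a), Counter(b)
    dot = sum(ca[ch] * cb[ch] for ch in ca.keys() & cb.keys())
    norm_a = sqrt(sum(n * n for n in ca.values()))
    norm_b = sqrt(sum(n * n for n in cb.values()))
    return dot / (norm_a * norm_b)

def pairs_above(words_a, words_b, threshold=0.80):
    """Return (word_a, word_b, similarity) for every pair scoring above threshold."""
    hits = []
    for a in sorted(set(words_a)):   # assumption: duplicates in words_a are not wanted
        for b in words_b:
            sim = char_cosine(a, b)
            if sim > threshold:
                hits.append((a, b, sim))
    return hits

list_A = ['email', 'user', 'this', 'email', 'address', 'customer']
list_B = ['email', 'mail', 'address', 'netmail']

for a, b, sim in pairs_above(list_A, list_B):
    print("{} ~ {}: {:.1f}%".format(a, b, sim * 100))
```

With the two lists from the question, only the pairs involving email and the exact match on address clear the 80% bar.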

EDIT

OP: But what I want to do is not to compare word by word, but list by list.

Using Counter and math:

from collections import Counter
import math

counterA = Counter(list_A)
counterB = Counter(list_B)


def counter_cosine_similarity(c1, c2):
    terms = set(c1).union(c2)
    dotprod = sum(c1.get(k, 0) * c2.get(k, 0) for k in terms)
    magA = math.sqrt(sum(c1.get(k, 0)**2 for k in terms))
    magB = math.sqrt(sum(c2.get(k, 0)**2 for k in terms))
    return dotprod / (magA * magB)

print(counter_cosine_similarity(counterA, counterB) * 100)

Output

53.03300858899106
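To check this against the question's goal, the same list-level measure can be applied to the third list from the question, ['address','ip','network']. The sketch below repeats the function so that it runs on its own; the names reference, candidates and scores are only illustrative.

```python
from collections import Counter
import math

def counter_cosine_similarity(c1, c2):
    """Cosine similarity between two word-count Counters."""
    terms = set(c1) | set(c2)
    dotprod = sum(c1.get(k, 0) * c2.get(k, 0) for k in terms)
    mag1 = math.sqrt(sum(c1.get(k, 0) ** 2 for k in terms))
    mag2 = math.sqrt(sum(c2.get(k, 0) ** 2 for k in terms))
    return dotprod / (mag1 * mag2)

reference = ['email', 'user', 'this', 'email', 'address', 'customer']
candidates = {
    'list_B': ['email', 'mail', 'address', 'netmail'],
    'list_C': ['address', 'ip', 'network'],
}

# score every candidate list against the reference list, as a percentage
scores = {
    name: counter_cosine_similarity(Counter(reference), Counter(words)) * 100
    for name, words in candidates.items()
}

for name, score in sorted(scores.items(), key=lambda kv: kv[1], reverse=True):
    print("{}: {:.2f}%".format(name, score))
# list_B: 53.03%
# list_C: 20.41%
```

The first list does rank higher, as the question asks. Note that this measure only counts exact word matches, so mail and netmail contribute nothing even though they look related to email.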


KRK*_*rov 5

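You can leverage the power of Scikit-Learn (or another NLP library) to accomplish this. The example below uses CountVectorizer, but for a more sophisticated analysis of documents a TF-IDF vectorizer would be preferable.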

import numpy as np
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

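def vect_cos(vect, test_list):
    """ Vectorise text and compute the cosine similarity """
    # get_feature_names_out() (scikit-learn >= 1.0) replaces get_feature_names(),
    # which newer scikit-learn releases no longer provide
    query_0 = vect.transform([' '.join(vect.get_feature_names_out())])
    query_1 = vect.transform(test_list)
    # cosine_similarity accepts sparse matrices directly, so no dense conversion is needed
    cos_sim = cosine_similarity(query_0, query_1)
    return query_1, np.round(cos_sim.squeeze(), 3)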

# Train the vectorizer
vocab=['email','user','this','email','address','customer']
vectoriser = CountVectorizer().fit(vocab)
vectoriser.vocabulary_ # show the word-matrix position pairs

# Analyse  list_1
list_1 = ['email','mail','address','netmail']
list_1_vect, list_1_cos = vect_cos(vectoriser, [' '.join(list_1)])

# Analyse list_2
list_2 = ['address','ip','network']
list_2_vect, list_2_cos = vect_cos(vectoriser, [' '.join(list_2)])

print('\nThe cosine similarity for the first list is {}.'.format(list_1_cos))
print('\nThe cosine similarity for the second list is {}.'.format(list_2_cos))

Output

The cosine similarity for the first list is 0.632.

The cosine similarity for the second list is 0.447.

EDIT

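If you want to compute the cosine similarity between 'email' and any other list of strings, train the vectorizer on 'email' and then analyse the other documents.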

# Train the vectorizer
vocab=['email']
vectoriser = CountVectorizer().fit(vocab)

# Analyse  list_1
list_1 =['email','mail','address','netmail']
list_1_vect, list_1_cos = vect_cos(vectoriser, [' '.join(list_1)])
print('\nThe cosine similarity for the first list is {}.'.format(list_1_cos))

Output

The cosine similarity for the first list is 1.0.
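As far as I can tell, vect_cos compares an all-ones vector over the fitted vocabulary with the word counts of the test list restricted to that vocabulary. The following pure-Python sketch works under that reading; vocab_cosine is a hypothetical name, and it ignores CountVectorizer's tokenisation details such as lowercasing and dropping single-character tokens.

```python
from collections import Counter
from math import sqrt

def vocab_cosine(vocab, words):
    """Cosine between an all-ones vocabulary vector and the counts of `words`
    restricted to that vocabulary. Words outside the vocabulary are ignored."""
    features = sorted(set(vocab))
    counts = Counter(w for w in words if w in features)
    vec = [counts[f] for f in features]
    norm = sqrt(sum(v * v for v in vec))
    if norm == 0:
        return 0.0   # the list shares no word with the vocabulary
    # dot product with an all-ones vector is just the sum of the counts
    return sum(vec) / (sqrt(len(features)) * norm)

vocab = ['email', 'user', 'this', 'email', 'address', 'customer']
print(round(vocab_cosine(vocab, ['email', 'mail', 'address', 'netmail']), 3))      # 0.632
print(round(vocab_cosine(vocab, ['address', 'ip', 'network']), 3))                 # 0.447
print(round(vocab_cosine(['email'], ['email', 'mail', 'address', 'netmail']), 3))  # 1.0
```

This also shows why the last result is 1.0: with a one-word vocabulary, any list that contains that word maps to a positive one-dimensional vector, so the cosine is 1.0 no matter what else the list contains.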