Quickly compute cosine similarity between all rows of a dataframe

Lui*_*uez 2 python nlp numpy linear-algebra pandas

I'm working on an NLP project where I have to compare the similarity between many sentences from a dataframe, e.g.:

(image: sample dataframe of sentences)

The first thing I tried was cross-joining the dataframe with itself to get the format below and comparing row by row:

(image: cross-joined dataframe of sentence pairs)

For medium/large datasets I quickly run out of memory: a cross-join of 10k rows produces 100M rows, which do not fit in RAM.

My current approach is to iterate over the dataframe like this:

import copy
import pandas as pd

final = pd.DataFrame()

### for each row
for i in range(len(df_sample)):

    ### select the vector to compare against
    v = df_sample.loc[i, "use_vector"]
    ### compare all cases against the selected vector
    sims = df_sample.apply(lambda x: cosine_similarity_numba(x.use_vector, v), axis=1)

    ### keep the cases with a similarity over a given threshold, here 0.6
    temp = df_sample[sims > 0.6]
    ### filter out the base case
    temp = temp[~temp.index.isin([i])]
    temp["original_question"] = copy.copy(df_sample.loc[i, "questions"])
    ### append the result
    final = pd.concat([final, temp])

But this approach isn't very fast either. How can I improve the performance of this process?
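For reference, the row-by-row loop above can usually be collapsed into a single matrix product once the vectors are stacked into a 2-D array: normalizing the rows makes a dot product equal to cosine similarity. A minimal sketch, with made-up three-dimensional vectors standing in for the real embeddings:

```python
import numpy as np
import pandas as pd

df_sample = pd.DataFrame({
    "questions": ["q0", "q1", "q2"],
    "use_vector": [[-0.1, -0.2, 0.3], [0.1, -0.2, -0.3], [-0.1, 0.2, -0.3]],
})

# stack the per-row vectors into an (n, d) matrix
M = np.array(df_sample["use_vector"].tolist(), dtype=np.float32)
# L2-normalize each row so a dot product equals cosine similarity
M /= np.linalg.norm(M, axis=1, keepdims=True)
# one matrix product yields the full n x n similarity matrix
sims = M @ M.T
# index pairs (i, j), i != j, above the 0.6 threshold from the question
pairs = np.argwhere((sims > 0.6) & ~np.eye(len(M), dtype=bool))
```

For 10k rows this builds a 10k x 10k float32 matrix (~400 MB) instead of a 100M-row cross-join.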

Ser*_*nov 5

One possible trick you might employ is to switch from a sparse tf-idf representation to dense word embeddings from Facebook's fasttext:

import fasttext
# wget https://dl.fbaipublicfiles.com/fasttext/vectors-crawl/cc.en.300.bin.gz
model = fasttext.load_model("./cc.en.300.bin")

Then you can proceed to compute cosine similarities on the dense sentence embeddings, which are more space-efficient, context-aware, and (arguably) better-performing:

df = pd.DataFrame({"questions":["This is a question",
                                "This is a similar questin",
                                "And this one is absolutely different"]})

df["vecs"] = df["questions"].apply(model.get_sentence_vector)

from scipy.spatial.distance import pdist, squareform
# pdist computes each pair only once: vectorized, no doubled data
# note: metric="cosine" returns cosine *distance*, i.e. 1 - similarity
out = pdist(np.stack(df['vecs']), metric="cosine")
cosine_distance = squareform(out)
print(cosine_distance)
[[0.         0.08294727 0.25305626]
 [0.08294727 0.         0.23575631]
 [0.25305626 0.23575631 0.        ]]

Note as well that, on top of being the most memory-efficient option, this also gets you roughly a 10x speedup thanks to SciPy's vectorized cosine computation. Keep in mind the matrix above holds cosine distances; the similarity is 1 minus each value.
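Since the "cosine" metric yields distances, here is a small sketch (with made-up vectors, not fastText embeddings) of converting back to similarities and applying a threshold as in the question; a lower threshold than 0.6 is used so the toy data actually produces a match:

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform

vecs = np.array([[-0.1, -0.2, 0.3], [0.1, -0.2, -0.3], [-0.1, 0.2, -0.3]])

# pdist with metric="cosine" returns cosine distance = 1 - similarity
dist = squareform(pdist(vecs, metric="cosine"))
sim = 1.0 - dist

# keep index pairs above the threshold, excluding the diagonal (self-similarity)
th = 0.25
pairs = np.argwhere((sim > th) & ~np.eye(len(vecs), dtype=bool))
```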

Another possible trick is to cast your embedding vectors from the default float64 down to float32 or float16:

df["vecs"] = df["vecs"].apply(np.float16)

This will give you both speed and memory gains.
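A quick way to see the memory gain is to compare `nbytes` before and after the cast; the array shape below (10k vectors of dimension 300, matching the fastText model) is illustrative:

```python
import numpy as np

# 10k embedding vectors of dimension 300, default float64
v64 = np.random.default_rng(0).random((10_000, 300))
# downcast to half precision: 2 bytes per value instead of 8
v16 = v64.astype(np.float16)
print(v64.nbytes // 2**20, "MiB ->", v16.nbytes // 2**20, "MiB")
```

float16 only keeps about three significant decimal digits, which is usually enough for ranking similarities but worth keeping in mind.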


Mr.*_*ple 2

I wrote an answer to a question similar to yours just yesterday: Top-K cosine similarity rows in a pandas dataframe.

import numpy as np
import pandas as pd
from sklearn.metrics.pairwise import cosine_similarity

data = {"use_vector": [[-0.1, -0.2, 0.3], [0.1, -0.2, -0.3], [-0.1, 0.2, -0.3]]}
df = pd.DataFrame(data)
print("Data: \n{}\n".format(df))

# stack the per-row vectors into a single (n, d) array
A = np.array(df["use_vector"].tolist())
vectors_num = len(A)
# Get similarities matrix, value for each pair at corresponding index of upper triangle of matrix
similarities = cosine_similarity(A)
# Set symmetrical(repetitive) and diagonal(similarity to self) to -2
similarities[np.tril_indices(vectors_num)] = -2
print("Similarities: \n{}\n".format(similarities))

Output:

Data: 
          use_vector
0  [-0.1, -0.2, 0.3]
1  [0.1, -0.2, -0.3]
2  [-0.1, 0.2, -0.3]

Similarities:
[[-2.         -0.42857143 -0.85714286]  # vector 0 & 1, 2
 [-2.         -2.          0.28571429]  # vector 1 & 2
 [-2.         -2.         -2.        ]]
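To go from the masked matrix to the actual top-K pairs, one option is to argsort the flattened matrix; the -2 sentinel values naturally sort last. A sketch reusing the same toy vectors:

```python
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

A = np.array([[-0.1, -0.2, 0.3], [0.1, -0.2, -0.3], [-0.1, 0.2, -0.3]])
similarities = cosine_similarity(A)
# mask the diagonal and the redundant lower triangle with -2
similarities[np.tril_indices(len(A))] = -2

# take the k largest entries of the flattened matrix (-2 sentinels sort last)
k = 2
flat_order = np.argsort(similarities, axis=None)[::-1][:k]
i, j = np.unravel_index(flat_order, similarities.shape)
top_pairs = list(zip(i.tolist(), j.tolist()))
```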