What is the fastest way in Python to calculate cosine similarity given sparse matrix data?

zbi*_*nsd 52 python numpy similarity pandas cosine-similarity

Given a sparse matrix listing, what's the best way to calculate the cosine similarity between each of the columns (or rows) in the matrix? I would rather not iterate n-choose-two times.

Say the input matrix is:

A= 
[0 1 0 0 1
 0 0 1 1 1
 1 1 0 1 0]

The sparse representation is:

A = 
0, 1
0, 4
1, 2
1, 3
1, 4
2, 0
2, 1
2, 3

In Python, it's straightforward with the dense matrix input format:

import numpy as np
from sklearn.metrics import pairwise_distances
from scipy.spatial.distance import cosine

A = np.array(
[[0, 1, 0, 0, 1],
[0, 0, 1, 1, 1],
[1, 1, 0, 1, 0]])

dist_out = 1 - pairwise_distances(A, metric="cosine")
dist_out

which gives:

array([[ 1.        ,  0.40824829,  0.40824829],
       [ 0.40824829,  1.        ,  0.33333333],
       [ 0.40824829,  0.33333333,  1.        ]])

This is fine for a full-matrix input, but I really want to start with the sparse representation (due to the size and sparsity of my matrix). Any ideas about how this could best be accomplished? Thanks in advance.
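For reference, building the sparse matrix from the (row, col) pairs listed above is straightforward with scipy.sparse (a minimal sketch; the shape argument is taken from the dense matrix in the question):

```python
import numpy as np
from scipy import sparse

# (row, col) pairs from the sparse representation above
rows = np.array([0, 0, 1, 1, 1, 2, 2, 2])
cols = np.array([1, 4, 2, 3, 4, 0, 1, 3])
data = np.ones(len(rows))

# build a COO matrix and convert to CSR for fast row operations
A_sparse = sparse.coo_matrix((data, (rows, cols)), shape=(3, 5)).tocsr()
print(A_sparse.toarray())
```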

Jef*_*ff 53

You can compute pairwise cosine similarity on the rows of a sparse matrix directly using sklearn. As of version 0.17, it also supports sparse output:

import numpy as np
from sklearn.metrics.pairwise import cosine_similarity
from scipy import sparse

A = np.array([[0, 1, 0, 0, 1], [0, 0, 1, 1, 1], [1, 1, 0, 1, 0]])
A_sparse = sparse.csr_matrix(A)

similarities = cosine_similarity(A_sparse)
print('pairwise dense output:\n {}\n'.format(similarities))

# can also output sparse matrices
similarities_sparse = cosine_similarity(A_sparse, dense_output=False)
print('pairwise sparse output:\n {}\n'.format(similarities_sparse))

Results:

pairwise dense output:
[[ 1.          0.40824829  0.40824829]
[ 0.40824829  1.          0.33333333]
[ 0.40824829  0.33333333  1.        ]]

pairwise sparse output:
(0, 1)  0.408248290464
(0, 2)  0.408248290464
(0, 0)  1.0
(1, 0)  0.408248290464
(1, 2)  0.333333333333
(1, 1)  1.0
(2, 1)  0.333333333333
(2, 0)  0.408248290464
(2, 2)  1.0

If you want column-wise cosine similarities, simply transpose your input matrix beforehand:

A_sparse.transpose()
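For example, on the matrix from the question, this yields a 5x5 column-by-column similarity matrix (a small sketch):

```python
import numpy as np
from scipy import sparse
from sklearn.metrics.pairwise import cosine_similarity

A_sparse = sparse.csr_matrix(np.array([[0, 1, 0, 0, 1],
                                       [0, 0, 1, 1, 1],
                                       [1, 1, 0, 1, 0]]))

# transposing first makes cosine_similarity compare columns instead of rows
col_sim = cosine_similarity(A_sparse.transpose())
print(col_sim.shape)  # (5, 5): one entry per pair of columns
```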

  • Are both of your input arrays sparse? From the docs: "If False, the output is sparse if both input arrays are sparse." (2 upvotes)

Way*_*inn 40

The following method is about 30 times faster than scipy.spatial.distance.pdist. It works pretty quickly on large matrices (assuming you have enough RAM).

See below for a discussion of how to optimize for sparsity.

import numpy

# base similarity matrix (all dot products)
# replace this with A.dot(A.T).toarray() for sparse representation
similarity = numpy.dot(A, A.T)


# squared magnitude of preference vectors (number of occurrences)
square_mag = numpy.diag(similarity)

# inverse squared magnitude
inv_square_mag = 1 / square_mag

# if it doesn't occur, set its inverse magnitude to zero (instead of inf)
inv_square_mag[numpy.isinf(inv_square_mag)] = 0

# inverse of the magnitude
inv_mag = numpy.sqrt(inv_square_mag)

# cosine similarity (elementwise multiply by inverse magnitudes)
cosine = similarity * inv_mag
cosine = cosine.T * inv_mag

If your problem is typical for large-scale binary preference problems, you have a lot more entries in one dimension than the other. Also, the short dimension is the one whose entries you want to calculate similarities between. Let's call this dimension the "item" dimension.

If this is the case, list your "items" in rows and create A using scipy.sparse. Then replace the first line as indicated.

If your problem is atypical, you'll need more modifications. These should be fairly straightforward substitutions of scipy.sparse equivalents for the basic numpy operations.
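As a sketch of what those substitutions might look like (assuming a CSR input; this mirrors the dense recipe above step by step, and the function name is illustrative):

```python
import numpy as np
from scipy import sparse

def sparse_cosine(A):
    # all dot products between rows; densify the small result matrix
    similarity = A.dot(A.T).toarray()

    # squared magnitudes sit on the diagonal of the product
    square_mag = np.diag(similarity)

    # inverse squared magnitude; guard empty rows against divide-by-zero
    with np.errstate(divide='ignore'):
        inv_square_mag = 1.0 / square_mag
    inv_square_mag[np.isinf(inv_square_mag)] = 0

    # inverse of the magnitude
    inv_mag = np.sqrt(inv_square_mag)

    # scale rows, then columns, by the inverse magnitudes
    cosine = similarity * inv_mag
    return cosine.T * inv_mag

A = sparse.csr_matrix(np.array([[0, 1, 0, 0, 1],
                                [0, 0, 1, 1, 1],
                                [1, 1, 0, 1, 0]], dtype=float))
print(sparse_cosine(A))
```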


小智 11

I tried some of the methods above. However, the experiment by @zbinsd has its limitations: the sparsity of the matrix used in that experiment is extremely low, while real sparsity is usually over 90%. In my case, the sparse matrix has a shape of (7000, 25000) and a sparsity of 97%. Method 4 is so slow that I couldn't wait for the result, while method 6 finishes in 10 seconds. Amazingly, the method below takes only 0.247 seconds.

import sklearn.preprocessing as pp

def cosine_similarities(mat):
    col_normed_mat = pp.normalize(mat.tocsc(), axis=0)
    return col_normed_mat.T * col_normed_mat

This efficient method is linked via enter link description here
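As a quick sanity check of the snippet above (a sketch; it compares the column-wise result against sklearn's cosine_similarity on the small matrix from the question):

```python
import numpy as np
import sklearn.preprocessing as pp
from scipy import sparse
from sklearn.metrics.pairwise import cosine_similarity

def cosine_similarities(mat):
    # L2-normalize each column, then a single sparse matmul
    # gives all pairwise column cosine similarities
    col_normed_mat = pp.normalize(mat.tocsc(), axis=0)
    return col_normed_mat.T * col_normed_mat

A = sparse.csr_matrix(np.array([[0, 1, 0, 0, 1],
                                [0, 0, 1, 1, 1],
                                [1, 1, 0, 1, 0]], dtype=float))

# column-wise similarities from the normalization trick...
fast = cosine_similarities(A).toarray()
# ...should match sklearn run on the transposed input
ref = cosine_similarity(A.T)
print(np.allclose(fast, ref))
```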


zbi*_*nsd 5

I took all of these answers and wrote a script to 1. validate each of the results (see the assertions below) and 2. see which is the fastest. The code and results are below:

# Imports
import numpy as np
import scipy.sparse as sp
from scipy.spatial.distance import squareform, pdist
from sklearn.metrics.pairwise import linear_kernel
from sklearn.preprocessing import normalize
from sklearn.metrics.pairwise import cosine_similarity

# Create an adjacency matrix
np.random.seed(42)
A = np.random.randint(0, 2, (10000, 100)).astype(float).T

# Make it sparse
rows, cols = np.where(A)
data = np.ones(len(rows))
Asp = sp.csr_matrix((data, (rows, cols)), shape = (rows.max()+1, cols.max()+1))

print("Input data shape:", Asp.shape)

# Define a function to calculate the cosine similarities a few different ways
def calc_sim(A, method=1):
    if method == 1:
        return 1 - squareform(pdist(A, metric='cosine'))
    if method == 2:
        Anorm = A / np.linalg.norm(A, axis=-1)[:, np.newaxis]
        return np.dot(Anorm, Anorm.T)
    if method == 3:
        Anorm = A / np.linalg.norm(A, axis=-1)[:, np.newaxis]
        return linear_kernel(Anorm)
    if method == 4:
        similarity = np.dot(A, A.T)

        # squared magnitude of preference vectors (number of occurrences)
        square_mag = np.diag(similarity)

        # inverse squared magnitude
        inv_square_mag = 1 / square_mag

        # if it doesn't occur, set its inverse magnitude to zero (instead of inf)
        inv_square_mag[np.isinf(inv_square_mag)] = 0

        # inverse of the magnitude
        inv_mag = np.sqrt(inv_square_mag)

        # cosine similarity (elementwise multiply by inverse magnitudes)
        cosine = similarity * inv_mag
        return cosine.T * inv_mag
    if method == 5:
        '''
        Just a version of method 4 that takes in sparse arrays
        '''
        similarity = A*A.T
        square_mag = np.array(A.sum(axis=1))
        # inverse squared magnitude
        inv_square_mag = 1 / square_mag

        # if it doesn't occur, set its inverse magnitude to zero (instead of inf)
        inv_square_mag[np.isinf(inv_square_mag)] = 0

        # inverse of the magnitude
        inv_mag = np.sqrt(inv_square_mag).T

        # cosine similarity (elementwise multiply by inverse magnitudes)
        cosine = np.array(similarity.multiply(inv_mag))
        return cosine * inv_mag.T
    if method == 6:
        return cosine_similarity(A)

# Assert that all results are consistent with the first model ("truth")
for m in range(1, 7):
    if m in [5]: # The sparse case
        np.testing.assert_allclose(calc_sim(A, method=1), calc_sim(Asp, method=m))
    else:
        np.testing.assert_allclose(calc_sim(A, method=1), calc_sim(A, method=m))

# Time them:
print("Method 1")
%timeit calc_sim(A, method=1)
print("Method 2")
%timeit calc_sim(A, method=2)
print("Method 3")
%timeit calc_sim(A, method=3)
print("Method 4")
%timeit calc_sim(A, method=4)
print("Method 5")
%timeit calc_sim(Asp, method=5)
print("Method 6")
%timeit calc_sim(A, method=6)

Results:

Input data shape: (100, 10000)
Method 1
10 loops, best of 3: 71.3 ms per loop
Method 2
100 loops, best of 3: 8.2 ms per loop
Method 3
100 loops, best of 3: 8.6 ms per loop
Method 4
100 loops, best of 3: 2.54 ms per loop
Method 5
10 loops, best of 3: 73.7 ms per loop
Method 6
10 loops, best of 3: 77.3 ms per loop