How to find the cosine similarity between a vector and a matrix

Tal*_*war 6 python numpy cosine-similarity

I have a TF-IDF matrix of shape (149, 1001). I want to compute the cosine similarity of the last column with every other column.

Here is what I did:

from numpy import dot
from numpy.linalg import norm
for i in range(mat.shape[1]-1):
    cos_sim = dot(mat[:,i], mat[:,-1])/(norm(mat[:,i])*norm(mat[:,-1]))
    cos_sim

But the loop makes it slow. Is there a more efficient way? I would like to do this with NumPy only.

Div*_*kar 6

Leverage 2D vectorized matrix-multiplication

Here's one with NumPy using matrix multiplication on 2D data -

p1 = mat[:,-1].dot(mat[:,:-1])
p2 = norm(mat[:,:-1],axis=0)*norm(mat[:,-1])
out1 = p1/p2

Explanation: p1 is the vectorized equivalent of the loop's dot(mat[:,i], mat[:,-1]), and p2 is the vectorized equivalent of (norm(mat[:,i])*norm(mat[:,-1])).
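
As a minimal sketch, the two steps above can be wrapped into a reusable function. The name `cosine_to_last_column` and the handling of all-zero columns are assumptions added here, not part of the original answer: a TF-IDF matrix may contain an all-zero column, for which p2 is 0 and the plain division would produce nan, so this sketch reports 0.0 in that case.

```python
import numpy as np
from numpy.linalg import norm

def cosine_to_last_column(mat):
    """Cosine similarity of mat[:, -1] with every other column of mat.

    Hypothetical helper: the name and the zero-column policy are assumptions.
    """
    target = mat[:, -1]
    others = mat[:, :-1]
    p1 = target.dot(others)                    # all dot products at once, shape (n_cols - 1,)
    p2 = norm(others, axis=0) * norm(target)   # products of norms, shape (n_cols - 1,)
    out = np.zeros_like(p1, dtype=float)
    # Assumption: a zero-norm column gets similarity 0.0 instead of nan.
    np.divide(p1, p2, out=out, where=p2 > 0)
    return out

# Tiny example: the last column is [2, 0]; the second column is all zeros.
mat = np.array([[1.0, 0.0, 0.0, 2.0],
                [0.0, 0.0, 3.0, 0.0]])
sims = cosine_to_last_column(mat)
```

If zero columns cannot occur in your data, the plain `p1/p2` from the answer is enough.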

Sample run for verification -

In [57]: np.random.seed(0)
    ...: mat = np.random.rand(149,1001)

In [58]: out = np.empty(mat.shape[1]-1)
    ...: for i in range(mat.shape[1]-1):
    ...:     out[i] = dot(mat[:,i], mat[:,-1])/(norm(mat[:,i])*norm(mat[:,-1]))

In [59]: p1 = mat[:,-1].dot(mat[:,:-1])
    ...: p2 = norm(mat[:,:-1],axis=0)*norm(mat[:,-1])
    ...: out1 = p1/p2

In [60]: np.allclose(out, out1)
Out[60]: True

Timings -

In [61]: %%timeit
    ...: out = np.empty(mat.shape[1]-1)
    ...: for i in range(mat.shape[1]-1):
    ...:     out[i] = dot(mat[:,i], mat[:,-1])/(norm(mat[:,i])*norm(mat[:,-1]))
18.5 ms ± 977 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

In [62]: %%timeit
    ...: p1 = mat[:,-1].dot(mat[:,:-1])
    ...: p2 = norm(mat[:,:-1],axis=0)*norm(mat[:,-1])
    ...: out1 = p1/p2
939 µs ± 29.2 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

# @yatu's soln
In [89]: a = mat

In [90]: %timeit cosine_similarity(a[None,:,-1] , a.T[:-1])
2.47 ms ± 461 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

Further optimize on norm with einsum

Alternatively, we could compute p2 with np.einsum.

So, norm(mat[:,:-1],axis=0) could be replaced by:

np.sqrt(np.einsum('ij,ij->j',mat[:,:-1],mat[:,:-1]))
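
To make the subscript string concrete, here is a small sketch on a made-up 5x4 array (the array `a` and the variable names are illustrative assumptions). `'ij,ij->j'` multiplies the two operands elementwise and sums over the row index `i`, leaving one value per column `j`, which is the squared column norm. A likely reason it is faster is that it does not need to build the intermediate `a * a` array.

```python
import numpy as np

rng = np.random.default_rng(0)
a = rng.random((5, 4))  # illustrative data, not the TF-IDF matrix

# Elementwise product summed over rows (index i): one squared norm per column j.
sq_norms_einsum = np.einsum('ij,ij->j', a, a)

# The same quantity written out explicitly, with a temporary a * a array.
sq_norms_plain = (a * a).sum(axis=0)

# Reference: column norms as computed by numpy.linalg.norm.
col_norms = np.linalg.norm(a, axis=0)

print(np.allclose(np.sqrt(sq_norms_einsum), col_norms))
```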

This gives us a modified p2:

p2 = np.sqrt(np.einsum('ij,ij->j',mat[:,:-1],mat[:,:-1]))*norm(mat[:,-1])

Timings on the same setup as earlier -

In [82]: %%timeit
    ...: p1 = mat[:,-1].dot(mat[:,:-1])
    ...: p2 = np.sqrt(np.einsum('ij,ij->j',mat[:,:-1],mat[:,:-1]))*norm(mat[:,-1])
    ...: out1 = p1/p2
607 µs ± 132 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

That's a 30x+ speedup over the loopy version!
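
To reproduce the comparison outside IPython, something like the following sketch with the standard-library `timeit` module should work. The function names and the repeat counts are assumptions chosen for illustration, and absolute timings will differ from the figures above depending on hardware and NumPy build.

```python
import timeit
import numpy as np
from numpy.linalg import norm

rng = np.random.default_rng(0)
mat = rng.random((149, 1001))  # same shape as in the question; values are random

def loop_version(mat):
    # Column-by-column loop from the question.
    out = np.empty(mat.shape[1] - 1)
    for i in range(mat.shape[1] - 1):
        out[i] = np.dot(mat[:, i], mat[:, -1]) / (norm(mat[:, i]) * norm(mat[:, -1]))
    return out

def matmul_version(mat):
    # Vectorized dot products plus norm along axis 0.
    p1 = mat[:, -1].dot(mat[:, :-1])
    p2 = norm(mat[:, :-1], axis=0) * norm(mat[:, -1])
    return p1 / p2

def einsum_version(mat):
    # Same as matmul_version, with the column norms computed via einsum.
    others = mat[:, :-1]
    p1 = mat[:, -1].dot(others)
    p2 = np.sqrt(np.einsum('ij,ij->j', others, others)) * norm(mat[:, -1])
    return p1 / p2

timings = {}
for fn in (loop_version, matmul_version, einsum_version):
    # Best of 3 repeats of 10 calls each, converted to seconds per call.
    best = min(timeit.repeat(lambda: fn(mat), number=10, repeat=3)) / 10
    timings[fn.__name__] = best
    print(f"{fn.__name__}: {best * 1e6:.0f} µs per call")
```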