Tal*_*war 6 python numpy cosine-similarity
I have a TF-IDF matrix of shape (149, 1001). I want to compute the cosine similarity of the last column with each of the other columns.
Here is what I did:
from numpy import dot
from numpy.linalg import norm
for i in range(mat.shape[1]-1):
    cos_sim = dot(mat[:,i], mat[:,-1])/(norm(mat[:,i])*norm(mat[:,-1]))
    cos_sim
But the loop makes this slow. Is there a more efficient way? I want to do it with NumPy only.
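Leverage 2D vectorized matrix multiplication
Here's one with NumPy using matrix multiplication on the 2D data -

p1 = mat[:,-1].dot(mat[:,:-1])
p2 = norm(mat[:,:-1],axis=0)*norm(mat[:,-1])
out1 = p1/p2

Explanation: p1 is the vectorized equivalent of looping over dot(mat[:,i], mat[:,-1]). p2 is the vectorized equivalent of (norm(mat[:,i])*norm(mat[:,-1])).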
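For reuse, the same computation can be wrapped in a function. The sketch below is one possible packaging and is not part of the original answer: the name `cosine_to_last_column` and the handling of all-zero columns (which can occur in TF-IDF data and would otherwise cause a division by zero) are assumptions made for illustration.

```python
import numpy as np

def cosine_to_last_column(mat):
    """Cosine similarity of the last column of `mat` with every other column.

    Hypothetical helper: columns with zero norm get similarity 0 (an assumption).
    """
    mat = np.asarray(mat, dtype=float)
    last = mat[:, -1]
    rest = mat[:, :-1]
    dots = last @ rest  # all dot products at once, shape (n_cols - 1,)
    norms = np.linalg.norm(rest, axis=0) * np.linalg.norm(last)
    out = np.zeros_like(dots)
    # Divide only where the denominator is non-zero; other entries stay 0.
    np.divide(dots, norms, out=out, where=norms > 0)
    return out

# Usage on random data with one all-zero column
rng = np.random.default_rng(0)
demo = rng.random((149, 1001))
demo[:, 3] = 0.0
sims = cosine_to_last_column(demo)
```

Here `sims` has one entry per column other than the last, and the entry for the zeroed column is 0 rather than NaN.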
Sample run for verification -

In [57]: np.random.seed(0)
    ...: mat = np.random.rand(149,1001)

In [58]: out = np.empty(mat.shape[1]-1)
    ...: for i in range(mat.shape[1]-1):
    ...:     out[i] = dot(mat[:,i], mat[:,-1])/(norm(mat[:,i])*norm(mat[:,-1]))

In [59]: p1 = mat[:,-1].dot(mat[:,:-1])
    ...: p2 = norm(mat[:,:-1],axis=0)*norm(mat[:,-1])
    ...: out1 = p1/p2

In [60]: np.allclose(out, out1)
Out[60]: True

Timings -

In [61]: %%timeit
    ...: out = np.empty(mat.shape[1]-1)
    ...: for i in range(mat.shape[1]-1):
    ...:     out[i] = dot(mat[:,i], mat[:,-1])/(norm(mat[:,i])*norm(mat[:,-1]))
18.5 ms ± 977 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

In [62]: %%timeit
    ...: p1 = mat[:,-1].dot(mat[:,:-1])
    ...: p2 = norm(mat[:,:-1],axis=0)*norm(mat[:,-1])
    ...: out1 = p1/p2
939 µs ± 29.2 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

# @yatu's soln
In [89]: a = mat

In [90]: %timeit cosine_similarity(a[None,:,-1] , a.T[:-1])
2.47 ms ± 461 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

Further optimize the norm with einsum
Alternatively, we could compute p2 with np.einsum.
So, norm(mat[:,:-1],axis=0) could be replaced by:

np.sqrt(np.einsum('ij,ij->j',mat[:,:-1],mat[:,:-1]))

Hence, giving us a modified p2:

p2 = np.sqrt(np.einsum('ij,ij->j',mat[:,:-1],mat[:,:-1]))*norm(mat[:,-1])

Timings on the same setup as earlier -

In [82]: %%timeit
    ...: p1 = mat[:,-1].dot(mat[:,:-1])
    ...: p2 = np.sqrt(np.einsum('ij,ij->j',mat[:,:-1],mat[:,:-1]))*norm(mat[:,-1])
    ...: out1 = p1/p2
607 µs ± 132 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

That's a 30x+ speedup over the loopy version!
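To see why the einsum expression can stand in for the column norms, it may help to check it on a tiny matrix. In this sketch, the array `A` is made-up illustration data: the subscripts 'ij,ij->j' multiply the array with itself elementwise and sum over the rows (index i), giving the squared norm of each column.

```python
import numpy as np

# Small made-up matrix: the columns are (3, 4) and (0, 2)
A = np.array([[3.0, 0.0],
              [4.0, 2.0]])

norms_linalg = np.linalg.norm(A, axis=0)             # column-wise Euclidean norms
norms_einsum = np.sqrt(np.einsum('ij,ij->j', A, A))  # sum of squares per column, then sqrt

print(norms_linalg)  # [5. 2.]
print(norms_einsum)  # [5. 2.]
```

Both lines print the same norms. Any speed difference between the two forms depends on array size and the NumPy build, so it is worth timing on your own data.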