Tal*_*war 6 python numpy cosine-similarity
I have a TF-IDF matrix of shape (149, 1001). What I want is to compute the cosine similarity of the last column with every other column.
Here is what I did:
from numpy import dot
from numpy.linalg import norm

# mat is the (149, 1001) TF-IDF matrix
for i in range(mat.shape[1]-1):
    # cosine similarity between column i and the last column
    cos_sim = dot(mat[:,i], mat[:,-1])/(norm(mat[:,i])*norm(mat[:,-1]))
    cos_sim
But the loop makes this slow. Is there a more efficient way? I want to do it with NumPy only.
Leverage 2D vectorized matrix-multiplication
Here's one with NumPy using matrix-multiplication on 2D data -
p1 = mat[:,-1].dot(mat[:,:-1])
p2 = norm(mat[:,:-1],axis=0)*norm(mat[:,-1])
out1 = p1/p2
Explanation: p1 is the vectorized equivalent of looping over dot(mat[:,i], mat[:,-1]). p2 is the vectorized equivalent of (norm(mat[:,i])*norm(mat[:,-1])).
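To make the mapping between p1, p2 and the original loop concrete, here is a minimal sketch that wraps the approach in a function. The function name `cosine_sim_to_last_column` and the small random `demo` array are hypothetical, introduced only for illustration; the sketch also assumes no column is entirely zero.

```python
import numpy as np
from numpy.linalg import norm


def cosine_sim_to_last_column(mat):
    """Cosine similarity of every column except the last against the last column.

    Hypothetical helper name; assumes no column of `mat` is all zeros.
    """
    last = mat[:, -1]    # reference column, shape (n_rows,)
    rest = mat[:, :-1]   # remaining columns, shape (n_rows, n_cols - 1)
    p1 = last.dot(rest)  # all dot products from one vector-matrix product
    p2 = norm(rest, axis=0) * norm(last)  # per-column product of the two norms
    return p1 / p2


# Small stand-in for the TF-IDF matrix (illustrative data only)
rng = np.random.default_rng(0)
demo = rng.random((5, 4))
sims = cosine_sim_to_last_column(demo)  # one similarity per non-last column
```

The result has one entry per column other than the last, in column order, so `sims[i]` corresponds to the value the loop computes at iteration `i`.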
Sample run for verification -
In [57]: np.random.seed(0)
    ...: mat = np.random.rand(149,1001)

In [58]: out = np.empty(mat.shape[1]-1)
    ...: for i in range(mat.shape[1]-1):
    ...:     out[i] = dot(mat[:,i], mat[:,-1])/(norm(mat[:,i])*norm(mat[:,-1]))

In [59]: p1 = mat[:,-1].dot(mat[:,:-1])
    ...: p2 = norm(mat[:,:-1],axis=0)*norm(mat[:,-1])
    ...: out1 = p1/p2

In [60]: np.allclose(out, out1)
Out[60]: True
Timings -
In [61]: %%timeit
    ...: out = np.empty(mat.shape[1]-1)
    ...: for i in range(mat.shape[1]-1):
    ...:     out[i] = dot(mat[:,i], mat[:,-1])/(norm(mat[:,i])*norm(mat[:,-1]))
18.5 ms ± 977 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

In [62]: %%timeit
    ...: p1 = mat[:,-1].dot(mat[:,:-1])
    ...: p2 = norm(mat[:,:-1],axis=0)*norm(mat[:,-1])
    ...: out1 = p1/p2
939 µs ± 29.2 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

# @yatu's soln
In [89]: a = mat

In [90]: %timeit cosine_similarity(a[None,:,-1] , a.T[:-1])
2.47 ms ± 461 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
Further optimize on norm with einsum
Alternatively, we could compute p2 with np.einsum.
So, norm(mat[:,:-1],axis=0) could be replaced by:
np.sqrt(np.einsum('ij,ij->j',mat[:,:-1],mat[:,:-1]))
Hence, giving us a modified p2:
p2 = np.sqrt(np.einsum('ij,ij->j',mat[:,:-1],mat[:,:-1]))*norm(mat[:,-1])
Timings on the same setup as earlier -
In [82]: %%timeit
    ...: p1 = mat[:,-1].dot(mat[:,:-1])
    ...: p2 = np.sqrt(np.einsum('ij,ij->j',mat[:,:-1],mat[:,:-1]))*norm(mat[:,-1])
    ...: out1 = p1/p2
607 µs ± 132 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
That is a 30x+ speedup over the loopy version (18.5 ms vs. 607 µs)!
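One thing the timings do not cover: a TF-IDF matrix may contain all-zero columns, in which case p2 is zero for those columns and the division yields nan with a runtime warning. Below is a minimal sketch of the einsum variant with a guard for that case. The function name `cosine_sim_to_last_column_safe` is hypothetical, and returning 0.0 where the similarity is undefined is an assumed convention, not something taken from the question.

```python
import numpy as np


def cosine_sim_to_last_column_safe(mat):
    """einsum-based variant; returns 0.0 where a norm is zero (assumed convention)."""
    last = mat[:, -1]
    rest = mat[:, :-1]
    p1 = last.dot(rest)
    # 'ij,ij->j' multiplies elementwise and sums over rows: squared column norms
    col_norms = np.sqrt(np.einsum('ij,ij->j', rest, rest))
    p2 = col_norms * np.sqrt(last.dot(last))
    out = np.zeros_like(p1, dtype=float)
    # Divide only where the denominator is non-zero; other entries stay 0.0
    np.divide(p1, p2, out=out, where=p2 > 0)
    return out


# Tiny illustrative matrix: the second column is all zeros
demo = np.array([[1.0, 0.0, 2.0],
                 [0.0, 0.0, 2.0]])
result = cosine_sim_to_last_column_safe(demo)
```

Here the first column gets a similarity of about 0.707 with the last column, and the all-zero second column gets 0.0 instead of nan. The guard adds a little overhead, so it is only worth using if zero columns can actually occur in the data.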