Kag*_*sen 6 python numpy scipy sparse-matrix
我有一个大的稀疏矩阵 - 使用scipy的sparse.csr_matrix.值是二进制的.对于每一行,我需要计算相同矩阵中每行的Jaccard距离.最有效的方法是什么?即使对于10.000 x 10.000矩阵,我的运行时间也需要几分钟才能完成.
当前解决方案
def jaccard(a, b):
intersection = float(len(set(a) & set(b)))
union = float(len(set(a) | set(b)))
return 1.0 - (intersection/union)
def regions(csr, p, epsilon):
neighbors = []
for index in range(len(csr.indptr)-1):
if jaccard(p, csr.indices[csr.indptr[index]:csr.indptr[index+1]]) <= epsilon:
neighbors.append(index)
return neighbors
csr = scipy.sparse.csr_matrix("file")
regions(csr, 0.51) #this is called for every row
Run Code Online (Sandbox Code Playgroud)
小智 13
如果使用矩阵乘法计算集合交叉点,然后使用规则|union(a, b)| == |a| + |b| - |intersection(a, b)|来确定联合,则矢量化相对容易:
# Not actually necessary for sparse matrices, but it is for
# dense matrices and ndarrays, if X.dtype is integer.
from __future__ import division
def pairwise_jaccard(X):
"""Computes the Jaccard distance between the rows of `X`.
"""
X = X.astype(bool).astype(int)
intrsct = X.dot(X.T)
row_sums = intrsct.diagonal()
unions = row_sums[:,None] + row_sums - intrsct
dist = 1.0 - intrsct / unions
return dist
Run Code Online (Sandbox Code Playgroud)
注意转换为bool然后是int,因为dtype X必须足够大以累积最大行总和的两倍,并且条目X必须为零或一.这段代码的缺点是RAM很重,因为unions它dists是密集的矩阵.
如果您只对距离小于某些截止值的距离感兴趣epsilon,则可以针对稀疏矩阵调整代码:
from scipy.sparse import csr_matrix
def pairwise_jaccard_sparse(csr, epsilon):
"""Computes the Jaccard distance between the rows of `csr`,
smaller than the cut-off distance `epsilon`.
"""
assert(0 < epsilon < 1)
csr = csr_matrix(csr).astype(bool).astype(int)
csr_rownnz = csr.getnnz(axis=1)
intrsct = csr.dot(csr.T)
nnz_i = np.repeat(csr_rownnz, intrsct.getnnz(axis=1))
unions = nnz_i + csr_rownnz[intrsct.indices] - intrsct.data
dists = 1.0 - intrsct.data / unions
mask = (dists > 0) & (dists <= epsilon)
data = dists[mask]
indices = intrsct.indices[mask]
rownnz = np.add.reduceat(mask, intrsct.indptr[:-1])
indptr = np.r_[0, np.cumsum(rownnz)]
out = csr_matrix((data, indices, indptr), intrsct.shape)
return out
Run Code Online (Sandbox Code Playgroud)
如果这仍然需要很多RAM,你可以尝试在一个维度上进行向量化,而在另一个维度上进行Python循环.
| 归档时间: |
|
| 查看次数: |
4292 次 |
| 最近记录: |