计算稀疏矩阵上的Jaccard距离

Kag*_*sen 6 python numpy scipy sparse-matrix

我有一个大的稀疏矩阵 - 使用scipy的sparse.csr_matrix.值是二进制的.对于每一行,我需要计算相同矩阵中每行的Jaccard距离.最有效的方法是什么?即使对于10.000 x 10.000矩阵,我的运行时间也需要几分钟才能完成.

当前解决方案

def jaccard(a, b):
    intersection = float(len(set(a) & set(b)))
    union = float(len(set(a) | set(b)))
    return 1.0 - (intersection/union)

def regions(csr, p, epsilon):
    neighbors = []
    for index in range(len(csr.indptr)-1):
        if jaccard(p, csr.indices[csr.indptr[index]:csr.indptr[index+1]]) <= epsilon:
            neighbors.append(index)
    return neighbors
csr = scipy.sparse.csr_matrix("file")
regions(csr, 0.51) #this is called for every row
Run Code Online (Sandbox Code Playgroud)

小智 13

如果使用矩阵乘法计算集合交叉点,然后使用规则|union(a, b)| == |a| + |b| - |intersection(a, b)|来确定联合,则矢量化相对容易:

# Not actually necessary for sparse matrices, but it is for 
# dense matrices and ndarrays, if X.dtype is integer.
from __future__ import division

def pairwise_jaccard(X):
    """Computes the Jaccard distance between the rows of `X`.
    """
    X = X.astype(bool).astype(int)

    intrsct = X.dot(X.T)
    row_sums = intrsct.diagonal()
    unions = row_sums[:,None] + row_sums - intrsct
    dist = 1.0 - intrsct / unions
    return dist
Run Code Online (Sandbox Code Playgroud)

注意转换为bool然后是int,因为dtype X必须足够大以累积最大行总和的两倍,并且条目X必须为零或一.这段代码的缺点是RAM很重,因为unionsdists是密集的矩阵.

如果您只对距离小于某些截止值的距离感兴趣epsilon,则可以针对稀疏矩阵调整代码:

from scipy.sparse import csr_matrix

def pairwise_jaccard_sparse(csr, epsilon):
    """Computes the Jaccard distance between the rows of `csr`,
    smaller than the cut-off distance `epsilon`.
    """
    assert(0 < epsilon < 1)
    csr = csr_matrix(csr).astype(bool).astype(int)

    csr_rownnz = csr.getnnz(axis=1)
    intrsct = csr.dot(csr.T)

    nnz_i = np.repeat(csr_rownnz, intrsct.getnnz(axis=1))
    unions = nnz_i + csr_rownnz[intrsct.indices] - intrsct.data
    dists = 1.0 - intrsct.data / unions

    mask = (dists > 0) & (dists <= epsilon)
    data = dists[mask]
    indices = intrsct.indices[mask]

    rownnz = np.add.reduceat(mask, intrsct.indptr[:-1])
    indptr = np.r_[0, np.cumsum(rownnz)]

    out = csr_matrix((data, indices, indptr), intrsct.shape)
    return out
Run Code Online (Sandbox Code Playgroud)

如果这仍然需要很多RAM,你可以尝试在一个维度上进行向量化,而在另一个维度上进行Python循环.