Correlation coefficients for a sparse matrix in Python?

use*_*672 11 python numpy scipy sparse-matrix correlation

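Does anyone know how to compute a correlation matrix from a very large sparse matrix in Python? Basically, I am looking for something like numpy.corrcoef that works on a scipy sparse matrix.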

ali*_*i_m 9

You can compute the correlation coefficients fairly directly from the covariance matrix, like so:

import numpy as np
from scipy import sparse

def sparse_corrcoef(A, B=None):

    if B is not None:
        A = sparse.vstack((A, B), format='csr')

    A = A.astype(np.float64)
    n = A.shape[1]

    # Compute the covariance matrix
    rowsum = A.sum(1)
    centering = rowsum.dot(rowsum.T.conjugate()) / n
    C = (A.dot(A.T.conjugate()) - centering) / (n - 1)

    # The correlation coefficients are given by
    # C_{i,j} / sqrt(C_{i,i} * C_{j,j})
    d = np.diag(C)
    coeffs = C / np.sqrt(np.outer(d, d))

    return coeffs

Check that it works correctly:

# some smallish sparse random matrices
a = sparse.rand(100, 100000, density=0.1, format='csr')
b = sparse.rand(100, 100000, density=0.1, format='csr')

coeffs1 = sparse_corrcoef(a, b)
coeffs2 = np.corrcoef(a.todense(), b.todense())

print(np.allclose(coeffs1, coeffs2))
# True

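Be warned:

The amount of memory required to compute the covariance matrix C depends heavily on the sparsity structure of A (and B, if given). For example, if A is an (m, n) matrix containing just a single column of non-zero values, then C will be an (m, m) matrix in which every entry is non-zero. If m is large, this could be very bad news in terms of memory consumption.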
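To make the warning concrete, here is a minimal sketch of that worst case. The sizes `m` and `n` and the variable names are illustrative assumptions, not part of the original answer; the snippet only inspects the product `A.dot(A.T)` that `sparse_corrcoef` computes internally.

```python
import numpy as np
from scipy import sparse

m, n = 500, 5000  # illustrative sizes, chosen only for this sketch

# Worst case: a single fully populated column, everything else zero.
rows = np.arange(m)
cols = np.zeros(m, dtype=int)
vals = np.ones(m)
A = sparse.csr_matrix((vals, (rows, cols)), shape=(m, n))

# The product computed inside the covariance step, shape (m, m).
G = A.dot(A.T)

density_A = A.nnz / (m * n)  # fraction of stored entries in the input
density_G = G.nnz / (m * m)  # fraction of stored entries in the product
print(density_A, density_G)
```

The input stores only one entry per row, yet every entry of the product is non-zero, so its storage grows with the square of the number of rows.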

  • Unless the data is already centered, "A = A - A.mean(1)" will destroy any sparsity. You might as well convert to dense first! (2 upvotes)
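A small sketch of why explicit centering is avoided: subtracting the row means turns every implicit zero into a non-zero value. The matrix size, density, and seed below are arbitrary assumptions, and the subtraction is done on a dense copy to keep the example unambiguous.

```python
import numpy as np
from scipy import sparse

# Arbitrary illustrative input: 1% of the entries are stored.
A = sparse.random(50, 2000, density=0.01, format='csr', random_state=0)

# Row means as a column vector of shape (50, 1).
row_means = np.asarray(A.mean(axis=1)).reshape(-1, 1)

# Centering: every implicit zero becomes -mean, so the result is dense.
centered = A.toarray() - row_means

frac_nonzero_before = A.nnz / (A.shape[0] * A.shape[1])
frac_nonzero_after = np.count_nonzero(centered) / centered.size
print(frac_nonzero_before, frac_nonzero_after)
```

This is why the answer above works with the uncentered product and subtracts the outer product of the row sums afterwards, rather than centering A itself.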

Alt*_*Alt 8

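Just use numpy:

import numpy as np
# A is a scipy sparse matrix of shape (N, p); the columns are the variables
N = A.shape[0]
s = np.asarray(A.sum(axis=0)).ravel()                      # column sums
C = ((A.T @ A).toarray() - np.outer(s, s) / N) / (N - 1)   # dense covariance
V = np.sqrt(np.outer(np.diag(C), np.diag(C)))
CORR = np.divide(C, V + 1e-119)                            # tiny offset guards against 0/0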

  • This is a nice answer. It produces a dense covariance matrix, but it never alters the sparsity pattern of the input matrix. (2 upvotes)
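The column-wise approach can be checked against numpy in the same way as the first answer. The following is a sketch under assumptions: the helper name `sparse_corrcoef_columns` and the test matrix are hypothetical, and the helper follows the same formula as the snippet above without the small offset in the denominator.

```python
import numpy as np
from scipy import sparse

def sparse_corrcoef_columns(A):
    """Correlation between the columns of a sparse (N, p) matrix (illustrative helper)."""
    N = A.shape[0]
    s = np.asarray(A.sum(axis=0)).ravel()        # column sums, shape (p,)
    gram = (A.T @ A).toarray()                   # dense (p, p) Gram matrix
    C = (gram - np.outer(s, s) / N) / (N - 1)    # covariance of the columns
    d = np.sqrt(np.diag(C))                      # per-column standard deviations
    return C / np.outer(d, d)

# Arbitrary test input: 2000 observations of 30 sparse variables.
A = sparse.random(2000, 30, density=0.1, format='csr', random_state=0)
corr = sparse_corrcoef_columns(A)

# Compare with numpy, treating columns as variables.
print(np.allclose(corr, np.corrcoef(A.toarray(), rowvar=False)))
```

Note that the result is a dense (p, p) array, so this is only practical when the number of columns is moderate.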