如何计算非常大的相关矩阵

Question

如何计算非常大的相关矩阵

Gra*_*tty 4 numpy large-data correlation

我有一个 np.array 观测值 z，其中 z.shape 是 (100000, 60)。我想有效地计算 100000x100000 相关矩阵，然后将那些 > 0.95 的元素的坐标和值写入磁盘（这只是总数的一小部分）。

我的暴力版本如下所示，但毫不奇怪，非常慢：

for i1 in range(z.shape[0]):
    for i2 in range(i1+1):
        r = np.corrcoef(z[i1,:],z[i2,:])[0,1]
        if r > 0.95:
            file.write("%6d %6d %.3f\n" % (i1,i2,r))

Run Code Online (Sandbox Code Playgroud)

我意识到，使用 np.corrcoef(z) 在一次操作中可以更有效地计算相关矩阵本身，但内存需求很大。我还知道，可以将数据集分解为块并一次计算相关矩阵的一小部分，但对其进行编程并跟踪索引似乎不必要地复杂。

是否有另一种方法（例如，使用 memmap 或 pytables）既易于编码又不会对物理内存提出过多要求？

Answer 1

Gra*_*tty 5

在尝试了其他人提出的 memmap 解决方案后，我发现虽然它比我原来的方法（在我的 Macbook 上花了大约 4 天）要快，但仍然花费了很长的时间（至少一天）——大概是由于逐个元素写入输出文件的效率低下。考虑到我需要多次运行计算，这是不可接受的。

最后，（对我来说）最好的解决方案是登录 Amazon Web Services EC2 门户，创建一个具有 120+ GiB RAM 的虚拟机实例（从配备 Anaconda Python 的映像开始），上传输入数据文件，并完全在核心内存中进行计算（使用矩阵乘法）。大约两分钟就完成了！

作为参考，我使用的代码基本上是这样的：

import numpy as np
import pickle
import h5py

# read nparray, dimensions (102000, 60)

infile = open(r'file.dat', 'rb')
x = pickle.load(infile)
infile.close()     

# z-normalize the data -- first compute means and standard deviations
xave = np.average(x,axis=1)
xstd = np.std(x,axis=1)

# transpose for the sake of broadcasting (doesn't seem to work otherwise!)
ztrans = x.T - xave
ztrans /= xstd

# transpose back
z = ztrans.T

# compute correlation matrix - shape = (102000, 102000)
arr = np.matmul(z, z.T)   
arr /= z.shape[0]

# output to HDF5 file
with h5py.File('correlation_matrix.h5', 'w') as hf:
        hf.create_dataset("correlation",  data=arr)

Run Code Online (Sandbox Code Playgroud)

归档时间：	7 年，2 月前
查看次数：	7413 次
最近记录：	4 年，6 月前