Save / load scipy sparse csr_matrix in portable data format

Hen*_*ton 76 python numpy scipy

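How do you save/load a scipy sparse csr_matrix in a portable format? The scipy sparse matrix is created on Python 3 (Windows 64-bit) to run on Python 2 (Linux 64-bit). Initially, I used pickle (with protocol=2 and fix_imports=True), but this didn't work going from Python 3.2.2 (Windows 64-bit) to Python 2.7.2 (Windows 32-bit) and gave the error: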

TypeError: ('data type not understood', <built-in function _reconstruct>, (<type 'numpy.ndarray'>, (0,), '[98]')).

Next, I tried numpy.save and numpy.load, as well as scipy.io.mmwrite() and scipy.io.mmread(), but none of these methods worked either.

Hen*_*ton 106

Edit: SciPy 0.19 now has scipy.sparse.save_npz and scipy.sparse.load_npz.

from scipy import sparse

sparse.save_npz("yourmatrix.npz", your_matrix)
your_matrix_back = sparse.load_npz("yourmatrix.npz")

For both functions, the file argument may also be a file-like object (i.e. the result of open) instead of a filename.
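As a minimal sketch of the file-like variant, the snippet below round-trips a small matrix through an in-memory io.BytesIO buffer instead of a file on disk. The matrix values and variable names are made up for illustration.

```python
import io

import numpy as np
from scipy import sparse

# A small illustrative matrix (values are arbitrary).
matrix = sparse.csr_matrix(np.array([[0, 0, 3], [4, 0, 0]]))

# Write the matrix into an in-memory binary buffer.
buffer = io.BytesIO()
sparse.save_npz(buffer, matrix)

# Rewind so that load_npz reads from the start of the buffer.
buffer.seek(0)
restored = sparse.load_npz(buffer)
```

The same pattern should apply to a file opened in binary mode with open(path, "wb") or open(path, "rb").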


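Got an answer from the Scipy user group:

A csr_matrix has 3 data attributes that matter: .data, .indices, and .indptr. All are simple ndarrays, so numpy.save will work on them. Save the three arrays with numpy.save or numpy.savez, load them back with numpy.load, and then recreate the sparse matrix object with: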

new_csr = csr_matrix((data, indices, indptr), shape=(M, N))

For example:

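import numpy as np
from scipy.sparse import csr_matrix

def save_sparse_csr(filename, array):
    # Store the three CSR arrays plus the shape in a single .npz archive.
    np.savez(filename, data=array.data, indices=array.indices,
             indptr=array.indptr, shape=array.shape)

def load_sparse_csr(filename):
    # Rebuild the matrix from the saved arrays.
    loader = np.load(filename)
    return csr_matrix((loader['data'], loader['indices'], loader['indptr']),
                      shape=loader['shape'])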
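
To make the three attributes concrete, here is a small worked round trip, offered as a sketch. The 3x3 matrix, the temporary directory and the file name demo.npz are chosen only for illustration.

```python
import os
import tempfile

import numpy as np
from scipy.sparse import csr_matrix

m = csr_matrix(np.array([[0, 0, 0], [1, 0, 0], [0, 1, 0]]))
# For this matrix:
#   m.data    holds the stored values:               [1 1]
#   m.indices holds their column indices:            [0 1]
#   m.indptr  holds the row boundaries into data:    [0 0 1 2]

# Save the three arrays and the shape in one archive (illustrative path).
path = os.path.join(tempfile.mkdtemp(), "demo.npz")
np.savez(path, data=m.data, indices=m.indices, indptr=m.indptr, shape=m.shape)

# Load them back and rebuild the CSR matrix.
with np.load(path) as loader:
    rebuilt = csr_matrix(
        (loader["data"], loader["indices"], loader["indptr"]),
        shape=tuple(loader["shape"]),
    )
```

Passing the shape as a plain tuple is a defensive choice in this sketch; the original example passes the loaded array directly.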

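  • Scipy 0.19 now has `scipy.sparse.save_npz` and `load_npz`. (11 upvotes)
  • Note: if the filename passed to save_sparse_csr lacks the .npz extension, it is added automatically. This is not done automatically in the load_sparse_csr function. (6 upvotes)
  • Any idea whether there is some reason this isn't implemented as a method on the sparse matrix object? The scipy.io.savemat method seems to work reliably enough... (3 upvotes)
  • How would this example work for a lil_matrix? (3 upvotes)
  • @hpaulj It might be useful for new users to correct the answer: the version is scipy 0.19 (3 upvotes)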
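
Regarding the lil_matrix question above: one possible approach (a sketch, not something confirmed in this thread) is to convert to CSR before saving and back to LIL after loading, using the tocsr() and tolil() conversion methods. It assumes scipy 0.19 or later for save_npz/load_npz; the file name and matrix values are hypothetical.

```python
import os
import tempfile

from scipy import sparse

# Build a small LIL matrix (values are arbitrary).
lil = sparse.lil_matrix((3, 4))
lil[0, 1] = 2.0
lil[2, 3] = 5.0

# Convert to CSR for storage, then back to LIL after loading.
path = os.path.join(tempfile.mkdtemp(), "lil_demo.npz")
sparse.save_npz(path, lil.tocsr())
lil_back = sparse.load_npz(path).tolil()
```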

Fra*_*kow 36

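Although you write that scipy.io.mmwrite and scipy.io.mmread don't work for you, I just want to add how they work. This question is the no. 1 Google hit, so I started with np.savez and pickle.dump myself before switching to the simple and obvious scipy functions. They work for me and shouldn't be overlooked by those who haven't tried them yet.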

from scipy import sparse, io

m = sparse.csr_matrix([[0,0,0],[1,0,0],[0,1,0]])
m              # <3x3 sparse matrix of type '<type 'numpy.int64'>' with 2 stored elements in Compressed Sparse Row format>

io.mmwrite("test.mtx", m)
del m

newm = io.mmread("test.mtx")
newm           # <3x3 sparse matrix of type '<type 'numpy.int32'>' with 2 stored elements in COOrdinate format>
newm.tocsr()   # <3x3 sparse matrix of type '<type 'numpy.int32'>' with 2 stored elements in Compressed Sparse Row format>
newm.toarray() # array([[0, 0, 0], [1, 0, 0], [0, 1, 0]], dtype=int32)


Den*_*zov 25

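Here is a performance comparison of the three most upvoted answers, run in a Jupyter notebook. The input is a 1M x 100K random sparse matrix with density 0.001, containing 100M non-zero values: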

from scipy.sparse import random
matrix = random(1000000, 100000, density=0.001, format='csr')

matrix
<1000000x100000 sparse matrix of type '<type 'numpy.float64'>'
with 100000000 stored elements in Compressed Sparse Row format>

io.mmwrite/io.mmread

from scipy import io

%time io.mmwrite('test_io.mtx', matrix)
CPU times: user 4min 37s, sys: 2.37 s, total: 4min 39s
Wall time: 4min 39s

%time matrix = io.mmread('test_io.mtx')
CPU times: user 2min 41s, sys: 1.63 s, total: 2min 43s
Wall time: 2min 43s    

matrix
<1000000x100000 sparse matrix of type '<type 'numpy.float64'>'
with 100000000 stored elements in COOrdinate format>    

Filesize: 3.0G.

(Note that the format has changed from csr to coo.)

np.savez/np.load

import numpy as np
from scipy.sparse import csr_matrix

def save_sparse_csr(filename, array):
    # note that .npz extension is added automatically
    np.savez(filename, data=array.data, indices=array.indices,
             indptr=array.indptr, shape=array.shape)

def load_sparse_csr(filename):
    # here we need to add .npz extension manually
    loader = np.load(filename + '.npz')
    return csr_matrix((loader['data'], loader['indices'], loader['indptr']),
                      shape=loader['shape'])


%time save_sparse_csr('test_savez', matrix)
CPU times: user 1.26 s, sys: 1.48 s, total: 2.74 s
Wall time: 2.74 s    

%time matrix = load_sparse_csr('test_savez')
CPU times: user 1.18 s, sys: 548 ms, total: 1.73 s
Wall time: 1.73 s

matrix
<1000000x100000 sparse matrix of type '<type 'numpy.float64'>'
with 100000000 stored elements in Compressed Sparse Row format>

Filesize: 1.1G.

cPickle

import cPickle as pickle

def save_pickle(matrix, filename):
    with open(filename, 'wb') as outfile:
        pickle.dump(matrix, outfile, pickle.HIGHEST_PROTOCOL)
def load_pickle(filename):
    with open(filename, 'rb') as infile:
        matrix = pickle.load(infile)    
    return matrix    

%time save_pickle(matrix, 'test_pickle.mtx')
CPU times: user 260 ms, sys: 888 ms, total: 1.15 s
Wall time: 1.15 s    

%time matrix = load_pickle('test_pickle.mtx')
CPU times: user 376 ms, sys: 988 ms, total: 1.36 s
Wall time: 1.37 s    

matrix
<1000000x100000 sparse matrix of type '<type 'numpy.float64'>'
with 100000000 stored elements in Compressed Sparse Row format>

Filesize: 1.1G.

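Note: cPickle does not work with very large objects (see this answer). In my experience, it didn't work for a 2.7M x 50k matrix with 270M non-zero values. The np.savez solution worked well.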

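Conclusion

(Based on this simple test for CSR matrices) cPickle is the fastest method, but it doesn't work with very large matrices; np.savez is only slightly slower, while io.mmwrite is much slower, produces a bigger file and restores to the wrong format. So np.savez is the winner here.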
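
For readers who want to repeat the comparison outside Jupyter, the following is a rough, small-scale sketch using time.perf_counter on Python 3 (so pickle rather than cPickle). The matrix size, file names and helper functions are assumptions chosen so that it runs quickly; timings on a matrix this small will not reproduce the numbers above.

```python
import os
import pickle
import tempfile
import time

import numpy as np
from scipy import io, sparse

# Much smaller than the matrix in the benchmark above, purely for illustration.
matrix = sparse.random(2000, 1000, density=0.01, format="csr")
workdir = tempfile.mkdtemp()

def save_mtx(path):
    io.mmwrite(path, matrix)

def save_npz_arrays(path):
    np.savez(path, data=matrix.data, indices=matrix.indices,
             indptr=matrix.indptr, shape=matrix.shape)

def save_pickle(path):
    with open(path, "wb") as outfile:
        pickle.dump(matrix, outfile, pickle.HIGHEST_PROTOCOL)

# Time each saver once and record the resulting file size.
results = {}
for name, saver, filename in [
    ("io.mmwrite", save_mtx, "test.mtx"),
    ("np.savez", save_npz_arrays, "test.npz"),
    ("pickle", save_pickle, "test.pkl"),
]:
    path = os.path.join(workdir, filename)
    start = time.perf_counter()
    saver(path)
    elapsed = time.perf_counter() - start
    results[name] = (elapsed, os.path.getsize(path))
    print("{}: {:.4f} s, {} bytes".format(name, elapsed, results[name][1]))
```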

  • Thanks! Just a note that, at least for me (Py 2.7.11), the line `from scipy.sparse import io` doesn't work. Instead, just use `from scipy import io`. [Docs](https://docs.scipy.org/doc/scipy-0.19.0/reference/generated/scipy.io.mmwrite.html) (2 upvotes)

Joe*_*ton 11

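Assuming you have scipy on both machines, you can just use pickle.

However, be sure to specify a binary protocol when pickling numpy arrays. Otherwise you'll wind up with a huge file.

At any rate, you should be able to do this: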

import cPickle as pickle
import numpy as np
import scipy.sparse

# Just for testing, let's make a dense array and convert it to a csr_matrix
x = np.random.random((10,10))
x = scipy.sparse.csr_matrix(x)

with open('test_sparse_array.dat', 'wb') as outfile:
    pickle.dump(x, outfile, pickle.HIGHEST_PROTOCOL)

You can then load it with:

import cPickle as pickle

with open('test_sparse_array.dat', 'rb') as infile:
    x = pickle.load(infile)
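
Since cPickle only exists on Python 2, the snippets above will not import on Python 3. The sketch below shows one common way to make the same round trip run on either interpreter. It only addresses the import; it does not by itself solve the cross-version loading error described in the question. The identity matrix and temporary path are illustrative.

```python
import os
import tempfile

import numpy as np
import scipy.sparse

try:
    import cPickle as pickle  # Python 2
except ImportError:
    import pickle  # Python 3: the C implementation is used automatically

# A small illustrative matrix.
x = scipy.sparse.csr_matrix(np.eye(3))
path = os.path.join(tempfile.mkdtemp(), "test_sparse_array.dat")

# Dump with a binary protocol, as recommended above.
with open(path, "wb") as outfile:
    pickle.dump(x, outfile, pickle.HIGHEST_PROTOCOL)

with open(path, "rb") as infile:
    x_back = pickle.load(infile)
```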


x0s*_*x0s 9

As of scipy 0.19.0, you can save and load sparse matrices this way:

from scipy import sparse

data = sparse.csr_matrix((3, 4))

#Save
sparse.save_npz('data_sparse.npz', data)

#Load
data = sparse.load_npz("data_sparse.npz")
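
As a small follow-up sketch, the loop below checks that save_npz/load_npz bring a matrix back in the sparse format it was saved in, in contrast to the mmwrite/mmread route, which returned COO in the benchmark above. The dense values, temporary directory and the choice of compressed=False are illustrative assumptions.

```python
import os
import tempfile

import numpy as np
from scipy import sparse

dense = np.array([[0, 1, 0], [2, 0, 3]])
workdir = tempfile.mkdtemp()

# Round-trip the same data in several sparse formats.
restored_formats = {}
for fmt in ("csr", "csc", "coo"):
    original = sparse.csr_matrix(dense).asformat(fmt)
    path = os.path.join(workdir, fmt + ".npz")
    # compressed=False skips compression; the default is compressed=True.
    sparse.save_npz(path, original, compressed=False)
    restored = sparse.load_npz(path)
    restored_formats[fmt] = restored.format
    assert np.array_equal(restored.toarray(), dense)
```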