Hen*_*ton 76 python numpy scipy
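How can I save/load a scipy sparse csr_matrix in a portable format?
The scipy sparse matrix is created on Python 3 (Windows 64-bit) and has to run on Python 2 (Linux 64-bit). Initially I used pickle (with protocol=2 and fix_imports=True), but this did not work going from Python 3.2.2 (Windows 64-bit) to Python 2.7.2 (Windows 32-bit) and gave the error: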
TypeError: ('data type not understood', <built-in function _reconstruct>, (<type 'numpy.ndarray'>, (0,), '[98]')).
Next, I tried numpy.save and numpy.load, as well as scipy.io.mmwrite() and scipy.io.mmread(), but none of these methods worked either.
Hen*_*ton 106
Edit: SciPy 0.19 now has scipy.sparse.save_npz and scipy.sparse.load_npz.
from scipy import sparse
sparse.save_npz("yourmatrix.npz", your_matrix)
your_matrix_back = sparse.load_npz("yourmatrix.npz")
For both of these functions, the file argument may also be a file-like object (i.e. the result of open) instead of a filename.
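As a minimal sketch of the file-like variant, the round trip below uses an in-memory `io.BytesIO` buffer in place of a file on disk. The matrix contents and variable names are illustrative assumptions, not part of the original answer.

```python
import io

import numpy as np
from scipy import sparse

# A small CSR matrix to round-trip (illustrative data only).
original = sparse.csr_matrix(np.array([[0, 0, 3], [4, 0, 0], [0, 5, 0]]))

# Write the .npz payload into an in-memory buffer instead of a named file.
buffer = io.BytesIO()
sparse.save_npz(buffer, original)

# Rewind before reading, as with any file object.
buffer.seek(0)
restored = sparse.load_npz(buffer)

# Compare sparse format and dense contents of both matrices.
same_format = restored.format == original.format
same_values = np.array_equal(restored.toarray(), original.toarray())
```

The same pattern should apply to a handle returned by `open(path, 'wb')` or `open(path, 'rb')`; binary mode matters because the .npz container is a zip archive.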
I got an answer from the Scipy user group:
A csr_matrix has 3 data attributes that matter: .data, .indices, and .indptr. All are simple ndarrays, so numpy.save will work on them. Save the three arrays with numpy.save or numpy.savez, load them back with numpy.load, and then recreate the sparse matrix object with:
new_csr = csr_matrix((data, indices, indptr), shape=(M, N))
So for example:
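import numpy as np
from scipy.sparse import csr_matrix

def save_sparse_csr(filename, array):
    # Store the three CSR arrays plus the shape in a single .npz archive
    np.savez(filename, data=array.data, indices=array.indices,
             indptr=array.indptr, shape=array.shape)

def load_sparse_csr(filename):
    # Rebuild the matrix from the stored arrays
    loader = np.load(filename)
    return csr_matrix((loader['data'], loader['indices'], loader['indptr']),
                      shape=loader['shape'])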
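A quick way to check the np.savez approach is a round trip through a temporary directory. The sketch below assumes an explicit `.npz` file name on both the save and the load side (np.savez appends the extension when it is missing, np.load does not), and it closes the archive with a context manager; the test matrix is made up for illustration.

```python
import os
import tempfile

import numpy as np
from scipy.sparse import csr_matrix


def save_sparse_csr(filename, array):
    # np.savez appends ".npz" when the name does not already end with it.
    np.savez(filename, data=array.data, indices=array.indices,
             indptr=array.indptr, shape=array.shape)


def load_sparse_csr(filename):
    # The context manager closes the underlying .npz archive after reading.
    with np.load(filename) as loader:
        return csr_matrix((loader['data'], loader['indices'], loader['indptr']),
                          shape=tuple(loader['shape']))


m = csr_matrix(np.array([[0, 0, 1], [2, 0, 0], [0, 0, 0]]))

with tempfile.TemporaryDirectory() as tmpdir:
    path = os.path.join(tmpdir, 'matrix.npz')  # explicit extension on both sides
    save_sparse_csr(path, m)
    m_back = load_sparse_csr(path)

round_trip_ok = np.array_equal(m.toarray(), m_back.toarray())
```

Because only plain ndarrays are written, the file does not depend on how scipy pickles its sparse classes, which is presumably why this route was suggested for moving data between Python versions.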
Fra*_*kow 36
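Although you write that scipy.io.mmwrite and scipy.io.mmread don't work for you, I just want to add how they work. This question is the no. 1 Google hit, so I myself started with np.savez and pickle.dump before switching to the simple and obvious scipy functions. They work for me and shouldn't be overlooked by those who haven't tried them yet.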
from scipy import sparse, io
m = sparse.csr_matrix([[0,0,0],[1,0,0],[0,1,0]])
m # <3x3 sparse matrix of type '<type 'numpy.int64'>' with 2 stored elements in Compressed Sparse Row format>
io.mmwrite("test.mtx", m)
del m
newm = io.mmread("test.mtx")
newm # <3x3 sparse matrix of type '<type 'numpy.int32'>' with 2 stored elements in COOrdinate format>
newm.tocsr() # <3x3 sparse matrix of type '<type 'numpy.int32'>' with 2 stored elements in Compressed Sparse Row format>
newm.toarray() # array([[0, 0, 0], [1, 0, 0], [0, 1, 0]], dtype=int32)
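The session above can also be written as a self-checking script. This is a sketch that writes to a temporary directory (an assumption made to keep the example self-contained) and converts the COO result back to CSR before comparing values.

```python
import os
import tempfile

import numpy as np
from scipy import io, sparse

m = sparse.csr_matrix(np.array([[0, 0, 0], [1, 0, 0], [0, 1, 0]]))

with tempfile.TemporaryDirectory() as tmpdir:
    path = os.path.join(tmpdir, 'test.mtx')
    io.mmwrite(path, m)        # plain-text Matrix Market file
    loaded = io.mmread(path)   # sparse input comes back in COO format

loaded_format = loaded.format  # format string of the loaded matrix
m_csr = loaded.tocsr()         # convert back to CSR explicitly
values_match = np.array_equal(m.toarray(), m_csr.toarray())
```

The comparison is on values only; as the output above shows, the integer dtype after loading may differ from the one that was saved.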
Den*_*zov 25
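Here is a performance comparison of the three most upvoted answers, run in a Jupyter notebook. The input is a 1M x 100K random sparse matrix with density 0.001, containing 100M non-zero values: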
from scipy.sparse import random
matrix = random(1000000, 100000, density=0.001, format='csr')
matrix
<1000000x100000 sparse matrix of type '<type 'numpy.float64'>'
with 100000000 stored elements in Compressed Sparse Row format>
io.mmwrite / io.mmread
from scipy import io
%time io.mmwrite('test_io.mtx', matrix)
CPU times: user 4min 37s, sys: 2.37 s, total: 4min 39s
Wall time: 4min 39s
%time matrix = io.mmread('test_io.mtx')
CPU times: user 2min 41s, sys: 1.63 s, total: 2min 43s
Wall time: 2min 43s
matrix
<1000000x100000 sparse matrix of type '<type 'numpy.float64'>'
with 100000000 stored elements in COOrdinate format>
Filesize: 3.0G.
(Note that the format has changed from csr to coo.)
np.savez / np.load
import numpy as np
from scipy.sparse import csr_matrix

def save_sparse_csr(filename, array):
    # note that .npz extension is added automatically
    np.savez(filename, data=array.data, indices=array.indices,
             indptr=array.indptr, shape=array.shape)

def load_sparse_csr(filename):
    # here we need to add .npz extension manually
    loader = np.load(filename + '.npz')
    return csr_matrix((loader['data'], loader['indices'], loader['indptr']),
                      shape=loader['shape'])

%time save_sparse_csr('test_savez', matrix)
CPU times: user 1.26 s, sys: 1.48 s, total: 2.74 s
Wall time: 2.74 s
%time matrix = load_sparse_csr('test_savez')
CPU times: user 1.18 s, sys: 548 ms, total: 1.73 s
Wall time: 1.73 s
matrix
<1000000x100000 sparse matrix of type '<type 'numpy.float64'>'
with 100000000 stored elements in Compressed Sparse Row format>
Filesize: 1.1G.
cPickle
import cPickle as pickle  # Python 2 module; on Python 3 use "import pickle"

def save_pickle(matrix, filename):
    with open(filename, 'wb') as outfile:
        pickle.dump(matrix, outfile, pickle.HIGHEST_PROTOCOL)

def load_pickle(filename):
    with open(filename, 'rb') as infile:
        matrix = pickle.load(infile)
    return matrix

%time save_pickle(matrix, 'test_pickle.mtx')
CPU times: user 260 ms, sys: 888 ms, total: 1.15 s
Wall time: 1.15 s
%time matrix = load_pickle('test_pickle.mtx')
CPU times: user 376 ms, sys: 988 ms, total: 1.36 s
Wall time: 1.37 s
matrix
<1000000x100000 sparse matrix of type '<type 'numpy.float64'>'
with 100000000 stored elements in Compressed Sparse Row format>
Filesize: 1.1G.
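Note: cPickle does not work with very large objects (see this answer). In my experience, it didn't work for a 2.7M x 50k matrix with 270M non-zero values. The np.savez solution worked well.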
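Based on this simple test for CSR matrices: cPickle is the fastest method, but it doesn't work with very large matrices; np.savez is only slightly slower; io.mmwrite is much slower, produces a bigger file, and restores the matrix in the wrong format. So np.savez is the winner here.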
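For readers who want to repeat the comparison outside a notebook, here is a scaled-down sketch. The matrix size (2000 x 1000, density 0.01), the file names, and the helper functions `write_savez` and `write_pickle` are assumptions chosen so the script finishes quickly; it uses Python 3's `pickle` rather than cPickle, and the numbers it prints will not match the timings above.

```python
import os
import pickle
import tempfile
import time

import numpy as np
from scipy import io, sparse

# Far smaller than the 1M x 100K benchmark matrix, so it runs in moments.
matrix = sparse.random(2000, 1000, density=0.01, format='csr')


def write_savez(path, m):
    # Same idea as save_sparse_csr: store the raw CSR arrays and the shape.
    np.savez(path, data=m.data, indices=m.indices,
             indptr=m.indptr, shape=m.shape)


def write_pickle(path, m):
    with open(path, 'wb') as outfile:
        pickle.dump(m, outfile, pickle.HIGHEST_PROTOCOL)


writers = {
    'io.mmwrite': ('m.mtx', io.mmwrite),
    'np.savez': ('m.npz', write_savez),
    'pickle': ('m.pkl', write_pickle),
}

results = {}
with tempfile.TemporaryDirectory() as tmpdir:
    for name, (fname, write) in writers.items():
        path = os.path.join(tmpdir, fname)
        start = time.perf_counter()
        write(path, matrix)
        elapsed = time.perf_counter() - start
        # Record write time in seconds and resulting file size in bytes.
        results[name] = (elapsed, os.path.getsize(path))

for name, (elapsed, size) in results.items():
    print(f'{name:12s} {elapsed:8.4f} s {size:10d} bytes')
```

Only write time and file size are measured here; timing the load step would follow the same pattern.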
Vic*_*sse 16
You can now use scipy.sparse.save_npz: https://docs.scipy.org/doc/scipy/reference/generated/scipy.sparse.save_npz.html
Joe*_*ton 11
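Assuming you have scipy on both machines, you can just use pickle.
However, be sure to specify a binary protocol when pickling numpy arrays. Otherwise you'll wind up with a huge file.
At any rate, you should be able to do this: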
import cPickle as pickle
import numpy as np
import scipy.sparse
# Just for testing, let's make a dense array and convert it to a csr_matrix
x = np.random.random((10,10))
x = scipy.sparse.csr_matrix(x)
with open('test_sparse_array.dat', 'wb') as outfile:
    pickle.dump(x, outfile, pickle.HIGHEST_PROTOCOL)
You can then load it with:
import cPickle as pickle
with open('test_sparse_array.dat', 'rb') as infile:
    x = pickle.load(infile)
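The snippets above import cPickle, which exists only on Python 2. A Python 3 sketch of the same save and load steps is shown below; the temporary directory is an assumption added to keep the example self-contained. Note that this only demonstrates a round trip within one interpreter, and the question reports that a pickle written on Python 3 failed to load on Python 2.

```python
import os
import pickle  # Python 3: the C implementation is used automatically
import tempfile

import numpy as np
import scipy.sparse

# Just for testing, make a dense array and convert it to a csr_matrix.
x = scipy.sparse.csr_matrix(np.random.random((10, 10)))

with tempfile.TemporaryDirectory() as tmpdir:
    path = os.path.join(tmpdir, 'test_sparse_array.dat')
    # Binary mode plus a binary protocol keeps the file compact.
    with open(path, 'wb') as outfile:
        pickle.dump(x, outfile, pickle.HIGHEST_PROTOCOL)
    with open(path, 'rb') as infile:
        x_back = pickle.load(infile)

same = np.array_equal(x.toarray(), x_back.toarray())
```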
As of scipy 0.19.0, you can save and load sparse matrices this way:
from scipy import sparse
data = sparse.csr_matrix((3, 4))
#Save
sparse.save_npz('data_sparse.npz', data)
#Load
data = sparse.load_npz("data_sparse.npz")
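save_npz is not limited to CSR input. The sketch below round-trips a CSC matrix and passes `compressed=False`, an existing keyword of `save_npz` that skips compression; the matrix values and the temporary path are illustrative assumptions.

```python
import os
import tempfile

import numpy as np
from scipy import sparse

# A CSC matrix this time, to show that the sparse format is stored as well.
data = sparse.csc_matrix(np.array([[1, 0, 0, 2],
                                   [0, 0, 3, 0],
                                   [0, 4, 0, 0]]))

with tempfile.TemporaryDirectory() as tmpdir:
    path = os.path.join(tmpdir, 'data_sparse.npz')
    # compressed=False writes an uncompressed archive (larger file on disk).
    sparse.save_npz(path, data, compressed=False)
    data_back = sparse.load_npz(path)

restored_format = data_back.format
values_equal = np.array_equal(data.toarray(), data_back.toarray())
```

Unlike io.mmread, which returned COO in the example earlier on this page, load_npz hands back the matrix in the format it was saved in.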