Stacking two sparse matrices with different dimensions

use*_*931 5 python scipy sparse-matrix scikit-learn

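I have two sparse matrices, created with sklearn's HashingVectorizer from two sets of features (each matrix corresponds to one feature set). I want to concatenate them so I can use them later for clustering. However, I am running into a dimension problem, because the two matrices do not have the same number of rows.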

Here is an example:

Xa = [-0.57735027 -0.57735027  0.57735027 -0.57735027 -0.57735027  0.57735027
  0.5         0.5        -0.5         0.5         0.5        -0.5         0.5
  0.5        -0.5         0.5        -0.5         0.5         0.5        -0.5
  0.5         0.5       ]

Xb = [-0.57735027 -0.57735027  0.57735027 -0.57735027  0.57735027  0.57735027
 -0.5         0.5         0.5         0.5        -0.5        -0.5         0.5
 -0.5        -0.5        -0.5         0.5         0.5       ]

Xa and Xb are both of type <class 'scipy.sparse.csr.csr_matrix'>. Their shapes are Xa.shape = (6, 1048576) and Xb.shape = (5, 1048576). The error I get is below (I now understand why it happens):

    X = hstack((Xa, Xb))
  File "/usr/local/lib/python2.7/site-packages/scipy/sparse/construct.py", line 464, in hstack
    return bmat([blocks], format=format, dtype=dtype)
  File "/usr/local/lib/python2.7/site-packages/scipy/sparse/construct.py", line 581, in bmat
    'row dimensions' % i)
ValueError: blocks[0,:] has incompatible row dimensions

Is there a way to stack sparse matrices even though their dimensions do not match? Perhaps by adding some padding?

I have looked at these posts:

Joã*_*ida 5

You can pad it with an empty sparse matrix.

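You want to stack the matrices horizontally, so you need to pad the smaller one until it has the same number of rows as the larger one. To do that, stack it vertically with an empty matrix of shape (difference in number of rows, number of columns of the original matrix).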

Like this:

from scipy.sparse import csr_matrix
from scipy.sparse import hstack
from scipy.sparse import vstack

# Create two empty sparse matrices for the demo
Xa = csr_matrix((4, 4))
Xb = csr_matrix((3, 5))

# Number of rows Xb is missing compared to Xa
# (this assumes Xa has at least as many rows as Xb)
diff_n_rows = Xa.shape[0] - Xb.shape[0]

# Pad Xb with an empty block of shape (diff_n_rows, number of columns of Xb)
Xb_new = vstack((Xb, csr_matrix((diff_n_rows, Xb.shape[1]))))

X = hstack((Xa, Xb_new))
X

The result is:

<4x9 sparse matrix of type '<class 'numpy.float64'>'
    with 0 stored elements in COOrdinate format>
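The snippet above assumes Xa is the matrix with more rows. If you do not know in advance which one is shorter, the same padding idea can be wrapped in a small helper. This is only a sketch: `pad_rows` and `hstack_padded` are names made up for this illustration, not scipy functions, and the tiny matrices are invented demo data. Note that padding appends all-zero rows at the bottom, so it only makes sense if the existing rows of both matrices already line up with the same samples.

```python
from scipy.sparse import csr_matrix, hstack, vstack


def pad_rows(m, n_rows):
    """Return m with all-zero rows appended until it has n_rows rows.

    Hypothetical helper for illustration; not part of scipy.
    """
    missing = n_rows - m.shape[0]
    if missing <= 0:
        return m  # already tall enough, nothing to pad
    padding = csr_matrix((missing, m.shape[1]), dtype=m.dtype)
    return vstack((m, padding), format="csr")


def hstack_padded(a, b):
    """Horizontally stack a and b after padding the shorter one with empty rows."""
    n_rows = max(a.shape[0], b.shape[0])
    return hstack((pad_rows(a, n_rows), pad_rows(b, n_rows)), format="csr")


# Small made-up matrices: 3 rows vs. 2 rows
Xa = csr_matrix([[1.0, 0.0], [0.0, 2.0], [3.0, 0.0]])
Xb = csr_matrix([[4.0, 0.0, 5.0], [0.0, 6.0, 0.0]])

X = hstack_padded(Xa, Xb)
print(X.shape)
print(X.toarray())
```

The padding block stores no elements, so the extra rows cost almost nothing in memory, even with a column count as large as 1048576.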