numpy.ndarray sparse matrix to dense

Question

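I want to run sklearn's RandomForestClassifier on some data that is packed as a numpy.ndarray and happens to be sparse. Calling fit raises ValueError: setting an array element with a sequence. From other posts I understand that random forest cannot handle sparse data.

I expected the object to have a todense method, but it does not.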

>>> X_train
array(<1443899x1936774 sparse matrix of type '<class 'numpy.float64'>'
    with 141256894 stored elements in Compressed Sparse Row format>,
      dtype=object)
>>> type(X_train)
<class 'numpy.ndarray'>

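I tried wrapping it with SciPy's csr_matrix, but that raises an error as well.

Is there any way to make random forest accept this data? (I am not sure the dense form would actually fit in memory, but that is another matter...)

Edit 1

The code that produces the error is this: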

import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

X_train = np.load('train.npy') # this returns a ndarray
train_gt = pd.read_csv('train_gt.csv')

model = RandomForestClassifier()
model.fit(X_train, train_gt.target)

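As for the suggestion to use toarray(): the ndarray has no such method. AttributeError: 'numpy.ndarray' object has no attribute 'toarray'

Also, as mentioned, for this particular data I would need terabytes of memory to hold the dense array. Is there an option to run RandomForestClassifier with a sparse array?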
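On the last point: the scikit-learn documentation for RandomForestClassifier.fit lists sparse matrices among the accepted inputs, so a SciPy CSR matrix can apparently be passed without densifying it. The following is a minimal sketch on a small random matrix; the sizes, the labels and the estimator settings are illustrative assumptions, and whether training is practical at 1443899 x 1936774 is not something this sketch shows.

```python
import numpy as np
import scipy.sparse as sp
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)

# Small stand-in for the real data: roughly 10% non-zero entries.
dense = rng.random((200, 30))
dense[dense < 0.9] = 0.0
X_sparse = sp.csr_matrix(dense)

# Hypothetical binary labels, one per row.
y = rng.integers(0, 2, size=200)

# Fit directly on the sparse matrix; no toarray()/todense() call is made here.
model = RandomForestClassifier(n_estimators=10, random_state=0)
model.fit(X_sparse, y)

# Prediction also takes sparse rows.
pred = model.predict(X_sparse[:5])
print(pred.shape)
```

Note that this requires an actual SciPy sparse matrix, not the object-dtype ndarray shown in the question.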

Edit 2

It seems the data should have been saved with SciPy's sparse routines, as described in "Save / load scipy sparse csr_matrix in portable data format". With NumPy's save/load, more data should have been saved.
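As a minimal sketch of that approach, scipy.sparse.save_npz and scipy.sparse.load_npz round-trip a sparse matrix through a single .npz file. The matrix below is a small stand-in and the file name is hypothetical.

```python
import os
import tempfile

import numpy as np
import scipy.sparse as sp

# Small stand-in for the real training matrix (sizes are illustrative).
rng = np.random.default_rng(0)
dense = rng.random((100, 50))
dense[dense < 0.95] = 0.0
X = sp.csr_matrix(dense)

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "train.npz")  # hypothetical file name
    sp.save_npz(path, X)           # writes the sparse structure to one file
    X_loaded = sp.load_npz(path)   # returns a sparse matrix, not an ndarray

print(type(X_loaded), X_loaded.shape, X_loaded.nnz)
```

Loaded this way, the object keeps its sparse methods such as toarray() and can be passed to code that accepts sparse input.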

Answer 1

hpa*_*ulj 6

>>> X_train
array(<1443899x1936774 sparse matrix of type '<class 'numpy.float64'>'
    with 141256894 stored elements in Compressed Sparse Row format>,
      dtype=object)

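This output means that your code, or something it calls, has done np.array(M), where M is a csr sparse matrix. That call just wraps the matrix in an object dtype array.

To use sparse matrices in code that does not accept them, you first have to convert them to dense: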
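A small sketch of that wrapping, and of getting the sparse matrix back out, is shown here. The 2 x 3 matrix is made up for illustration. For an array loaded from a .npy file, as in the question, recent NumPy versions would also need allow_pickle=True in np.load, which is an assumption about the asker's setup.

```python
import numpy as np
import scipy.sparse as sp

# A tiny CSR matrix standing in for the real one.
M = sp.csr_matrix(np.array([[0.0, 1.0, 0.0],
                            [2.0, 0.0, 3.0]]))

# np.array does not densify M; it stores the matrix object itself
# inside a zero-dimensional array with dtype=object.
wrapped = np.array(M)
print(wrapped.shape, wrapped.dtype)

# .item() (or wrapped[()]) returns the single element, i.e. the sparse matrix.
recovered = wrapped.item()
print(type(recovered), recovered.shape)
```

The recovered object is the original sparse matrix, so methods such as toarray() are available on it again.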

 arr = M.toarray()    # or M.A same thing
 mat = M.todense()    # to make a np.matrix

But given the dimensions and the number of nonzero elements, this conversion will most likely produce a memory error.
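A rough back-of-the-envelope check of that claim, using the shape and stored-element count from the repr in the question. The 4-byte index size is an assumption (int32 indices); the value size follows from the float64 dtype.

```python
# Figures taken from the repr of X_train in the question.
n_rows, n_cols, nnz = 1_443_899, 1_936_774, 141_256_894

value_bytes = 8  # float64
index_bytes = 4  # assumption: int32 column indices and row pointers

# Dense storage holds every cell, zero or not.
dense_bytes = n_rows * n_cols * value_bytes

# CSR stores one value and one column index per non-zero, plus n_rows + 1 row pointers.
csr_bytes = nnz * (value_bytes + index_bytes) + (n_rows + 1) * index_bytes

print(f"dense: {dense_bytes / 1e12:.1f} TB")
print(f"CSR:   {csr_bytes / 1e9:.1f} GB")
```

Under these assumptions the dense array needs on the order of 22 TB, while the CSR form stays under 2 GB.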
