Scipy Sparse - Distance Matrix (Scikit or Scipy)

use*_*306 5 python numpy scipy sparse-matrix scikit-learn

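I am trying to compute nearest-neighbor clustering on a Scipy sparse matrix returned from scikit-learn's DictVectorizer. However, when I try to compute the distance matrix with scikit-learn, I get an error message using the 'euclidean' distance through both pairwise.euclidean_distances and pairwise.pairwise_distances. I was under the impression that scikit-learn could compute these distance matrices.

My matrix is highly sparse, with shape: <364402x223209 sparse matrix of type <class 'numpy.float64'> with 728804 stored elements in Compressed Sparse Row format>.

I have also tried methods such as pdist and KDTree in Scipy, but received other errors about not being able to process the input.

Can anyone please point me to a solution that would let me efficiently compute the distance matrix and/or the nearest-neighbor results?

Some example code: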

import numpy as np
from sklearn.feature_extraction import DictVectorizer
from sklearn.neighbors import NearestNeighbors
from sklearn.metrics import pairwise
import scipy.spatial

path = 'FileLocation'
data = []
# Each input line is "user,item,value"; build one feature dict per line.
with open(path, 'r') as f:
    for line in f:
        templine = line.strip().split(',')
        data.append({'user': str(int(templine[0])), str(int(templine[1])): int(templine[2])})

vec = DictVectorizer()
X = vec.fit_transform(data)

result = scipy.spatial.KDTree(X)

Error:

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/Library/Frameworks/Python.framework/Versions/3.2/lib/python3.2/site-packages/scipy/spatial/kdtree.py", line 227, in __init__
    self.n, self.m = np.shape(self.data)
ValueError: need more than 0 values to unpack

Similarly, if I run:

scipy.spatial.distance.pdist(X,'euclidean')

I get the following:

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/Library/Frameworks/Python.framework/Versions/3.2/lib/python3.2/site-packages/scipy/spatial/distance.py", line 1169, in pdist
    [X] = _copy_arrays_if_base_present([_convert_to_double(X)])
  File "/Library/Frameworks/Python.framework/Versions/3.2/lib/python3.2/site-packages/scipy/spatial/distance.py", line 113, in _convert_to_double
    X = X.astype(np.double)
ValueError: setting an array element with a sequence.

Finally, running NearestNeighbors in scikit-learn results in a memory error:

nbrs = NearestNeighbors(n_neighbors=10, algorithm='brute')

alk*_*lko 4

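First, you can't use KDTree or pdist with a sparse matrix; you have to convert it to a dense one (whether that is a viable option for you is your call):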

>>> X
<2x3 sparse matrix of type '<type 'numpy.float64'>'
        with 4 stored elements in Compressed Sparse Row format>

>>> scipy.spatial.KDTree(X.todense())
<scipy.spatial.kdtree.KDTree object at 0x34d1e10>
>>> scipy.spatial.distance.pdist(X.todense(),'euclidean')
array([ 6.55743852])
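As a small sanity check of the point above, here is a minimal sketch (not part of the original answer) that compares `pdist` on a densified copy with scikit-learn's `euclidean_distances`, which accepts CSR input directly. The matrix size, density, and seed are arbitrary assumptions chosen for illustration.

```python
import numpy as np
import scipy.sparse as sp
from scipy.spatial.distance import pdist, squareform
from sklearn.metrics.pairwise import euclidean_distances

# Toy stand-in for the DictVectorizer output (sizes are arbitrary assumptions).
X_small = sp.random(6, 20, density=0.2, format="csr", random_state=0)

# SciPy route: pdist needs a dense array, so the input is densified first.
d_scipy = squareform(pdist(X_small.toarray(), "euclidean"))

# scikit-learn route: the CSR matrix is passed as-is.
d_sklearn = euclidean_distances(X_small)

# Both routes should agree up to floating-point rounding.
print(np.allclose(d_scipy, d_sklearn, atol=1e-6))
```

Note that the result is still a dense n-by-n array in both cases, so this only avoids densifying the input, not the output.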

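Second, from the docs:

Efficient brute-force neighbors searches can be very competitive for small data samples. However, as the number of samples N grows, the brute-force approach quickly becomes infeasible.

You might want to try the 'ball_tree' algorithm and see whether it can handle your data.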
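If only the k nearest neighbors are needed rather than the full distance matrix, one possible approach is to fit `NearestNeighbors` on the sparse matrix and query it in batches, so that only a block of distances exists at any time. The sketch below uses toy data, and `batch_size` is a hypothetical value to tune. One caveat worth checking against the scikit-learn documentation for your version: to the best of my knowledge, fitting on sparse input falls back to brute force regardless of the `algorithm` setting, so 'ball_tree' may not take effect on a CSR matrix.

```python
import numpy as np
import scipy.sparse as sp
from sklearn.neighbors import NearestNeighbors

# Toy sparse data; the real matrix in the question is 364402 x 223209.
X_toy = sp.random(200, 500, density=0.01, format="csr", random_state=0)

nbrs = NearestNeighbors(n_neighbors=10, algorithm="brute", metric="euclidean")
nbrs.fit(X_toy)  # sparse input is accepted; nothing is densified here

batch_size = 50  # hypothetical value; tune to the available memory
all_dist, all_idx = [], []
for start in range(0, X_toy.shape[0], batch_size):
    # Each call needs at most a (batch_size x n_samples) block of distances.
    dist, idx = nbrs.kneighbors(X_toy[start:start + batch_size])
    all_dist.append(dist)
    all_idx.append(idx)

distances = np.vstack(all_dist)  # shape (n_samples, 10)
indices = np.vstack(all_idx)     # shape (n_samples, 10)
print(distances.shape, indices.shape)
```

The output here is n_samples x 10 instead of n_samples x n_samples, which is what keeps the memory bounded; the running time is still quadratic in the number of samples.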

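  • @user2694306 Your **Euclidean** distance matrix has to be dense (I guess it probably won't contain any zero values), so it would have to span more than 74 Gb of memory. I doubt that's feasible. (3 upvotes)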
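The memory argument can be checked with back-of-the-envelope arithmetic. The sketch below uses the shape and stored-element count quoted in the question and assumes 8-byte float64 values; the CSR estimate additionally assumes 4-byte (int32) indices, which is an assumption about the matrix rather than something stated in the thread.

```python
# Shape and number of stored elements as quoted in the question.
n_samples, n_features, nnz = 364402, 223209, 728804
bytes_per_float = 8  # numpy.float64

# Densified feature matrix, as produced by X.todense().
dense_features = n_samples * n_features * bytes_per_float
# Full square distance matrix (n_samples x n_samples).
full_distances = n_samples * n_samples * bytes_per_float
# Condensed form returned by pdist: n * (n - 1) / 2 entries.
condensed_distances = n_samples * (n_samples - 1) // 2 * bytes_per_float
# Rough CSR footprint: data + column indices + row pointers (int32 assumed).
csr_features = nnz * (bytes_per_float + 4) + (n_samples + 1) * 4

for name, size in [
    ("dense feature matrix", dense_features),
    ("full distance matrix", full_distances),
    ("condensed distance matrix", condensed_distances),
    ("CSR feature matrix", csr_features),
]:
    print(f"{name}: {size / 1e9:.2f} GB")
```

Under these assumptions every dense variant comes out far above the figure mentioned in the comment, while the sparse input itself is only on the order of megabytes.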