use*_*306 5 python numpy scipy sparse-matrix scikit-learn
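I am trying to compute nearest-neighbour clustering on a SciPy sparse matrix returned by scikit-learn's DictVectorizer. However, when I try to compute the distance matrix with scikit-learn, I get an error message using the 'euclidean' distance through both pairwise.euclidean_distances and pairwise.pairwise_distances. I was under the impression that scikit-learn could compute these distance matrices.
My matrix is highly sparse, with this shape: <364402x223209 sparse matrix of type <class 'numpy.float64'>
with 728804 stored elements in Compressed Sparse Row format>.
I have also tried methods such as pdist and KDTree in SciPy, but got other errors saying the input could not be processed.
Can anyone point me to a solution that would let me compute the distance matrix and/or the nearest-neighbour results efficiently?
Some example code: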
import numpy as np
from sklearn.feature_extraction import DictVectorizer
from sklearn.neighbors import NearestNeighbors
from sklearn.metrics import pairwise
import scipy.spatial
file = 'FileLocation'
data = []
FILE = open(file,'r')
for line in FILE:
    templine = line.strip().split(',')
    data.append({'user':str(int(templine[0])),str(int(templine[1])):int(templine[2])})
FILE.close()
vec = DictVectorizer()
X = vec.fit_transform(data)
result = scipy.spatial.KDTree(X)
Error:
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/Library/Frameworks/Python.framework/Versions/3.2/lib/python3.2/site-packages/scipy/spatial/kdtree.py", line 227, in __init__
    self.n, self.m = np.shape(self.data)
ValueError: need more than 0 values to unpack
Similarly, if I run:
scipy.spatial.distance.pdist(X,'euclidean')
I get the following:
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/Library/Frameworks/Python.framework/Versions/3.2/lib/python3.2/site-packages/scipy/spatial/distance.py", line 1169, in pdist
    [X] = _copy_arrays_if_base_present([_convert_to_double(X)])
  File "/Library/Frameworks/Python.framework/Versions/3.2/lib/python3.2/site-packages/scipy/spatial/distance.py", line 113, in _convert_to_double
    X = X.astype(np.double)
ValueError: setting an array element with a sequence.
Finally, running NearestNeighbors from scikit-learn results in a memory error when using:
nbrs = NearestNeighbors(n_neighbors=10, algorithm='brute')
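First off, you can't use KDTree or pdist with a sparse matrix; you have to convert it to a dense one first (whether that is an option for you is your call):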
>>> X
<2x3 sparse matrix of type '<type 'numpy.float64'>'
with 4 stored elements in Compressed Sparse Row format>
>>> scipy.spatial.KDTree(X.todense())
<scipy.spatial.kdtree.KDTree object at 0x34d1e10>
>>> scipy.spatial.distance.pdist(X.todense(),'euclidean')
array([ 6.55743852])
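Before calling todense() on a matrix the size of the one in the question, it may be worth checking how much memory the dense copy would need. The helper below is a rough sketch; dense_nbytes is a hypothetical name, not a SciPy function, and it assumes 8 bytes per element (float64).

```python
def dense_nbytes(shape, itemsize=8):
    """Bytes needed for a dense 2-D array of the given shape.

    itemsize=8 assumes float64 elements, as in the question.
    """
    n_rows, n_cols = shape
    return n_rows * n_cols * itemsize


# Shape reported in the question.
needed = dense_nbytes((364402, 223209))
print("dense copy would need about %.0f GB" % (needed / 1e9))
```

For the question's shape this comes to roughly 650 GB, so densifying is realistic only for a much smaller matrix or a subset of the rows.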
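Second, from the docs:
Efficient brute-force neighbors searches can be very competitive for small data samples. However, as the number of samples N grows, the brute-force approach quickly becomes infeasible.
You might want to try the 'ball_tree' algorithm and see if it can handle your data.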
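One way to keep memory bounded while staying sparse is to fit NearestNeighbors once and query it a batch of rows at a time, so only a slice of the distance matrix exists at any moment. This is a minimal sketch, not a tested fix for the asker's data: batched_kneighbors and batch_size are names made up for illustration, and the random matrix stands in for the DictVectorizer output. Note that the scikit-learn documentation says fitting on sparse input overrides the algorithm setting and uses brute force, so 'ball_tree' would likely need dense input; recent scikit-learn releases also appear to chunk brute-force queries internally, which may make the explicit loop unnecessary there.

```python
import numpy as np
from scipy import sparse
from sklearn.neighbors import NearestNeighbors


def batched_kneighbors(X, n_neighbors=10, batch_size=1000):
    """Find the k nearest rows for every row of sparse X, one batch at a time."""
    nn = NearestNeighbors(n_neighbors=n_neighbors, algorithm="brute",
                          metric="euclidean")
    nn.fit(X)
    dist_parts, idx_parts = [], []
    for start in range(0, X.shape[0], batch_size):
        # Only batch_size x n_samples distances are computed per query.
        dist, idx = nn.kneighbors(X[start:start + batch_size])
        dist_parts.append(dist)
        idx_parts.append(idx)
    return np.vstack(dist_parts), np.vstack(idx_parts)


# Small random CSR matrix standing in for the real data (assumption).
X = sparse.random(500, 2000, density=0.001, format="csr", random_state=0)
distances, indices = batched_kneighbors(X, n_neighbors=5, batch_size=100)
print(distances.shape, indices.shape)
```

Because the query rows are also in the fitted set, each row's nearest neighbour is a point at distance zero (normally the row itself), so ask for one extra neighbour if you want to discard it.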