ppl*_*lat 5 nearest-neighbor python-2.7 cosine-similarity scikit-learn
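I am trying to use scikit-learn's nearest-neighbor implementation to find the column vectors closest to a given column vector in a matrix of random values.
This code is supposed to find the nearest neighbors of column 21, then check the actual cosine similarity of those neighbors against column 21.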
from sklearn.neighbors import NearestNeighbors
import sklearn.metrics.pairwise as smp
import numpy as np
test=np.random.randint(0,5,(50,50))
nbrs = NearestNeighbors(n_neighbors=5, algorithm='auto', metric=smp.cosine_similarity).fit(test)
distances, indices = nbrs.kneighbors(test)
x=21
for idx,d in enumerate(indices[x]):
    sim2 = smp.cosine_similarity(test[:,x],test[:,d])
    print "sklearns cosine similarity would be ", sim2
    print 'sklearns reported distance is', distances[x][idx]
    print 'sklearns if that distance was cosine, the similarity would be: ' ,1- distances[x][idx]
The output looks like:
sklearns cosine similarity would be [[ 0.66190748]]
sklearns reported distance is 0.616586738214
sklearns if that distance was cosine, the similarity would be: 0.383413261786
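So the value reported for the neighbors is neither the cosine distance nor the cosine similarity. What gives?
Also, as an aside: I thought sklearn's NearestNeighbors implementation was not an approximate nearest-neighbor method, yet it does not seem to find the actual best neighbors in my dataset, compared with the results I get when I iterate over the matrix and check the similarity of column 211 against all the other columns. Am I misunderstanding something basic?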
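OK, the problem was that the .fit() method of NearestNeighbors assumes by default that rows are samples and columns are features. I had to transpose the matrix before passing it to fit.
EDIT: Another problem is that the callable passed as metric should be a distance callable, not a similarity callable. Otherwise you get the K farthest neighbors :/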
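For reference, here is a minimal sketch of both fixes applied together. It is not the exact code I ended up with: the fixed random seed, the cast to float, the explicit algorithm='brute', and the built-in 'cosine' metric string (in place of a custom distance callable) are my assumptions for illustration. The brute-force cross-check at the end is only there to confirm that the reported distances match 1 - cosine similarity.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors
from sklearn.metrics.pairwise import cosine_similarity

# Fixed seed (assumption) so the run is reproducible.
rng = np.random.RandomState(0)
test = rng.randint(0, 5, (50, 50)).astype(float)

# fit() treats rows as samples, so transpose to make every column a sample.
columns = test.T

# 'cosine' is a distance (1 - cosine similarity), not a similarity.
# Brute-force search is requested explicitly here.
nbrs = NearestNeighbors(n_neighbors=5, algorithm='brute', metric='cosine').fit(columns)
distances, indices = nbrs.kneighbors(columns)

x = 21
query = columns[x].reshape(1, -1)
for dist, idx in zip(distances[x], indices[x]):
    sim = cosine_similarity(query, columns[idx].reshape(1, -1))[0, 0]
    print("column %d: reported distance %.6f, 1 - similarity %.6f" % (idx, dist, 1 - sim))

# Brute-force cross-check: similarity of column x against every column.
all_sims = cosine_similarity(query, columns)[0]
expected_distances = np.sort(1 - all_sims)[:5]
print("matches brute force:", np.allclose(distances[x], expected_distances))
```

The first neighbor returned is column 21 itself, at distance 0 (up to rounding), so ask for n_neighbors=6 if you want five other columns.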