Calculating Euclidean distance from dictionaries (sklearn)

Zel*_*ahl 2 python dictionary numpy euclidean-distance scikit-learn

I have already computed two dictionaries in my code, which look like this:

X = {'a': 10, 'b': 3, 'c': 5, ...}
Y = {'a': 8, 'c': 3, 'e': 8, ...}

In reality they contain words from wiki texts, but this should be enough to show what I mean. They do not necessarily contain the same keys.

Initially, I wanted to use sklearn's pairwise metrics like this:

from sklearn.metrics.pairwise import pairwise_distances

obama = wiki[wiki['name'] == 'Barack Obama']['tf_idf'][0]
biden = wiki[wiki['name'] == 'Joe Biden']['tf_idf'][0]

obama_biden_distance = pairwise_distances(obama, biden, metric='euclidean', n_jobs=2)[0][0]

However, this results in an error:

--------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
<ipython-input-124-7ff03bd40683> in <module>()
      6 biden = wiki[wiki['name'] == 'Joe Biden']['tf_idf'][0]
      7 
----> 8 obama_biden_distance = pairwise_distances(obama, biden, metric='euclidean', n_jobs=2)[0][0]

/home/xiaolong/development/anaconda3/envs/coursera_ml_clustering_and_retrieval/lib/python3.4/site-packages/sklearn/metrics/pairwise.py in pairwise_distances(X, Y, metric, n_jobs, **kwds)
   1205         func = partial(distance.cdist, metric=metric, **kwds)
   1206 
-> 1207     return _parallel_pairwise(X, Y, func, n_jobs, **kwds)
   1208 
   1209 

/home/xiaolong/development/anaconda3/envs/coursera_ml_clustering_and_retrieval/lib/python3.4/site-packages/sklearn/metrics/pairwise.py in _parallel_pairwise(X, Y, func, n_jobs, **kwds)
   1058     ret = Parallel(n_jobs=n_jobs, verbose=0)(
   1059         fd(X, Y[s], **kwds)
-> 1060         for s in gen_even_slices(Y.shape[0], n_jobs))
   1061 
   1062     return np.hstack(ret)

AttributeError: 'dict' object has no attribute 'shape'

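To me this sounds like something is trying to access the shape attribute, which a dict does not have. I guess the function needs numpy arrays. How can I convert the dictionaries so that the sklearn function computes the correct distance, assuming a value of 0 whenever one dictionary lacks a key that the other has?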

jua*_*aga 6

Why not just do it directly from your sparse representation?

In [1]: import math

In [2]: Y = {'a': 8, 'c':3,'e':8}

In [3]: X = {'a':10, 'b':3, 'c':5}

In [4]: math.sqrt(sum((X.get(d,0) - Y.get(d,0))**2 for d in set(X) | set(Y)))
Out[4]: 9.0