Kmeans Euclidean Distance to Each Centroid Avoid Splitting Features From Rest of DF

Ove*_*ass 0 python k-means python-3.x pandas scikit-learn

I have a df:

    id      Type1   Type2   Type3   
0   10000   0.0     0.00    0.00    
1   10001   0.0     63.72   0.00    
2   10002   473.6   174.00  31.60   
3   10003   0.0     996.00  160.92  
4   10004   0.0     524.91  0.00

I apply k-means to this df and add the resulting cluster to the df:

kmeans = cluster.KMeans(n_clusters=5, random_state=0).fit(df.drop('id', axis=1))
df['cluster'] = kmeans.labels_

Now I'm attempting to add columns to the df for the Euclidean distance between each point (i.e. row in the df) and each centroid:

def distance_to_centroid(row, centroid):
    row = row[['Type1',
               'Type2',
               'Type3']]
    return euclidean(row, centroid)

df['distance_to_center_0'] = df.apply(lambda r: distance_to_centroid(r, kmeans.cluster_centers_[0]),1)

This results in this error:

---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-34-56fa3ae3df54> in <module>()
----> 1 df['distance_to_center_0'] = df.apply(lambda r: distance_to_centroid(r, kmeans.cluster_centers_[0]),1)

~\_installed\anaconda\lib\site-packages\pandas\core\frame.py in apply(self, func, axis, broadcast, raw, reduce, result_type, args, **kwds)
   6002                          args=args,
   6003                          kwds=kwds)
-> 6004         return op.get_result()
   6005 
   6006     def applymap(self, func):

~\_installed\anaconda\lib\site-packages\pandas\core\apply.py in get_result(self)
    140             return self.apply_raw()
    141 
--> 142         return self.apply_standard()
    143 
    144     def apply_empty_result(self):

~\_installed\anaconda\lib\site-packages\pandas\core\apply.py in apply_standard(self)
    246 
    247         # compute the result using the series generator
--> 248         self.apply_series_generator()
    249 
    250         # wrap results

~\_installed\anaconda\lib\site-packages\pandas\core\apply.py in apply_series_generator(self)
    275             try:
    276                 for i, v in enumerate(series_gen):
--> 277                     results[i] = self.f(v)
    278                     keys.append(v.name)
    279             except Exception as e:

<ipython-input-34-56fa3ae3df54> in <lambda>(r)
----> 1 df['distance_to_center_0'] = df.apply(lambda r: distance_to_centroid(r, kmeans.cluster_centers_[0]),1)

<ipython-input-33-7b988ca2ad8c> in distance_to_centroid(row, centroid)
      7                 'atype',
      8                 'anothertype']]
----> 9     return euclidean(row, centroid)

~\_installed\anaconda\lib\site-packages\scipy\spatial\distance.py in euclidean(u, v, w)
    596 
    597     """
--> 598     return minkowski(u, v, p=2, w=w)
    599 
    600 

~\_installed\anaconda\lib\site-packages\scipy\spatial\distance.py in minkowski(u, v, p, w)
    488     if p < 1:
    489         raise ValueError("p must be at least 1")
--> 490     u_v = u - v
    491     if w is not None:
    492         w = _validate_weights(w)

ValueError: ('operands could not be broadcast together with shapes (7,) (8,) ', 'occurred at index 0')

This error appears to be happening because id is not included in the row variable in the function distance_to_centroid. To fix this, I could split the df into two parts (id in df1 and the rest of the columns in df2). However, this is very manual and does not allow for easy changes of columns. Is there a way to get the distance to each centroid into the original df without splitting it? In the same vein, is there a better way to find the Euclidean distance that doesn't involve manually entering the columns into the row variable, or manually creating as many columns as there are clusters?

Expected Result:

    id      Type1   Type2   Type3   cluster    distance_to_cluster_0
0   10000   0.0     0.00    0.00    1          2.3
1   10001   0.0     63.72   0.00    2          3.6 
2   10002   473.6   174.00  31.60   0          0.5 
3   10003   0.0     996.00  160.92  3          3.7 
4   10004   0.0     524.91  0.00    4          1.8  

unu*_*tbu 5

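We need to pass the coordinate part of df to KMeans, and we want to compute the distance to the centroids using only the coordinate part of df. So we might as well define a variable for this quantity: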

points = df.drop('id', axis=1)
# or points = df[['Type1', 'Type2', 'Type3']]

Then we can compute the distance from the coordinate part of each row to its corresponding centroid using:

import numpy as np
centroids = kmeans.cluster_centers_
# Row-wise Euclidean norm (axis=1): one distance per row, not a single scalar
dist = np.linalg.norm(points.to_numpy() - centroids[df['cluster']], axis=1)

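Note that centroids[df['cluster']] returns a NumPy array with the same shape as points. Indexing by df['cluster'] "expands" the centroids array so that each row holds the centroid assigned to the corresponding point.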
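
The "expanding" step is ordinary NumPy integer-array indexing: indexing a (k, d) array with n labels gives an (n, d) array. A minimal sketch, assuming made-up toy arrays (the numbers and the names own_centroid and labels are illustrative, not taken from the question's data):

```python
import numpy as np

# Hypothetical toy data: 3 centroids in 2-D, 4 points, and the label of each point.
centroids = np.array([[0.0, 0.0],
                      [10.0, 0.0],
                      [0.0, 5.0]])
points = np.array([[0.0, 8.0],
                   [3.0, -4.0],
                   [10.0, 1.0],
                   [0.0, 0.0]])
labels = np.array([2, 0, 1, 0])

# Integer-array indexing repeats centroid rows so that row i is the
# centroid assigned to point i; the result has the same shape as `points`.
own_centroid = centroids[labels]

# axis=1 yields one Euclidean norm per row rather than one scalar for the whole array.
dist = np.linalg.norm(points - own_centroid, axis=1)

print(own_centroid.shape)  # (4, 2)
print(dist)                # [3. 5. 1. 0.]
```

Omitting axis=1 would collapse the whole difference array into a single number, which pandas would then broadcast to every row of the column.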

Then we can assign these dist values to a DataFrame column:

df['dist'] = dist

For example,

import numpy as np
import pandas as pd
import sklearn.cluster as cluster
import scipy.spatial.distance as sdist

df = pd.DataFrame({'Type1': [0.0, 0.0, 473.6, 0.0, 0.0],
 'Type2': [0.0, 63.72, 174.0, 996.0, 524.91],
 'Type3': [0.0, 0.0, 31.6, 160.92, 0.0],
 'id': [1000, 10001, 10002, 10003, 10004]})

points = df.drop('id', axis=1)
# or points = df[['Type1', 'Type2', 'Type3']]
kmeans = cluster.KMeans(n_clusters=5, random_state=0).fit(points)
df['cluster'] = kmeans.labels_

centroids = kmeans.cluster_centers_
# Row-wise Euclidean norm (axis=1): one distance per row, not a single scalar
dist = np.linalg.norm(points.to_numpy() - centroids[df['cluster']], axis=1)
df['dist'] = dist

print(df)

yields

   Type1   Type2   Type3     id  cluster          dist
0    0.0    0.00    0.00   1000        4  0.000000e+00
1    0.0   63.72    0.00  10001        2  2.842171e-14
2  473.6  174.00   31.60  10002        1  0.000000e+00
3    0.0  996.00  160.92  10003        3  0.000000e+00
4    0.0  524.91    0.00  10004        0  0.000000e+00

Because this example fits five clusters to five points, every point is its own centroid, so each distance is zero up to floating-point error.

If you want the distance from each point to every cluster centroid, you can use sdist.cdist:

import scipy.spatial.distance as sdist
sdist.cdist(points, centroids)

For example,

import numpy as np
import pandas as pd
import sklearn.cluster as cluster
import scipy.spatial.distance as sdist

df = pd.DataFrame({'Type1': [0.0, 0.0, 473.6, 0.0, 0.0],
 'Type2': [0.0, 63.72, 174.0, 996.0, 524.91],
 'Type3': [0.0, 0.0, 31.6, 160.92, 0.0],
 'id': [1000, 10001, 10002, 10003, 10004]})

points = df.drop('id', axis=1)
# or points = df[['Type1', 'Type2', 'Type3']]
kmeans = cluster.KMeans(n_clusters=5, random_state=0).fit(points)
df['cluster'] = kmeans.labels_

centroids = kmeans.cluster_centers_
dists = pd.DataFrame(
    sdist.cdist(points, centroids), 
    columns=['dist_{}'.format(i) for i in range(len(centroids))],
    index=df.index)
df = pd.concat([df, dists], axis=1)

print(df)

yields

   Type1   Type2   Type3     id  cluster      dist_0      dist_1        dist_2       dist_3       dist_4
0    0.0    0.00    0.00   1000        4  524.910000  505.540819  6.372000e+01  1008.915877     0.000000
1    0.0   63.72    0.00  10001        2  461.190000  487.295802  2.842171e-14   946.066195    63.720000
2  473.6  174.00   31.60  10002        1  590.282431    0.000000  4.872958e+02   957.446929   505.540819
3    0.0  996.00  160.92  10003        3  497.816266  957.446929  9.460662e+02     0.000000  1008.915877
4    0.0  524.91    0.00  10004        0    0.000000  590.282431  4.611900e+02   497.816266   524.910000
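
The two results are related: the distance from a point to its own centroid is one entry in that point's row of the full point-to-centroid matrix. The sketch below illustrates this with made-up toy arrays. It swaps in plain NumPy broadcasting for sdist.cdist purely to stay dependency-free, and the names all_dists, own_dist and columns are illustrative assumptions:

```python
import numpy as np

# Hypothetical toy data: 4 points and 3 centroids in 2-D.
points = np.array([[0.0, 8.0],
                   [3.0, -4.0],
                   [10.0, 1.0],
                   [0.0, 0.0]])
centroids = np.array([[0.0, 0.0],
                      [10.0, 0.0],
                      [0.0, 5.0]])

# Broadcasting (n, 1, d) - (1, k, d) gives (n, k, d); taking the norm over
# the last axis leaves an (n, k) matrix of point-to-centroid distances.
all_dists = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)

# Here each point is simply labelled with its nearest centroid.
labels = all_dists.argmin(axis=1)

# Pick, for each row i, the column given by labels[i].
own_dist = all_dists[np.arange(len(points)), labels]

# One column name per centroid, so nothing is typed out by hand.
columns = ['dist_{}'.format(i) for i in range(len(centroids))]

print(all_dists.shape)
print(labels)
print(own_dist)
print(columns)
```

With a fitted scikit-learn model, kmeans.labels_ would take the place of labels here, and kmeans.transform(points) should, as far as I know, return the same matrix of distances to the cluster centres as cdist.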