Using K-means with cosine similarity - Python

ise*_*372 10 · Tags: python, k-means, cosine-similarity, scikit-learn, sklearn-pandas


I am trying to implement the KMeans algorithm in Python so that it uses cosine distance instead of Euclidean distance as its distance metric.
I know that using a different distance function can be fatal and should be done carefully. Using cosine distance as the metric also forces me to change the mean function (the mean under cosine distance must be the element-wise mean of the normalized vectors).

I have seen an elegant solution that manually overrides sklearn's distance function, and I would like to use the same technique to override the averaging part of the code, but I could not find where it happens.

Does anyone know how this can be done?
How critical is it that the distance metric does not satisfy the triangle inequality?
It would also be really helpful if anyone knows of a different, efficient implementation of k-means that uses the cosine metric, or that accepts user-supplied distance and mean functions.
Thank you very much!

Edit:
After switching to angular distance instead of cosine distance, the code looks like this:

import numpy as np
from sklearn.cluster import k_means_  # note: a private sklearn module; this works on older versions
from sklearn.metrics.pairwise import cosine_similarity

def KMeans_cosine_fit(sparse_data, nclust=10, njobs=-1, randomstate=None):
    # Manually override the Euclidean distance function used inside KMeans
    def euc_dist(X, Y=None, Y_norm_squared=None, squared=False):
        # return pairwise_distances(X, Y, metric='cosine', n_jobs=10)
        return np.arccos(cosine_similarity(X, Y)) / np.pi  # angular distance in [0, 1]
    k_means_.euclidean_distances = euc_dist
    kmeans = k_means_.KMeans(n_clusters=nclust, n_jobs=njobs, random_state=randomstate)
    _ = kmeans.fit(sparse_data)
    return kmeans

I have noticed (by working through the math) that if the vectors are normalized, the standard mean works fine for the angular metric. As far as I can tell, I would have to change _mini_batch_step() in k_means_.py, but that function is quite complex and I could not figure out how to modify it.
Does anyone know an alternative solution?
Or maybe, does anyone know how I can edit this function so that it always forces the centroids to be normalized?
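An alternative that sidesteps sklearn's internals entirely is to implement the Lloyd iteration yourself and renormalize the centroids after every mean update. The sketch below is my own minimal version (the name spherical_kmeans, the convergence check, and the empty-cluster reseeding are assumptions; it expects dense, row-normalized NumPy arrays):

import numpy as np

def spherical_kmeans(X, nclust=10, n_iter=100, random_state=None):
    # X is assumed to be dense with unit-length rows
    rng = np.random.default_rng(random_state)
    # initialize centroids from randomly chosen data points
    centroids = X[rng.choice(len(X), nclust, replace=False)]
    for _ in range(n_iter):
        # assign each point to the centroid with the highest cosine similarity
        labels = (X @ centroids.T).argmax(axis=1)
        new_centroids = np.empty_like(centroids)
        for k in range(nclust):
            members = X[labels == k]
            # reseed empty clusters from a random point, otherwise take the mean
            mean = members.mean(axis=0) if len(members) else X[rng.integers(len(X))]
            # renormalize so the centroid stays on the unit sphere
            new_centroids[k] = mean / np.linalg.norm(mean)
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return centroids, labels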

小智 7

You can normalize the data and then use KMeans.

from sklearn import preprocessing  # preprocessing.normalize scales each row to unit L2 norm
from sklearn.cluster import KMeans

kmeans = KMeans().fit(preprocessing.normalize(X))
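For completeness, here is an end-to-end sketch of that answer (the random data and n_clusters=5 are placeholders). The one thing to remember is that any new points must be normalized the same way before calling predict:

import numpy as np
from sklearn import preprocessing
from sklearn.cluster import KMeans

X = np.random.rand(100, 20)                    # placeholder data
kmeans = KMeans(n_clusters=5, random_state=0).fit(preprocessing.normalize(X))

X_new = np.random.rand(10, 20)                 # new points get the same normalization
labels = kmeans.predict(preprocessing.normalize(X_new))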


sac*_*ruk 6

So it turns out you can just normalise X to be of unit length and use K-means as normal. The reason is that if X1 and X2 are unit vectors, expanding the squared Euclidean distance shows that the term inside the brackets is exactly the cosine distance:

||X1 - X2||^2 = ||X1||^2 + ||X2||^2 - 2*(X1 . X2) = 2*(1 - cos(X1, X2))
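If you want to check the identity numerically, here is a quick sanity test (my own snippet, using random vectors):

import numpy as np

x1, x2 = np.random.rand(5), np.random.rand(5)
x1, x2 = x1 / np.linalg.norm(x1), x2 / np.linalg.norm(x2)  # make them unit length

squared_euclidean = np.sum((x1 - x2) ** 2)
cosine_distance = 1 - np.dot(x1, x2)
assert np.isclose(squared_euclidean, 2 * cosine_distance)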

So in terms of using k-means, simply do:

import numpy as np
from sklearn.cluster import KMeans

length = np.sqrt((X**2).sum(axis=1))[:, None]  # L2 norm of each row
X = X / length                                 # rows now have unit length

kmeans = KMeans(n_clusters=10, random_state=0).fit(X)

And if you need the centroids and distance matrix do:

len_ = np.sqrt(np.square(kmeans.cluster_centers_).sum(axis=1)[:, None])
centers = kmeans.cluster_centers_ / len_  # renormalize the centers to unit length
dist = 1 - np.dot(centers, X.T)           # K x N matrix of cosine distances
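If you also want cluster assignments under the cosine metric, you can take the column-wise argmin of this matrix (my own addition; note these can differ slightly from kmeans.labels_, which were assigned by Euclidean distance to the unnormalized centers):

labels_cos = dist.argmin(axis=0)  # nearest normalized center for each sample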

Notes:

  • Just realised that you are trying to minimise the distance between the mean vector of the cluster and its constituents. The mean vector has length of less than one when you simply average unit vectors. In practice, it is still worth running the normal sklearn algorithm and checking the length of the mean vectors; in my case they were close to unit length (averaging around 0.9, but this depends on how dense your data is). TL;DR: use the spherecluster package, as @??? pointed out (see the sketch after these notes).

  • Related discussion from our friends on Cross Validated: https://stats.stackexchange.com/a/146279/243511
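If you do reach for spherecluster, its SphericalKMeans class follows the usual sklearn estimator API. A sketch under that assumption (check the package's documentation for version requirements; X should be L2-normalized as above):

from spherecluster import SphericalKMeans

skm = SphericalKMeans(n_clusters=10)
skm.fit(X)  # X should be L2-normalized
print(skm.labels_)
print(skm.cluster_centers_)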

Gil*_*Ahn 1

Unfortunately, no. Sklearn's current k-means implementation only uses Euclidean distance.

The reason is that k-means alternates between computing cluster centers and assigning samples to the nearest center, and only under Euclidean distance does the arithmetic mean of the samples actually serve as the center (it is the point that minimizes the total squared distance to them).

If you want to use k-means with cosine distance, you need to write your own function or class. Alternatively, try a different clustering algorithm, such as DBSCAN.
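For what it's worth, sklearn's DBSCAN accepts a cosine metric directly, so no custom code is needed there. A minimal sketch (eps and min_samples are placeholder values you would tune for your data):

from sklearn.cluster import DBSCAN

db = DBSCAN(eps=0.3, min_samples=5, metric='cosine').fit(X)
labels = db.labels_  # -1 marks noise points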