Unsupervised learning: clustering a 1D array

Question

dre*_*934 -2 python cluster-analysis unsupervised-learning scikit-learn

I have the following array:

y = [1,2,4,7,9,5,4,7,9,56,57,54,60,200,297,275,243]

What I want to do is extract the cluster with the highest values. That would be

best_cluster = [200,297,275,243]

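I have looked through many Stack Overflow questions on this topic, and most of them suggest using kmeans, although some people mention that kmeans may be overkill for clustering a 1D array. However, kmeans is a supervised learning algorithm, which means I would have to pass in the number of centroids. Since I need to generalize this to other arrays, I cannot pass the number of centroids for each one. So I am thinking of implementing some kind of unsupervised learning algorithm that can figure out the clusters on its own and select the highest one. In array y I would see 3 clusters: [1,2,4,7,9,5,4,7,9], [56,57,54,60], [200,297,275,243]. Which algorithm would best fit my needs, considering computational cost and accuracy, and how could I implement it for my problem?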

Answer 1

Eas*_*onL 5

Try MeanShift. From the scikit-learn user guide for MeanShift:

The algorithm automatically sets the number of clusters, ...

Modified demo code:

import numpy as np
from sklearn.cluster import MeanShift, estimate_bandwidth

# #############################################################################
# Generate sample data
X = [1,2,4,7,9,5,4,7,9,56,57,54,60,200,297,275,243]
X = np.reshape(X, (-1, 1))

# #############################################################################
# Compute clustering with MeanShift

# With bandwidth=None (below), MeanShift estimates the bandwidth itself.
# To set it explicitly, compute it first, for example:
# bandwidth = estimate_bandwidth(X, quantile=0.2, n_samples=100)

ms = MeanShift(bandwidth=None, bin_seeding=True)
ms.fit(X)
labels = ms.labels_
cluster_centers = ms.cluster_centers_

labels_unique = np.unique(labels)
n_clusters_ = len(labels_unique)

print("number of estimated clusters : %d" % n_clusters_)
print(labels)

Output:

number of estimated clusters : 2
[0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 1 1]
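
The question asks for the highest cluster rather than just the labels, so here is a minimal sketch of that last step. The helper name `highest_cluster` is made up for illustration, and "highest" is assumed to mean the cluster with the largest mean. The labels are copied from the output above; in practice you would pass `ms.labels_` and the original 1D values instead.

```python
import numpy as np

def highest_cluster(values, labels):
    """Return the members of the cluster with the largest mean value."""
    values = np.asarray(values)
    labels = np.asarray(labels)
    # Mean of each cluster, keyed by its label.
    means = {lab: values[labels == lab].mean() for lab in np.unique(labels)}
    best_label = max(means, key=means.get)
    return values[labels == best_label].tolist()

X = [1, 2, 4, 7, 9, 5, 4, 7, 9, 56, 57, 54, 60, 200, 297, 275, 243]
# Labels copied from the MeanShift output shown above.
labels = [0] * 13 + [1] * 4
best_cluster = highest_cluster(X, labels)
print(best_cluster)
```

MeanShift merged the two lower groups into one cluster here, but the top cluster still matches the `best_cluster` from the question.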

Note that MeanShift does not scale well with the number of samples. The recommended upper limit is 10,000.

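By the way, as rahlf23 already mentioned, K-means is an unsupervised learning algorithm. The fact that you have to specify the number of clusters does not make it supervised.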
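
To make that point concrete, here is a rough sketch of plain Lloyd's k-means for 1D data in NumPy. It is not scikit-learn's implementation: the function name `kmeans_1d` and the quantile-based starting centers are assumptions chosen to keep the example deterministic. The only inputs are the data and `k`. No target labels are passed anywhere, which is what makes the method unsupervised.

```python
import numpy as np

def kmeans_1d(values, k, n_iter=100):
    """Plain Lloyd's algorithm for 1D data; returns (labels, centers)."""
    x = np.asarray(values, dtype=float)
    # Deterministic start: k centers spread over the data's quantiles.
    centers = np.quantile(x, np.linspace(0.0, 1.0, k))
    for _ in range(n_iter):
        # Assignment step: each point joins its nearest center.
        labels = np.argmin(np.abs(x[:, None] - centers[None, :]), axis=1)
        # Update step: each center moves to the mean of its points
        # (an empty cluster keeps its previous center).
        new_centers = np.array([
            x[labels == j].mean() if np.any(labels == j) else centers[j]
            for j in range(k)
        ])
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return labels, centers

y = [1, 2, 4, 7, 9, 5, 4, 7, 9, 56, 57, 54, 60, 200, 297, 275, 243]
labels, centers = kmeans_1d(y, k=3)
# Members of the cluster whose center is largest.
top = np.asarray(y)[labels == np.argmax(centers)]
print(top.tolist())
```

With `k=3` this recovers the three groups listed in the question. The catch the asker raised still applies: `k` has to come from somewhere, which is why MeanShift is suggested above.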
