Replace silhouette with inertia

Tes*_*est 5 python cluster-analysis machine-learning k-means unsupervised-learning

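I have a question. I am working with k-means and want to find the optimal number of clusters. Unfortunately, my dataset is too large to apply silhouette analysis. Is there a way to adapt this code and replace silhouette with inertia?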

Minimal working example:

from sklearn.cluster import KMeans
import numpy as np
from sklearn.metrics import silhouette_score
import matplotlib as mpl
import matplotlib.pyplot as plt

X = np.array([[1, 2], [1, 4], [1, 0],
              [10, 2], [10, 4], [10, 0],
              [10, 2], [10, 4], [10, 0],
              [1, 2], [1, 4], [1, 0],
              [10, 2], [10, 4], [10, 0],
              [10, 2], [10, 4], [10, 0],
              [1, 2], [1, 4], [1, 0],
              [10, 2], [10, 4], [10, 0],
              [10, 2], [10, 4], [10, 0],
              [1, 2], [1, 4], [1, 0],])

kmeans_per_k = [KMeans(n_clusters=k, random_state=42).fit(X)
                for k in range(1, 10)]
inertias = [model.inertia_ for model in kmeans_per_k]

silhouette_scores = [silhouette_score(X, model.labels_)
                     for model in kmeans_per_k[1:]]


from sklearn.metrics import silhouette_samples
from matplotlib.ticker import FixedLocator, FixedFormatter

plt.figure(figsize=(11, 9))

for k in (3, 4, 5, 6):
    plt.subplot(2, 2, k - 2)
    
    y_pred = kmeans_per_k[k - 1].labels_
    silhouette_coefficients = silhouette_samples(X, y_pred)

    padding = len(X) // 30
    pos = padding
    ticks = []
    for i in range(k):
        coeffs = silhouette_coefficients[y_pred == i]
        coeffs.sort()

        color = mpl.cm.Spectral(i / k)
        plt.fill_betweenx(np.arange(pos, pos + len(coeffs)), 0, coeffs,
                          facecolor=color, edgecolor=color, alpha=0.7)
        ticks.append(pos + len(coeffs) // 2)
        pos += len(coeffs) + padding

    plt.gca().yaxis.set_major_locator(FixedLocator(ticks))
    plt.gca().yaxis.set_major_formatter(FixedFormatter(range(k)))
    if k in (3, 5):
        plt.ylabel("Cluster")
    
    if k in (5, 6):
        plt.gca().set_xticks([-0.1, 0, 0.2, 0.4, 0.6, 0.8, 1])
        plt.xlabel("Silhouette Coefficient")
    else:
        plt.tick_params(labelbottom=False)

    plt.axvline(x=silhouette_scores[k - 2], color="red", linestyle="--")
    plt.title("$k={}$".format(k), fontsize=16)

#save_fig("silhouette_analysis_plot")
plt.show()

What I want, using inertia: [image]

met*_*eti 2

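First, I would suggest computing the silhouette score on a subset of the data, using the parameters sample_size and random_state (for reproducibility). This saves time while still computing and plotting fairly comprehensive information (how to use it). But as you know, there are many options for measuring and visualizing clustering quality. The elbow method (inertia) you mentioned can be used like this: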

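import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Toy data: 100 points drawn around 3 centers
X, y = make_blobs(n_samples=100, centers=3, n_features=2,
                  random_state=0)
# Fit k-means for k = 2..11 and record the inertia of each fit
# (random_state fixed for reproducibility)
ks = np.arange(2, 12)
scores = [KMeans(n_clusters=k, random_state=0).fit(X).inertia_
          for k in ks]
# Recent seaborn versions require x and y as keyword arguments
sns.lineplot(x=ks, y=scores)
plt.xlabel('Number of clusters')
plt.ylabel("Inertia")
plt.title("Inertia of k-Means versus number of clusters")
plt.show()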

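The same elbow plot can also be drawn with plain matplotlib while keeping the fitted models in a list, as the question's code does. The following is only a sketch: the make_blobs data is a stand-in for your own X, and the range of k and the n_init value are assumptions you may want to change.

```python
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Stand-in data (assumption); replace with your own X.
X, _ = make_blobs(n_samples=100, centers=3, n_features=2, random_state=0)

# Fit one model per k and keep them, as in the question.
ks = list(range(1, 10))
kmeans_per_k = [KMeans(n_clusters=k, n_init=10, random_state=42).fit(X)
                for k in ks]

# inertia_ is the sum of squared distances of samples to their
# closest cluster center; it is available right after fitting.
inertias = [model.inertia_ for model in kmeans_per_k]

fig, ax = plt.subplots(figsize=(8, 3.5))
ax.plot(ks, inertias, "bo-")
ax.set_xlabel("$k$")
ax.set_ylabel("Inertia")
ax.set_title("Inertia versus number of clusters")
```

Call plt.show() to display the figure. Inertia is read directly from each fitted model, so unlike the silhouette coefficient it needs no extra pairwise-distance computation, which is what makes it practical on large datasets. The "elbow" is the value of k after which the curve flattens out.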

This article covers several useful, simple techniques for assessing clustering quality.
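Returning to the first suggestion, here is a minimal sketch of a subsampled silhouette score. The dataset, the range of k, and sample_size=1000 are illustrative assumptions, not recommendations; sample_size and random_state are the silhouette_score parameters mentioned above.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Stand-in for a large dataset (assumption); replace with your own X.
X, _ = make_blobs(n_samples=5000, centers=3, n_features=2, random_state=0)

sampled_scores = {}
for k in range(2, 7):
    labels = KMeans(n_clusters=k, n_init=10, random_state=42).fit_predict(X)
    # The score is computed on a random subset of 1000 points
    # instead of on all samples.
    sampled_scores[k] = silhouette_score(X, labels,
                                         sample_size=1000, random_state=42)

# k with the highest sampled silhouette score
best_k = max(sampled_scores, key=sampled_scores.get)
for k, score in sampled_scores.items():
    print(f"k={k}: sampled silhouette = {score:.3f}")
```

The sampled score is an estimate, so it is worth checking that the ranking of k values stays stable for a few different random_state values before relying on best_k.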