如何使用 python 将标签簇与 K-Means 的真实标签进行匹配

Question

如何使用 python 将标签簇与 K-Means 的真实标签进行匹配

Shy*_*ono 6 python numpy k-means flask scikit-learn

我在使用 Kmeans 算法处理标签数据时遇到问题。我的测试句子得到了真实的聚类，但我没有得到真实的标签。我已经使用 numpy 将簇与 true_label_test 进行匹配，但是这个 kmeans 可以移动簇，真实标签与簇的数量不匹配。我需要帮助解决这个问题。这是我的代码

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.neighbors import KNeighborsClassifier
from sklearn.cluster import KMeans
from nltk.corpus import stopwords 
from nltk.stem.wordnet import WordNetLemmatizer
import string
import re
import numpy as np
from collections import Counter

stop = set(stopwords.words('indonesian'))
exclude = set(string.punctuation) 
lemma = WordNetLemmatizer()

# Cleaning the text sentences so that punctuation marks, stop words & digits are removed  
def clean(doc):
    stop_free = " ".join([i for i in doc.lower().split() if i not in stop])
    punc_free = ''.join(ch for ch in stop_free if ch not in exclude)
    normalized = " ".join(lemma.lemmatize(word) for word in punc_free.split())
    processed = re.sub(r"\d+","",normalized)
    y = processed.split()
    #print (y)
    return y

path = "coba.txt"

train_clean_sentences = []
fp = open(path,'r')
for line in fp:
    line = line.strip()
    cleaned = clean(line)
    cleaned = ' '.join(cleaned)
    train_clean_sentences.append(cleaned)

#print(train_clean_sentences)
       
vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(train_clean_sentences)

# Clustering the training 30 sentences with K-means technique
modelkmeans = KMeans(n_clusters=3, init='k-means++', max_iter=200, n_init=100)
modelkmeans.fit(X)

teks_satu = "Aplikasi Machine Learning untuk mengenali daun mangga dengan metode CNN"

test_clean_sentence = []

cleaned_test = clean(teks_satu)
cleaned = ' '.join(cleaned_test)
cleaned = re.sub(r"\d+","",cleaned)
test_clean_sentence.append(cleaned)
    
Test = vectorizer.transform(test_clean_sentence) 

true_test_labels = ['AI','VR','Sistem Informasi']

predicted_labels_kmeans = modelkmeans.predict(Test)
print(predicted_labels_kmeans)

print ("\n-------------------------------PREDICTIONS BY K-Means--------------------------------------")
print ("\nIndex of Virtual Reality : ",Counter(modelkmeans.labels_[5:10]).most_common(1)[0][0])
print ("Index of Machine Learning : ",Counter(modelkmeans.labels_[0:5]).most_common(1)[0][0]) 
print ("Index of Sistem Informasi : ",Counter(modelkmeans.labels_[10:15]).most_common(1)[0][0])
print ("\n",teks_satu,":",true_test_labels[np.int(predicted_labels_kmeans)],":",predicted_labels_kmeans)

Run Code Online (Sandbox Code Playgroud)

Answer 1

Alb*_*ieu 1

下面是一个具体示例，展示了如何将KMeans集群 id 与训练数据标签进行匹配。基本思想是confusion_matrix假设分类正确完成，其对角线上应具有较大的值。这是将聚类中心 id 与训练标签关联之前的混淆矩阵：

cm = 
array([[  0, 395,   0,   5,   0],
       [  0,   2,   5, 391,   2],
       [  2,   0,   0,   0, 398],
       [  0,   0, 400,   0,   0],
       [398,   0,   0,   0,   2]])

Run Code Online (Sandbox Code Playgroud)

现在我们只需要对混淆矩阵重新排序，使其大值重新定位在对角线上。它可以轻松实现

cm_argmax = cm.argmax(axis=0)
cm_argmax
y_pred_ = np.array([cm_argmax[i] for i in y_pred])

Run Code Online (Sandbox Code Playgroud)

这里我们得到了新的混淆矩阵，现在看起来很熟悉，对吧？

cm_ = 
array([[395,   5,   0,   0,   0],
       [  2, 391,   2,   5,   0],
       [  0,   0, 398,   0,   2],
       [  0,   0,   0, 400,   0],
       [  0,   0,   2,   0, 398]])

Run Code Online (Sandbox Code Playgroud)

您可以进一步验证结果accuracy_score

y_pred_ = np.array([cm_argmax[i] for i in y_pred])
accuracy_score(y,y_pred_)
# 0.991

Run Code Online (Sandbox Code Playgroud)

完整的独立代码在这里：

import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import confusion_matrix,accuracy_score
blob_centers = np.array(
    [[ 0.2,  2.3],
     [-1.5 ,  2.3],
     [-2.8,  1.8],
     [-2.8,  2.8],
     [-2.8,  1.3]])
blob_std = np.array([0.4, 0.3, 0.1, 0.1, 0.1])
X, y = make_blobs(n_samples=2000, centers=blob_centers,
                  cluster_std=blob_std, random_state=7)

def plot_clusters(X, y=None):
    plt.scatter(X[:, 0], X[:, 1], c=y, s=1)
    plt.xlabel("$x_1$", fontsize=14)
    plt.ylabel("$x_2$", fontsize=14, rotation=0)

plt.figure(figsize=(8, 4))
plot_clusters(X)
plt.show()

k = 5
kmeans = KMeans(n_clusters=k, random_state=42)
y_pred = kmeans.fit_predict(X)
cm = confusion_matrix(y, y_pred)
cm
cm_argmax = cm.argmax(axis=0)
cm_argmax
y_pred_ = np.array([cm_argmax[i] for i in y_pred])
cm_ = confusion_matrix(y, y_pred)
cm_
accuracy_score(y,y_pred_)

Run Code Online (Sandbox Code Playgroud)

归档时间：	5 年，6 月前
查看次数：	5371 次
最近记录：	2 年，1 月前