Ari*_*Ari 3 python cluster-analysis k-means
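I have converted a large set of PDF documents to text and compiled them into a dictionary. I know there are 3 distinct document types, and I want to use clustering to group them automatically: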
dict_of_docs = {'document_1':'contents of document', 'document_2':'contents of document', 'document_3':'contents of document',...'document_100':'contents of document'}
Then I vectorized the dictionary's values:
vectorizer = TfidfVectorizer(stop_words='english')
X = vectorizer.fit_transform(dict_of_docs.values())
My X output looks like this:
(0, 768) 0.05895270500636258
(0, 121) 0.11790541001272516
(0, 1080) 0.05895270500636258
(0, 87) 0.2114378682212116
(0, 1458) 0.1195944498355368
(0, 683) 0.0797296332236912
(0, 1321) 0.12603709835806634
(0, 630) 0.12603709835806634
(0, 49) 0.12603709835806634
(0, 750) 0.12603709835806634
(0, 1749) 0.10626171032944469
(0, 478) 0.12603709835806634
(0, 1632) 0.14983692373373858
(0, 177) 0.12603709835806634
(0, 653) 0.0497440271723707
(0, 1268) 0.13342186854440274
(0, 1489) 0.07052056544031632
(0, 72) 0.12603709835806634
...etc etc
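Then I converted it to an array with X = X.toarray().
I am now at the stage of trying to scatter-plot the clusters from my real data with matplotlib. After that, I want to use what the clustering learns to sort the documents. Every guide I have followed uses made-up data arrays, and none shows how to get from real-world data to something usable in the way they demonstrate.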
How do I get from the array of vectorized data to a scatter plot?
It takes only a few steps: clustering, dimensionality reduction, plotting, and debugging.
We fit K-Means on X (our TF-IDF vectorized dataset).
from sklearn.cluster import KMeans

NUMBER_OF_CLUSTERS = 3
km = KMeans(
    n_clusters=NUMBER_OF_CLUSTERS,
    init='k-means++',
    max_iter=500)
km.fit(X)
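To then sort the documents by what the clustering found, the fitted labels can be lined up with the dictionary keys. The following is a minimal sketch, assuming a small made-up `dict_of_docs` (the texts are invented for illustration; `names` and `docs_by_cluster` are illustrative names). It relies on dicts preserving insertion order (Python 3.7+), so `keys()` and `values()` line up row by row with `X`.

```python
from collections import defaultdict

from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical stand-in for the real dict_of_docs.
dict_of_docs = {
    "document_1": "invoice total amount due payment invoice",
    "document_2": "payment invoice amount due total",
    "document_3": "contract party agreement clause contract",
    "document_4": "agreement clause party contract terms",
    "document_5": "report quarterly results summary report",
    "document_6": "summary results quarterly report figures",
}

# Row i of X corresponds to names[i].
names = list(dict_of_docs.keys())
X = TfidfVectorizer(stop_words="english").fit_transform(dict_of_docs.values())

km = KMeans(n_clusters=3, init="k-means++", n_init=10, max_iter=500, random_state=0)
labels = km.fit_predict(X)

# Group document names by the cluster label assigned to them.
docs_by_cluster = defaultdict(list)
for name, label in zip(names, labels):
    docs_by_cluster[int(label)].append(name)

for label in sorted(docs_by_cluster):
    print(label, docs_by_cluster[label])
```

The cluster numbers themselves are arbitrary, so which number ends up meaning which document type has to be read off from the contents of each group.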
from sklearn.decomposition import PCA

# First: for every document we get its corresponding cluster
clusters = km.predict(X)
# PCA needs a dense array. X is sparse straight out of the vectorizer,
# but already dense if X.toarray() was called earlier.
X_dense = X.toarray() if hasattr(X, 'toarray') else X
pca = PCA(n_components=2)
two_dim = pca.fit_transform(X_dense)
scatter_x = two_dim[:, 0]  # first principal component
scatter_y = two_dim[:, 1]  # second principal component
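If the corpus is large, densifying the TF-IDF matrix for PCA may use a lot of memory. One alternative is scikit-learn's `TruncatedSVD`, which accepts sparse input directly. It is a different technique from the PCA step above and not something the answer itself uses. Unlike PCA it does not center the data, so the coordinates will not match PCA's. A minimal sketch, with an invented `docs` list standing in for the real texts:

```python
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical documents, used only to make the sketch runnable.
docs = [
    "the pitcher threw the baseball to the catcher",
    "the team won the baseball game last night",
    "the doctor prescribed medicine for the patient",
    "the patient visited the doctor at the clinic",
    "the car engine needs a new oil filter",
    "the mechanic repaired the car engine",
]

X_sparse = TfidfVectorizer().fit_transform(docs)

# Reduce the sparse matrix to two columns without calling toarray().
svd = TruncatedSVD(n_components=2, random_state=0)
two_dim_svd = svd.fit_transform(X_sparse)

print(two_dim_svd.shape)
print(svd.explained_variance_ratio_)
```

The two columns of `two_dim_svd` can be used for the scatter plot in the same way as the PCA output.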
We plot each cluster in a pre-assigned color.
import numpy as np
import matplotlib.pyplot as plt

plt.style.use('ggplot')
fig, ax = plt.subplots()
fig.set_size_inches(20, 10)
# color map for NUMBER_OF_CLUSTERS we have
cmap = {0: 'green', 1: 'blue', 2: 'red'}
# group by clusters and scatter plot every cluster
# with a colour and a label
for group in np.unique(clusters):
    ix = np.where(clusters == group)
    ax.scatter(scatter_x[ix], scatter_y[ix], c=cmap[group], label=group)
ax.legend()
plt.xlabel("PCA 0")
plt.ylabel("PCA 1")
plt.show()
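To see where each cluster's center falls on the same plot, the K-Means centroids can be projected through the same fitted PCA. A minimal sketch, assuming a random dense array as a hypothetical stand-in for the TF-IDF matrix; `X_dense` and `centroids_2d` are illustrative names:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
# Hypothetical dense stand-in for the TF-IDF matrix: 30 "documents", 8 features.
X_dense = rng.random((30, 8))

km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X_dense)
pca = PCA(n_components=2).fit(X_dense)

# Project both the documents and the cluster centers into the same 2-D plane.
two_dim = pca.transform(X_dense)
centroids_2d = pca.transform(km.cluster_centers_)

print(centroids_2d.shape)  # one (x, y) pair per cluster
```

The resulting rows could be passed to `ax.scatter(centroids_2d[:, 0], centroids_2d[:, 1], marker='x', c='black')` before `plt.show()`.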
Print the top 10 words for each cluster.
order_centroids = km.cluster_centers_.argsort()[:, ::-1]
# get_feature_names() was removed in scikit-learn 1.2;
# get_feature_names_out() is available from 1.0 onward.
terms = vectorizer.get_feature_names_out()
for i in range(NUMBER_OF_CLUSTERS):
    print("Cluster %d:" % i, end='')
    for ind in order_centroids[i, :10]:
        print(' %s' % terms[ind], end='')
    print()
# Cluster 0: com edu medical yeast know cancer does doctor subject lines
# Cluster 1: edu game games team baseball com year don pitcher writes
# Cluster 2: edu car com subject organization lines university writes article