How to find the documents that belong to the same cluster with KMeans

Question

Stu*_*ner 14 python artificial-intelligence k-means scikit-learn

I am clustering a variety of articles with the scikit-learn framework. Below are the top 15 words in each cluster:

Cluster 0: whales islands seaworld hurricane whale odile storm tropical kph mph pacific mexico orca coast cabos
Cluster 1: ebola outbreak vaccine africa usaid foundation virus cdc gates disease health vaccines experimental centers obama
Cluster 2: jones bobo sanford children carolina mississippi alabama lexington bodies crumpton mccarty county hyder tennessee sheriff
Cluster 3: isis obama iraq syria president isil airstrikes islamic li strategy terror military war threat al
Cluster 4: yosemite wildfire park evacuation dome firefighters blaze hikers cobb helicopter backcountry trails homes california evacuate

I created the bag-of-words matrix as follows:

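from sklearn.feature_extraction.text import TfidfVectorizer, TfidfTransformer
from sklearn.pipeline import make_pipeline

hasher = TfidfVectorizer(max_df=0.5,
                         min_df=2, stop_words='english',
                         use_idf=True)
# Note: TfidfVectorizer already produces TF-IDF weights, so the extra
# TfidfTransformer step re-weights that output a second time.
vectorizer = make_pipeline(hasher, TfidfTransformer())
# document_text_list is a list of all text in a given article
X_train_tfidf = vectorizer.fit_transform(document_text_list)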

Then I ran KMeans like this:

import sklearn.cluster

km = sklearn.cluster.KMeans(init='k-means++', max_iter=10000, n_init=1,
                verbose=0, n_clusters=25)
km.fit(X_train_tfidf)

I print out the clusters like this:

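print("Top terms per cluster:")
# Sort the feature indices of each centroid by descending weight
order_centroids = km.cluster_centers_.argsort()[:, ::-1]
# get_feature_names_out() needs scikit-learn >= 1.0; older releases use get_feature_names()
terms = hasher.get_feature_names_out()
for i in range(25):
    print("Cluster %d:" % i, end='')
    for ind in order_centroids[i, :15]:
        print(' %s' % terms[ind], end='')
    print()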

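However, I would like to know how to determine which documents belong to the same cluster and, ideally, each document's distance from the centroid (center) of its cluster.

I know that each row of the resulting matrix (X_train_tfidf) corresponds to a document, but there is no obvious way to get this information back after running the KMeans algorithm. How can I do this with scikit-learn?

X_train_tfidf looks like this: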

X_train_tfidf:   (0, 4661)  0.0405014425985
  (0, 19271)    0.0914545222775
  (0, 20393)    0.287636818634
  (0, 56027)    0.116893929188
  (0, 30872)    0.137815327338
  (0, 35256)    0.0343461345507
  (0, 31291)    0.209804679792
  (0, 66008)    0.0643776635222
  (0, 3806) 0.0967713285061
  (0, 66338)    0.0532881852791
  (0, 65023)    0.0702918299573
  (0, 41785)    0.197672720592
  (0, 29774)    0.120772893833
  (0, 61409)    0.0268609667042
  (0, 55527)    0.134102682463
  (0, 40011)    0.0582437010271
  (0, 19667)    0.0234843097048
  (0, 51667)    0.128270976476
  (0, 52791)    0.57198926651
  (0, 15014)    0.149195054799
  (0, 18805)    0.0277497826525
  (0, 35939)    0.170775938672
  (0, 5808) 0.0473913910636
  (0, 24922)    0.0126531527875
  (0, 10346)    0.0200098997901
  : :
  (23945, 56927)    0.0595132327966
  (23945, 23259)    0.0100977769025
  (23945, 12515)    0.0482102583442
  (23945, 49709)    0.210139450446
  (23945, 28742)    0.0190221880312
  (23945, 16628)    0.137692798005
  (23945, 53424)    0.157029848335
  (23945, 30647)    0.104485375827
  (23945, 57512)    0.0569754813269
  (23945, 39389)    0.0158180459761
  (23945, 26093)    0.0153713768922
  (23945, 9787) 0.0963777149738
  (23945, 23260)    0.158336452835
  (23945, 50595)    0.0527243936945
  (23945, 42447)    0.0527515904547
  (23945, 2829) 0.0351677269698
  (23945, 2832) 0.0175929392039
  (23945, 52079)    0.0849796887889
  (23945, 13523)    0.0878730969786
  (23945, 57849)    0.133869666381
  (23945, 25064)    0.128424780903
  (23945, 31129)    0.0919760384953
  (23945, 65601)    0.0388718258746
  (23945, 1428) 0.391477289626
  (23945, 2152) 0.655211469073
  X_train_tfidf shape: (23946, 67816)

In response to ttttthomasssss's answer:

When I try to run the following:

X_cluster_0 = X_train_tfidf[cluster_0]

I get the error:

File "cluster.py", line 52, in main
    X_cluster_0 = X_train_tfidf[cluster_0]
File "/System/Library/Frameworks/Python.framework/Versions/2.7/Extras/lib/python/scipy/sparse/csr.py", line 226, in __getitem__
    col = key[1]
IndexError: tuple index out of range

Looking at the structure of cluster_0:

(array([  858,  2012,  2256,  2762,  2920,  3770,  6052,  6174,  8296,
9494,  9966, 10085, 11914, 12117, 12633, 12727, 12993, 13527,
13754, 14186, 14669, 14713, 14973, 15071, 15157, 15208, 15926,
16300, 16301, 17138, 17556, 17775, 18236, 19057, 20106, 21014, 21080]),)

It is a tuple whose content sits at position 0, so I changed the line to the following:

X_cluster_0 = X_train_tfidf[cluster_0[0]]

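I pull the "documents" from a database, so I can easily get the index from it (by iterating over the provided array until I find the corresponding document, assuming of course that scikit-learn does not change the order of the documents in the matrix). So I don't understand exactly what X_cluster_0 represents. X_cluster_0 has the following structure: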

  X_cluster_0:   (0, 42726) 0.741747456202
  (0, 13535)    0.115880661286
  (0, 17447)    0.117608794277
  (0, 44849)    0.414829246262
  (0, 14574)    0.10214258736
  (0, 17317)    0.0634383214735
  (0, 17935)    0.0591234431875
  : :
  (17, 33867)   0.0174155914371
  (17, 48916)   0.0227046046275
  (17, 59132)   0.0168864861723
  (17, 40860)   0.0485813219503
  (17, 63725)   0.0271415763987
  (18, 45019)   0.490135684209
  (18, 36168)   0.14595160766
  (18, 52304)   0.139590524213
  (18, 63586)   0.16501953796
  (18, 28709)   0.15075416279
  (18, 11495)   0.0926490431993
  (18, 40860)   0.124236878928

Calculating the distance to the centroid

Currently, running the suggested code (distance = euclidean(X_cluster_0[0], km.cluster_centers_[0])) results in the following error:

File "cluster.py", line 68, in main
    distance = euclidean(X_cluster_0[0], km.cluster_centers_[0])
  File "/System/Library/Frameworks/Python.framework/Versions/2.7/Extras/lib/python/scipy/spatial/distance.py", line 211, in euclidean
    dist = norm(u - v)
  File "/System/Library/Frameworks/Python.framework/Versions/2.7/Extras/lib/python/scipy/sparse/compressed.py", line 197, in __sub__
    raise NotImplementedError('adding a nonzero scalar to a '
NotImplementedError: adding a nonzero scalar to a sparse matrix is not supported

This is what km.cluster_centers looks like:

km.cluster_centers: [  9.47080802e-05   2.53907413e-03   0.00000000e+00 ...,   0.00000000e+00
   0.00000000e+00   0.00000000e+00]

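I think the problem I am facing now is how to extract the i-th item of the matrix (assuming the matrix is traversed from left to right). The level of nested indexing I specify makes no difference (i.e. X_cluster_0[0], X_cluster_0[0][0] and X_cluster_0[0][0][0] all give me the same printout of the matrix structure shown above).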

Answer 1

ttt*_*sss 15

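You can use the fit_predict() function to perform the clustering and get the cluster index assigned to each document.

Getting the cluster index of each document

You could try the following: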

import numpy as np
import sklearn.cluster

km = sklearn.cluster.KMeans(init='k-means++', max_iter=10000, n_init=1,
                verbose=0, n_clusters=25)
clusters = km.fit_predict(X_train_tfidf)

# The input data has shape (m, n); clusters has shape (m,) and holds the cluster index of every document
print(X_train_tfidf.shape)
print(clusters.shape)

# Example: get all documents in cluster 0
# np.where returns a tuple of arrays; the row indices are its first element
cluster_0 = np.where(clusters == 0)

# cluster_0[0] contains the indices of all documents in this cluster; to get their rows:
X_cluster_0 = X_train_tfidf[cluster_0[0]]
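
As an alternative to calling np.where once per cluster, the fitted model also keeps the assignments in km.labels_, so every cluster can be collected in a single pass. A minimal sketch, assuming a tiny made-up corpus (docs) in place of document_text_list and 2 clusters instead of 25:

```python
from collections import defaultdict

from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical toy corpus standing in for document_text_list
docs = [
    "whale orca ocean whale",
    "orca whale sea ocean",
    "wildfire forest blaze wildfire",
    "forest blaze wildfire smoke",
]

X = TfidfVectorizer().fit_transform(docs)
km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)

# km.labels_[i] is the cluster assigned to row i of X, i.e. to docs[i]
docs_by_cluster = defaultdict(list)
for doc_index, label in enumerate(km.labels_):
    docs_by_cluster[int(label)].append(doc_index)

for label, members in sorted(docs_by_cluster.items()):
    print("Cluster %d: documents %s" % (label, members))
```

Because the row order of the matrix matches the order of the input list, the collected indices can be used directly to look up the original documents.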

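Finding the distance of each document to each centroid

You can get the centroids with centroids = km.cluster_centers_, which in your case should have shape 25 (number of clusters) x n (number of features). To calculate, for example, the Euclidean distance between a document and a centroid, you can use SciPy (its distance metrics are documented under scipy.spatial.distance):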

# Example: distance from one document to one cluster centroid
from scipy.spatial.distance import euclidean

# Note: euclidean() expects dense 1-D arrays; for a sparse X_cluster_0, see the update below
distance = euclidean(X_cluster_0[0], km.cluster_centers_[0])
print(distance)

Update: distances with sparse and dense matrices

The distance metrics in scipy.spatial.distance require dense input, so if X_cluster_0 is a sparse matrix, you can convert the row to a dense array first:

# Densify only the row that is needed and flatten it to 1-D
d = euclidean(X_cluster_0[0].toarray().ravel(), km.cluster_centers_[0])
print(d)

Alternatively, you can use scikit-learn's euclidean_distances() function, which also works with sparse matrices:

from sklearn.metrics.pairwise import euclidean_distances

# Both arguments must be 2-D, so reshape the centroid into a single row
D = euclidean_distances(X_cluster_0.getrow(0), km.cluster_centers_[0].reshape(1, -1))
# Equivalent to the SciPy example above, except that euclidean_distances returns a matrix, not a scalar
print(D)
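
To check that the two approaches agree, here is a small self-contained sketch. The 2 x 3 matrix and the all-zero centroid are made up for illustration; with real data you would pass a row of X_cluster_0 and a row of km.cluster_centers_ instead:

```python
import numpy as np
from scipy.sparse import csr_matrix
from scipy.spatial.distance import euclidean
from sklearn.metrics.pairwise import euclidean_distances

# Hypothetical 2-document, 3-feature sparse matrix and one dense centroid
X = csr_matrix(np.array([[0.0, 3.0, 4.0],
                         [1.0, 0.0, 0.0]]))
centroid = np.array([0.0, 0.0, 0.0])

# SciPy needs dense 1-D vectors, so densify just the one row
d_scipy = euclidean(X[0].toarray().ravel(), centroid)

# scikit-learn accepts the sparse row, but both inputs must be 2-D
d_sklearn = euclidean_distances(X[0], centroid.reshape(1, -1))[0, 0]

print(d_scipy, d_sklearn)
```

Densifying a single row keeps memory use small, whereas converting the whole matrix would allocate every zero entry as well.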

Note that with the scikit-learn method you can also compute the whole distance matrix at once:

# One row per document in cluster 0, one column per centroid
D = euclidean_distances(X_cluster_0, km.cluster_centers_)
print(D)
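
If you only need each document's distance to its own centroid, you can pick one entry per row out of the full distance matrix. This is a sketch under assumptions: the toy corpus docs and the 2-cluster setting are hypothetical stand-ins for your data.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import euclidean_distances

# Hypothetical toy corpus standing in for document_text_list
docs = [
    "whale orca ocean whale",
    "orca whale sea ocean",
    "wildfire forest blaze wildfire",
    "forest blaze wildfire smoke",
]

X = TfidfVectorizer().fit_transform(docs)  # sparse, as in the question
km = KMeans(n_clusters=2, n_init=10, random_state=0)
labels = km.fit_predict(X)

# Distance from every document to every centroid, shape (n_docs, n_clusters)
all_distances = euclidean_distances(X, km.cluster_centers_)

# Keep only the distance to the centroid each document was assigned to
own_distances = all_distances[np.arange(X.shape[0]), labels]

for doc_index, (label, dist) in enumerate(zip(labels, own_distances)):
    print("doc %d -> cluster %d, distance %.4f" % (doc_index, label, dist))
```

KMeans.transform(X) returns the same document-to-centroid distance matrix, so km.transform(X) can replace the euclidean_distances call here.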

Update: structure and type of X_cluster_0

X_cluster_0, like X_train_tfidf, is a sparse matrix (see the documentation for scipy.sparse.csr_matrix).

A dump such as

(0, 13535)    0.115880661286
(0, 17447)    0.117608794277
(0, 44849)    0.414829246262
(0, 14574)    0.10214258736
.             .
.             .

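is interpreted as follows: (0, 13535) refers to document 0 and feature 13535, i.e. row 0 and column 13535 of the bag-of-words matrix. The floating-point number that follows, 0.115880661286, is the tf-idf score of that feature in the given document.

To find the exact word, you can try hasher.get_feature_names()[13535] (check len(hasher.get_feature_names()) first to see how many features you have). In scikit-learn 1.0 and later the method is called get_feature_names_out().

If your corpus variable document_text_list is a list of lists, then the document for row 0 of X_train_tfidf is simply document_text_list[0]. Row i of X_cluster_0, however, is the i-th document of the cluster, i.e. row cluster_0[0][i] of X_train_tfidf, so for a dump of X_cluster_0 the document is document_text_list[cluster_0[0][i]].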

Archived: 11 years, 8 months ago
Viewed: 4469 times
Last active: 11 years, 6 months ago
