How to use TF-IDF values in k-means clustering

Question


Sid*_*Sid 2 nlp tf-idf k-means python-3.x tfidfvectorizer

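I am using the scikit-learn library to combine K-means clustering with TF-IDF. I understand that K-means uses distances to form clusters, and that a distance is computed between points given as (x-axis value, y-axis value), but TF-IDF is a single number. My question is how K-means clustering converts this TF-IDF value into an (x, y) value.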

Answer 1

alv*_*vas 6

TF-IDF is not a single value (i.e. a scalar). For each document it returns a vector, in which each value corresponds to one word in the vocabulary.

from sklearn.feature_extraction.text import TfidfVectorizer

sent1 = "the quick brown fox jumps over the lazy brown dog"
sent2 = "mr brown jumps over the lazy fox"

corpus = [sent1, sent2]
# `input` is left at its default ('content'): the documents are raw strings.
# It must not be set to the corpus itself.
vectorizer = TfidfVectorizer()

# Learn the vocabulary and IDF weights, then build the document-term matrix
X = vectorizer.fit_transform(corpus)
print(X.todense())


[out]:

matrix([[0.50077266, 0.35190925, 0.25038633, 0.25038633, 0.25038633,
         0.        , 0.25038633, 0.35190925, 0.50077266],
        [0.35409974, 0.        , 0.35409974, 0.35409974, 0.35409974,
         0.49767483, 0.35409974, 0.        , 0.35409974]])


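It returns a two-dimensional matrix in which the rows represent the sentences and the columns represent the vocabulary words.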

>>> vectorizer.vocabulary_
{'the': 8,
 'quick': 7,
 'brown': 0,
 'fox': 2,
 'jumps': 3,
 'over': 6,
 'lazy': 4,
 'dog': 1,
 'mr': 5}


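So when K-means tries to find the distance/similarity between two documents, it compares two rows of this matrix. For example, suppose the similarity is simply the dot product of the two rows: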

import numpy as np

# Dense 1-D copies of the two document rows
vector1 = X.toarray()[0]
vector2 = X.toarray()[1]
float(np.dot(vector1, vector2))


[out]:

0.7092938737640962

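The dot product is only a stand-in here: scikit-learn's `KMeans` works with squared Euclidean distance. The two are closely related for these vectors, because `TfidfVectorizer` L2-normalizes each row by default (`norm='l2'`), so the squared distance between two rows equals `2 - 2 * dot`. A small sketch to check this, with illustrative names `v1`, `v2`, `dot` and `sq_dist`:

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

corpus = [
    "the quick brown fox jumps over the lazy brown dog",
    "mr brown jumps over the lazy fox",
]

# Default settings: each row is scaled to unit Euclidean length
X = TfidfVectorizer().fit_transform(corpus).toarray()
v1, v2 = X[0], X[1]

dot = float(np.dot(v1, v2))              # similarity used in the answer
sq_dist = float(np.sum((v1 - v2) ** 2))  # squared Euclidean distance

print(np.linalg.norm(v1), np.linalg.norm(v2))  # both approximately 1
print(dot)
print(sq_dist, 2 - 2 * dot)                    # the two values agree
```

A larger dot product means a smaller distance, so ranking documents by either measure gives the same picture.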

Chris Potts has a good tutorial on how to build vector space models such as TF-IDF: http://web.stanford.edu/class/linguist236/materials/ling236-handout-05-09-vsm.pdf
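
Putting the pieces together, the TF-IDF matrix can be passed straight to K-means, one row per document. The following is a rough sketch rather than a tuned setup: the two finance sentences are made up for illustration, and `n_clusters=2`, `n_init=10` and `random_state=0` are arbitrary choices.

```python
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer

corpus = [
    "the quick brown fox jumps over the lazy brown dog",
    "mr brown jumps over the lazy fox",
    "stock markets fell sharply on monday",   # hypothetical extra document
    "investors sold stock as markets fell",   # hypothetical extra document
]

# One row per document, one column per vocabulary word
X = TfidfVectorizer().fit_transform(corpus)

# KMeans accepts the sparse matrix directly; every row is one point
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0)
labels = kmeans.fit_predict(X)

print(X.shape)                       # (number of documents, vocabulary size)
print(labels)                        # cluster index assigned to each document
print(kmeans.cluster_centers_.shape) # one centroid per cluster, same width as X
```

The cluster indices themselves are arbitrary; what matters is which documents end up sharing an index.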
