绘制文档tfidf 2D图

jxn*_*jxn 15 python numpy scipy k-means scikit-learn

我想绘制一个二维图形,其中x轴为术语,y轴为TFIDF得分(或文档ID),用于我的句子列表.我使用scikit learn's fit_transform()来获取scipy矩阵,但我不知道如何使用该矩阵绘制图形.我试图得到一个情节,看看我的句子可以用kmeans进行分类.

这是输出fit_transform(sentence_list):

(文件ID,术语编号)tfidf得分

(0, 1023)   0.209291711271
(0, 924)    0.174405532933
(0, 914)    0.174405532933
(0, 821)    0.15579574484
(0, 770)    0.174405532933
(0, 763)    0.159719994016
(0, 689)    0.135518787598
Run Code Online (Sandbox Code Playgroud)

这是我的代码:

sentence_list=["Hi how are you", "Good morning" ...]
vectorizer=TfidfVectorizer(min_df=1, stop_words='english', decode_error='ignore')
vectorized=vectorizer.fit_transform(sentence_list)
num_samples, num_features=vectorized.shape
print "num_samples:  %d, num_features: %d" %(num_samples,num_features)
num_clusters=10
km=KMeans(n_clusters=num_clusters, init='k-means++',n_init=10, verbose=1)
km.fit(vectorized)
PRINT km.labels_   # Returns a list of clusters ranging 0 to 10 
Run Code Online (Sandbox Code Playgroud)

谢谢,

ely*_*ase 32

当你使用Bag of Words时,你的每个句子都会在一个长度等于词汇的高维空间中表示.如果要在2D中表示这一点,则需要减小尺寸,例如使用具有两个组件的PCA:

from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.decomposition import PCA
from sklearn.pipeline import Pipeline
import matplotlib.pyplot as plt

newsgroups_train = fetch_20newsgroups(subset='train', 
                                      categories=['alt.atheism', 'sci.space'])
pipeline = Pipeline([
    ('vect', CountVectorizer()),
    ('tfidf', TfidfTransformer()),
])        
X = pipeline.fit_transform(newsgroups_train.data).todense()

pca = PCA(n_components=2).fit(X)
data2D = pca.transform(X)
plt.scatter(data2D[:,0], data2D[:,1], c=data.target)
plt.show()              #not required if using ipython notebook
Run Code Online (Sandbox Code Playgroud)

data2d

现在您可以计算并绘制群集输入此数据:

from sklearn.cluster import KMeans

kmeans = KMeans(n_clusters=2).fit(X)
centers2D = pca.transform(kmeans.cluster_centers_)

plt.hold(True)
plt.scatter(centers2D[:,0], centers2D[:,1], 
            marker='x', s=200, linewidths=3, c='r')
plt.show()              #not required if using ipython notebook
Run Code Online (Sandbox Code Playgroud)

在此输入图像描述

  • 我得到错误的'plt.scatter(data2D [:,0],data2D [:,1],c = data.target)`具体来说`c = data.target`.如果我想将散点图的颜色调整为kmeans发现的簇的颜色,我应该用什么代替`data.target`?`kmeans.label_`?#this返回一个列表. (8认同)
  • 代替data.target使用newsgroups_train.target (3认同)