Finding the most frequent terms in a scikit-learn classifier

Question

Nyx*_*nyx 3 python numpy scipy python-2.7 scikit-learn

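I am following the example in the scikit-learn documentation, where CountVectorizer is applied to a dataset.

Question: count_vect.vocabulary_.viewitems() lists all the terms and their frequencies. How do you sort them by number of occurrences?

sorted( count_vect.vocabulary_.viewitems() ) does not seem to work.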

Answer 1

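vocabulary_.viewitems() does not actually list the terms and their frequencies; it lists the mapping from each term to its column index. The frequencies (per document) are returned by the fit_transform method, as a sparse (coo) matrix in which rows are documents and columns are words (the column indices map to words through the vocabulary). For example, you can get the total frequencies with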

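# rows are documents, columns are terms
matrix = count_vect.fit_transform(doc_list)
# sum(axis=0) returns a 1 x n_terms matrix, so flatten it to a plain list before zipping
totals = matrix.sum(axis=0).tolist()[0]
freqs = zip(count_vect.get_feature_names(), totals)
# sort from largest to smallest
print(sorted(freqs, key=lambda x: -x[1]))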

`matrix.sum(axis=0)` returns a matrix, so it has to be converted with `matrix.sum(axis=0).tolist()[0]` before it is zipped with the feature names. (3 upvotes)
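
For anyone on Python 3, here is a minimal sketch of the same idea, combining the answer with the correction from the comment. The toy `doc_list` is made up for illustration and is not the asker's data. Instead of `get_feature_names()`, the terms are recovered by sorting `vocabulary_` by column index, which relies only on the term-to-index mapping described in the answer. It assumes a default `CountVectorizer()` with no custom tokenizer or stop words.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical toy corpus, standing in for the question's doc_list.
doc_list = [
    "the cat sat on the mat",
    "the dog sat on the log",
    "the cat chased the dog",
]

count_vect = CountVectorizer()
matrix = count_vect.fit_transform(doc_list)  # sparse, shape (n_docs, n_terms)

# vocabulary_ maps each term to its column index, so ordering the terms
# by that index lines them up with the matrix columns.
terms = sorted(count_vect.vocabulary_, key=count_vect.vocabulary_.get)

# Summing over axis 0 gives one total per column; flatten it to a 1-D array
# so that zip pairs each term with its own count.
totals = np.asarray(matrix.sum(axis=0)).ravel()

# Sort (term, total count) pairs from most to least frequent.
freqs = sorted(zip(terms, totals.tolist()), key=lambda pair: -pair[1])
for term, count in freqs:
    print(term, count)
```

Sorting on `-pair[1]` keeps the answer's largest-to-smallest order; `sorted(..., key=lambda pair: pair[1], reverse=True)` would do the same thing.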
