How to see the top n entries of the term-document matrix after tfidf in scikit-learn

Amr*_*hna 37 python numpy tf-idf top-n scikit-learn

I am new to scikit-learn, and I am using TfidfVectorizer to find the tfidf values of terms in a set of documents. I am using the following code to obtain them.

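from sklearn.feature_extraction.text import TfidfVectorizer
# lectures is the asker's list of document strings
vectorizer = TfidfVectorizer(stop_words='english', ngram_range=(1, 5), lowercase=True)
X = vectorizer.fit_transform(lectures)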

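Now if I print X, I can see all the entries in the matrix, but how can I find the top n entries based on tfidf score? In addition, is there any method that will help me find the top n entries based on tfidf score for each ngram size, i.e. the top entries among unigrams, bigrams, trigrams and so on?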

YS-*_*S-L 54

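As of version 0.15, the global term weighting of the features learned by a TfidfVectorizer can be accessed through the attribute idf_, which returns an array whose length equals the number of features. Sort the features by this weight to get the highest-weighted ones: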

from sklearn.feature_extraction.text import TfidfVectorizer
import numpy as np

lectures = ["this is some food", "this is some drink"]
vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(lectures)
# idf_ holds one global IDF weight per feature; sort indices from highest to lowest
indices = np.argsort(vectorizer.idf_)[::-1]
# get_feature_names_out() needs scikit-learn >= 1.0; older releases used get_feature_names()
features = vectorizer.get_feature_names_out()
top_n = 2
top_features = [features[i] for i in indices[:top_n]]
print(top_features)

Output:

['food', 'drink']

The second question, getting the top features by ngram, can be done using the same idea, with a few extra steps to split the features into groups:

from sklearn.feature_extraction.text import TfidfVectorizer
from collections import defaultdict

lectures = ["this is some food", "this is some drink"]
vectorizer = TfidfVectorizer(ngram_range=(1, 2))
X = vectorizer.fit_transform(lectures)
# Group (feature, idf weight) pairs by the number of words in the feature
features_by_gram = defaultdict(list)
# get_feature_names_out() needs scikit-learn >= 1.0; older releases used get_feature_names()
for f, w in zip(vectorizer.get_feature_names_out(), vectorizer.idf_):
    features_by_gram[len(f.split(' '))].append((f, w))
top_n = 2
for gram, features in features_by_gram.items():
    # Highest weight first; sorted() is stable, so ties stay in alphabetical order
    top_features = sorted(features, key=lambda x: x[1], reverse=True)[:top_n]
    top_features = [f[0] for f in top_features]
    print('{}-gram top:'.format(gram), top_features)

Output:

1-gram top: ['drink', 'food']
2-gram top: ['some drink', 'some food']

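  • How exactly do I get the top n ngrams for each document in lectures, rather than the top k overall? (6 upvotes)
  • This seems to sort alphabetically rather than by TF-IDF. (2 upvotes)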