按tf-idf对TfidfVectorizer输出进行排序(从最低到最高,反之亦然)

Chr*_* T. 7 python ranking tf-idf scikit-learn

我在部分文本数据上使用了来自sklearn的TfidfVectorizer(),以了解每个功能(词)的术语频率感。我当前的代码如下

from sklearn.feature_extraction.text import TfidfVectorizer
tfidf = TfidfVectorizer(analyzer='word', stop_words = 'english')

# fit_transform on training data
X_traintfidf = tfidf.fit_transform(X_train)
Run Code Online (Sandbox Code Playgroud)

如果我想将“ X_traintfidf”中每个术语的tf-idf值从最低到最高排序(反之亦然),例如top10,并将这些排序的tf-idf值排名分为两个Series对象,我应该如何进行从我的代码的最后一行开始?

谢谢。

我在读类似的主题,但不知道该怎么做。也许有人可以将该主题中显示的提示与此处的问题联系起来。

Ade*_*újo 7

之后的fit_transform(),您将可以通过get_feature_names()方法访问现有词汇表。你可以这样做:

terms = tfidf.get_feature_names()

# sum tfidf frequency of each term through documents
sums = X_traintfidf.sum(axis=0)

# connecting term to its sums frequency
data = []
for col, term in enumerate(terms):
    data.append( (term, sums[0,col] ))

ranking = pd.DataFrame(data, columns=['term','rank'])
print(ranking.sort_values('rank', ascending=False))
Run Code Online (Sandbox Code Playgroud)