Chr*_* T. 7 python ranking tf-idf scikit-learn
I used TfidfVectorizer() from sklearn on some of my text data to get a sense of the term frequency of each feature (word). My current code is as follows:
from sklearn.feature_extraction.text import TfidfVectorizer
tfidf = TfidfVectorizer(analyzer='word', stop_words = 'english')
# fit_transform on training data
X_traintfidf = tfidf.fit_transform(X_train)
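If I want to sort the tf-idf values of each term in "X_traintfidf" from lowest to highest (and vice versa), say the top 10, and put these sorted tf-idf rankings into two Series objects, how should I proceed from the last line of my code?
Thanks.
I was reading a similar thread but couldn't figure out how to do it. Maybe someone can connect the tips shown in that thread to my question here.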
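After fit_transform(), you can access the fitted vocabulary through the get_feature_names_out() method (get_feature_names() in older releases; it was removed in scikit-learn 1.2). You can do something like this:
import pandas as pd

# get_feature_names() was removed in scikit-learn 1.2; use get_feature_names_out()
terms = tfidf.get_feature_names_out()
# sum the tf-idf weights of each term across all documents (1 x n_terms matrix)
sums = X_traintfidf.sum(axis=0)
# pair each term with its summed tf-idf weight
data = []
for col, term in enumerate(terms):
    data.append((term, sums[0, col]))
# 'rank' holds the summed weight; sort descending to list the top terms first
ranking = pd.DataFrame(data, columns=['term', 'rank'])
print(ranking.sort_values('rank', ascending=False))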