Get the selected feature names from TfidfVectorizer

Question

I'm using Python and want a TF-IDF representation of a large dataset. I use the following code to convert the documents to TF-IDF form.

from sklearn.feature_extraction.text import TfidfVectorizer
tfidf_vectorizer = TfidfVectorizer(
    min_df=1,  # minimum document frequency for a term to be kept
    max_features=4000,  # maximum number of features
    strip_accents='unicode',  # replace accented unicode chars
    # with their corresponding ASCII chars
    analyzer='word',  # features made of words
    token_pattern=r'\w{1,}',  # tokens are runs of 1+ word characters
    ngram_range=(1, 1),  # features made of single tokens (unigrams)
    use_idf=True,  # enable inverse-document-frequency reweighting
    smooth_idf=True,  # add one to document frequencies to prevent zero division
    sublinear_tf=False)

tfidf_df = tfidf_vectorizer.fit_transform(df['text'])

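Here I pass the parameter max_features. The vectorizer selects the best features and returns a scipy sparse matrix. The problem is that I don't know which features were selected. How do I map the feature names back onto the scipy matrix I get? Basically, for the n selected features from m documents, I want an m x n matrix with the selected features as column names instead of their integer ids. How can I do this?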

Answer 1

Viv*_*mar 18

You can use tfidf_vectorizer.get_feature_names(). It returns the names of the features (the terms) selected from the original documents.

You can also use the tfidf_vectorizer.vocabulary_ attribute to get a dict that maps feature names to their indices, but it is not sorted. The array from get_feature_names() is ordered by index.
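
As a minimal sketch of the m x n matrix the question asks for, the feature names can be passed as column labels to a pandas DataFrame. The toy documents and the names docs, matrix and tfidf_frame are made up for illustration. One caveat: in scikit-learn 1.0 and later the method is called get_feature_names_out(), and get_feature_names() was removed in 1.2, so the sketch checks which one is available.

```python
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy corpus standing in for df['text'] (illustrative data only).
docs = pd.Series([
    "the cat sat on the mat",
    "the dog sat on the log",
    "cats and dogs",
])

vectorizer = TfidfVectorizer(max_features=4)
matrix = vectorizer.fit_transform(docs)  # scipy sparse matrix, shape (m, n)

# Newer scikit-learn exposes get_feature_names_out(); older releases
# only have get_feature_names(). Use whichever exists.
if hasattr(vectorizer, "get_feature_names_out"):
    feature_names = list(vectorizer.get_feature_names_out())
else:
    feature_names = list(vectorizer.get_feature_names())

# Column i of the matrix corresponds to feature_names[i].
# toarray() densifies the matrix, which is fine for a small example.
tfidf_frame = pd.DataFrame(
    matrix.toarray(), columns=feature_names, index=docs.index
)
print(tfidf_frame)
```

For a large corpus, densifying with toarray() may not fit in memory. pd.DataFrame.sparse.from_spmatrix(matrix, columns=feature_names) is one way to keep the data sparse while still getting named columns.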

@InsParbo You can slice the array, e.g. `arr[:5]`, to show the first 5 values. It's just an array, so view it however you need. (2 upvotes)

Answer 2

ors*_*ady 5

Use tfidf_vectorizer.vocabulary_. This gives the mapping of features (terms to their column indices).
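
Since vocabulary_ maps term to index, it has to be inverted or sorted by value to recover the column order. A small sketch of that, using made-up documents and hypothetical variable names (vocab, index_to_term, ordered_terms):

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy corpus (illustrative data only).
docs = [
    "the cat sat on the mat",
    "the dog sat on the log",
    "cats and dogs",
]

vectorizer = TfidfVectorizer(max_features=4)
vectorizer.fit(docs)

# vocabulary_ is a dict of term -> column index, in no particular order.
vocab = vectorizer.vocabulary_

# Invert it to look up the term for a given column index.
index_to_term = {index: term for term, index in vocab.items()}

# Terms listed in column order, i.e. ordered_terms[i] labels column i.
ordered_terms = [index_to_term[i] for i in range(len(index_to_term))]
print(ordered_terms)
```

Sorting the keys by their index, sorted(vocab, key=vocab.get), gives the same list in one step.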