TF-IDF feature weights using sklearn.feature_extraction.text.TfidfVectorizer

fas*_*oth 27 python tf-idf scikit-learn

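This page: http://scikit-learn.org/stable/modules/feature_extraction.html mentions:

As tf-idf is very often used for text features, there is also another class called TfidfVectorizer that combines all the options of CountVectorizer and TfidfTransformer in a single model.

I then followed the code there and called fit_transform() on my corpus. How do I get the weight of each feature computed by fit_transform()?

I tried: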

In [39]: vectorizer.idf_
---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
<ipython-input-39-5475eefe04c0> in <module>()
----> 1 vectorizer.idf_

AttributeError: 'TfidfVectorizer' object has no attribute 'idf_'

But this attribute is missing.

Thanks

YS-*_*S-L 78

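Since version 0.15, the IDF weight of each feature (the global term weight, not the per-document TF-IDF value) can be retrieved via the idf_ attribute of the TfidfVectorizer object: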

from sklearn.feature_extraction.text import TfidfVectorizer

corpus = ["This is very strange",
          "This is very nice"]
vectorizer = TfidfVectorizer(min_df=1)
X = vectorizer.fit_transform(corpus)
idf = vectorizer.idf_  # one IDF weight per vocabulary term
# get_feature_names_out() needs scikit-learn >= 1.0; older releases use get_feature_names()
print(dict(zip(vectorizer.get_feature_names_out(), idf.tolist())))

Output:

{'is': 1.0,
 'nice': 1.4054651081081644,
 'strange': 1.4054651081081644,
 'this': 1.0,
 'very': 1.0}

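As discussed in the comments, before version 0.15 the workaround was to access the idf_ attribute through the vectorizer's supposedly hidden _tfidf member (an instance of TfidfTransformer):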

idf = vectorizer._tfidf.idf_
print(dict(zip(vectorizer.get_feature_names(), idf)))

This should give the same output as above.

  • @YS-L This is only the IDF score, right, not the full TF-IDF? (4 upvotes)