TF-IDF vectorizer for extracting n-grams

ECu*_*evs 5 python n-gram scikit-learn tfidfvectorizer

How can I use the TF-IDF vectorizer from the scikit-learn library to extract unigrams and bigrams from tweets? I want to train a classifier with the output.

This is the code from scikit-learn:

from sklearn.feature_extraction.text import TfidfVectorizer
corpus = [
    'This is the first document.',
    'This document is the second document.',
    'And this is the third one.',
    'Is this the first document?',
]
vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(corpus)

yat*_*atu 4

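TfidfVectorizer has an ngram_range parameter that sets the range of n-gram sizes to include as features in the final matrix. In your case, you want (1, 2), which covers unigrams through bigrams: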

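import pandas as pd

# ngram_range=(1, 2) extracts both unigrams and bigrams
vectorizer = TfidfVectorizer(ngram_range=(1, 2))
X = vectorizer.fit_transform(corpus).toarray()

# get_feature_names() was removed in scikit-learn 1.2; use get_feature_names_out()
pd.DataFrame(X, columns=vectorizer.get_feature_names_out())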

        and  and this  document  document is     first  first document  \
0  0.000000  0.000000  0.314532     0.000000  0.388510        0.388510   
1  0.000000  0.000000  0.455513     0.356824  0.000000        0.000000   
2  0.357007  0.357007  0.000000     0.000000  0.000000        0.000000   
3  0.000000  0.000000  0.282940     0.000000  0.349487        0.349487   

         is    is the   is this       one  ...       the  the first  \
0  0.257151  0.314532  0.000000  0.000000  ...  0.257151   0.388510   
1  0.186206  0.227756  0.000000  0.000000  ...  0.186206   0.000000   
2  0.186301  0.227873  0.000000  0.357007  ...  0.186301   0.000000   
3  0.231322  0.000000  0.443279  0.000000  ...  0.231322   0.349487   
...