ECu*_*evs 5 python n-gram scikit-learn tfidfvectorizer
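How can I use the TF-IDF vectorizer from the scikit-learn library to extract unigrams and bigrams from tweets? I want to use the output to train a classifier.
This is the code from scikit-learn: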
from sklearn.feature_extraction.text import TfidfVectorizer
corpus = [
'This is the first document.',
'This document is the second document.',
'And this is the third one.',
'Is this the first document?',
]
vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(corpus)
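TfidfVectorizer has an ngram_range parameter that sets the range of n-gram sizes to include as features in the final matrix. In your case, you want (1, 2), which covers everything from unigrams to bigrams: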
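import pandas as pd
# Include both unigrams and bigrams as features
vectorizer = TfidfVectorizer(ngram_range=(1, 2))
X = vectorizer.fit_transform(corpus).toarray()
# get_feature_names() was removed in scikit-learn 1.2; use get_feature_names_out() (1.0+)
pd.DataFrame(X, columns=vectorizer.get_feature_names_out())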
        and  and this  document  document is     first  first document  \
0  0.000000  0.000000  0.314532     0.000000  0.388510        0.388510
1  0.000000  0.000000  0.455513     0.356824  0.000000        0.000000
2  0.357007  0.357007  0.000000     0.000000  0.000000        0.000000
3  0.000000  0.000000  0.282940     0.000000  0.349487        0.349487
         is    is the   is this       one  ...       the  the first  \
0  0.257151  0.314532  0.000000  0.000000  ...  0.257151   0.388510
1  0.186206  0.227756  0.000000  0.000000  ...  0.186206   0.000000
2  0.186301  0.227873  0.000000  0.357007  ...  0.186301   0.000000
3  0.231322  0.000000  0.443279  0.000000  ...  0.231322   0.349487
...