I am trying to compute simple word frequencies using scikit-learn's CountVectorizer.
import pandas as pd
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
texts = ["dog cat fish", "dog cat cat", "fish bird", "bird"]
cv = CountVectorizer()
cv_fit = cv.fit_transform(texts)
print(cv.vocabulary_)
{u'bird': 0, u'cat': 1, u'dog': 2, u'fish': 3}
I was expecting it to return {u'bird': 2, u'cat': 3, u'dog': 2, u'fish': 2}.
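For context, `vocabulary_` maps each term to its column index in the document-term matrix, not to a count, which would explain the 0, 1, 2, 3 values above. A minimal sketch of how the expected totals could be recovered, assuming Python 3 and that summing each column of the matrix returned by `fit_transform` is what is wanted (the names `totals` and `word_counts` are illustrative, not part of scikit-learn):

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer

texts = ["dog cat fish", "dog cat cat", "fish bird", "bird"]

cv = CountVectorizer()
cv_fit = cv.fit_transform(texts)  # sparse matrix, shape (n_documents, n_terms)

# Sum down each column to get the corpus-wide count of every term.
# np.asarray(...).ravel() flattens the result to a 1-D array.
totals = np.asarray(cv_fit.sum(axis=0)).ravel()

# vocabulary_ gives term -> column index, so use it to look up each total.
word_counts = {term: int(totals[idx]) for term, idx in cv.vocabulary_.items()}

print(sorted(word_counts.items()))
# [('bird', 2), ('cat', 3), ('dog', 2), ('fish', 2)]
```

The per-document counts remain available in `cv_fit` itself; the column sum only collapses them across all four texts.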