Python CountVectorizer() vocabulary_.get() returns None

Nar*_* MG 1 python nltk scikit-learn

I wrote this code based on the documentation at http://scikit-learn.org/stable/tutorial/text_analytics/working_with_text_data.html.

from sklearn.datasets import load_files
from sklearn.feature_extraction.text import CountVectorizer

count_vect = CountVectorizer()

my_bunch = load_files("c:\\temp\\billing_test\\")

my_data = my_bunch['data']
print(my_bunch.keys())
print('target_names', my_bunch['target_names'])
print('length of data', len(my_bunch['data']))


X_train_counts = count_vect.fit_transform(my_data)
print(X_train_counts.shape)

print(count_vect.vocabulary_.get(u'algorithm'))

The output is as follows:

dict_keys(['target', 'filenames', 'target_names', 'data', 'DESCR'])
target_names ['false', 'true']
length of data 920
(920, 8773)
None

I don't understand why None is printed below (920, 8773).

I have about 460 text documents in each of the "true" and "false" folders.

Thanks,

Far*_*eer 5

Because the word 'algoritham' never appears in your documents.

Maybe you should try 'algorithm'.

  • Thanks.. now I'm embarrassed.... However, even after changing it the result is still the same :( print(count_vect.vocabulary_.get(u'algorithm')) (2 upvotes)
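
To check whether a token really made it into the vocabulary, a small sketch like the following may help. The two-sentence corpus here is a hypothetical stand-in for the files loaded with load_files, and the prefix "algo" is only an example. It relies on vocabulary_ being a plain dict from token to column index, so .get() returns None for any key that was not learned, including misspellings and, with the default lowercase=True, capitalized forms.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical toy corpus standing in for my_bunch['data']
docs = [
    "The billing algorithm failed twice.",
    "Invoices are generated every month.",
]

count_vect = CountVectorizer()
X = count_vect.fit_transform(docs)

# vocabulary_ is a dict mapping each learned token to its column index
vocab = count_vect.vocabulary_

# A token that occurs in the corpus maps to an integer column index
print(vocab.get("algorithm"))

# A misspelled or absent token is not a key, so dict.get returns None
print(vocab.get("algoritham"))

# Tokens are lowercased by default, so a capitalized lookup also misses
print(vocab.get("Algorithm"))

# List the learned tokens sharing a prefix to see what was actually indexed
print(sorted(token for token in vocab if token.startswith("algo")))
```

If the last line prints an empty list on the real data, no token starting with that prefix occurs in any document, and None is the expected result of the lookup.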