tum*_*eed 9 python nlp machine-learning scikit-learn
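I want to vectorize text with scikit-learn, and my data is a list of lists. I read my training texts from a directory and end up with something like this: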
corpus = [["this is spam, 'SPAM'"],["this is ham, 'HAM'"],["this is nothing, 'NOTHING'"]]
from sklearn.feature_extraction.text import CountVectorizer
vect = CountVectorizer(analyzer='word')
vect_representation= vect.fit_transform(corpus)
print(vect_representation.toarray())
I get the following:
return lambda x: strip_accents(x.lower())
AttributeError: 'list' object has no attribute 'lower'
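There is also the matter of the label at the end of each document: how should I treat the labels so that classification works correctly?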
tum*_*eed 14
For anyone who finds this in the future, this solved my problem:
corpus = [["this is spam, 'SPAM'"],["this is ham, 'HAM'"],["this is nothing, 'NOTHING'"]]
from sklearn.feature_extraction.text import CountVectorizer
bag_of_words = CountVectorizer(tokenizer=lambda doc: doc, lowercase=False).fit_transform(splited_labels_from_corpus)
When I call .toarray() on the result, this is the output:
[[0 0 1]
[1 0 0]
[0 1 0]]
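The snippet above never shows how splited_labels_from_corpus is built. Here is a minimal sketch of one way it could be derived. It assumes that each document's label is the quoted token after the last comma; that parsing rule is a guess, not something stated in the post. With that assumption, the identity tokenizer treats each one-element list as already tokenized, and the result is a one-hot row per label.

```python
from sklearn.feature_extraction.text import CountVectorizer

corpus = [["this is spam, 'SPAM'"], ["this is ham, 'HAM'"], ["this is nothing, 'NOTHING'"]]

# Assumption (not from the original post): the label is the quoted
# token that follows the last comma of each document string.
splited_labels_from_corpus = [
    [doc[0].rsplit(",", 1)[1].strip().strip("'")] for doc in corpus
]

# Each "document" is already a list of tokens, so the tokenizer just
# returns it unchanged; lowercase=False skips the .lower() call that
# fails on lists.
vect = CountVectorizer(tokenizer=lambda doc: doc, lowercase=False)
bag_of_words = vect.fit_transform(splited_labels_from_corpus)
label_matrix = bag_of_words.toarray()

print(sorted(vect.vocabulary_))  # columns are ordered alphabetically
print(label_matrix)
```

The columns come out in alphabetical order (HAM, NOTHING, SPAM), which is why the SPAM row has its 1 in the last column. Recent scikit-learn versions may warn that token_pattern is ignored when a tokenizer is supplied; the result is unaffected.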
Thanks, everyone.
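As for the original traceback: CountVectorizer expects an iterable of strings, and its default preprocessor calls .lower() on each item, which is what fails on a nested list. A rough sketch of an alternative approach, again assuming the label is the quoted token after the last comma (my assumption, not the poster's method), is to unwrap each inner list and keep the text and the labels in separate lists:

```python
from sklearn.feature_extraction.text import CountVectorizer

corpus = [["this is spam, 'SPAM'"], ["this is ham, 'HAM'"], ["this is nothing, 'NOTHING'"]]

texts, labels = [], []
for doc in corpus:
    # Assumed format: "<text>, '<LABEL>'" inside a one-element list.
    body, label = doc[0].rsplit(",", 1)
    texts.append(body.strip())
    labels.append(label.strip().strip("'"))

# Plain strings go straight into the default word analyzer.
vect = CountVectorizer(analyzer="word")
X = vect.fit_transform(texts)

print(sorted(vect.vocabulary_))
print(X.toarray())
print(labels)
```

X and labels can then be handed to a scikit-learn classifier's fit(X, labels); the labels do not need to be vectorized for that, since the estimators accept string class labels.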