Sim*_*ity · python · scikit-learn
I have the code snippet below, with which I want to list word frequencies; first_text and second_text hold .tex files:
from sklearn.feature_extraction.text import CountVectorizer

# first_text and second_text hold the contents of the two .tex files
training_documents = (first_text, second_text)
vectorizer = CountVectorizer()
vectorizer.fit_transform(training_documents)
print "Vocabulary:", vectorizer.vocabulary_  # fitted attribute has a trailing underscore
When I run the script, I get the following:
File "test.py", line 19, in <module>
vectorizer.fit_transform(training_documents)
File "/usr/local/lib/python2.7/site-packages/sklearn/feature_extraction/text.py", line 817, in fit_transform
self.fixed_vocabulary_)
File "/usr/local/lib/python2.7/site-packages/sklearn/feature_extraction/text.py", line 752, in _count_vocab
for feature in analyze(doc):
File "/usr/local/lib/python2.7/site-packages/sklearn/feature_extraction/text.py", line 238, in <lambda>
tokenize(preprocess(self.decode(doc))), stop_words)
File "/usr/local/lib/python2.7/site-packages/sklearn/feature_extraction/text.py", line 115, in decode
doc = doc.decode(self.encoding, self.decode_error)
File "/usr/local/Cellar/python/2.7.11/Frameworks/Python.framework/Versions/2.7/lib/python2.7/encodings/utf_8.py", line 16, in decode
return codecs.utf_8_decode(input, errors, True)
UnicodeDecodeError: 'utf8' codec can't decode byte 0xa2 in position 200086: invalid start byte
How can I fix this?
Thanks.
If you can work out what the encoding of your documents actually is (perhaps they are latin-1), you can pass it to CountVectorizer with
vectorizer = CountVectorizer(encoding='latin-1')
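As a minimal sketch (Python 3 syntax; the documents and the byte 0xa2 are invented for illustration): 0xa2 is an invalid start byte in UTF-8 but a valid character ("¢") in latin-1, so declaring the right encoding lets CountVectorizer decode the raw bytes instead of raising UnicodeDecodeError:

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical latin-1 encoded documents; \xa2 is "¢" in latin-1
# but would crash the default UTF-8 decoder.
first_text = b"price in cents \xa2 per unit"
second_text = b"another document about price"

vectorizer = CountVectorizer(encoding='latin-1')
vectorizer.fit_transform([first_text, second_text])
print("Vocabulary:", vectorizer.vocabulary_)
```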
Otherwise, you can skip tokens containing the problematic bytes with
vectorizer = CountVectorizer(decode_error='ignore')
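A short sketch of that option (Python 3 syntax, with made-up input bytes): decode_error='ignore' silently drops any bytes the declared encoding cannot decode, so fitting succeeds at the cost of losing those characters:

```python
from sklearn.feature_extraction.text import CountVectorizer

# \xa2 is invalid UTF-8; with decode_error='ignore' it is dropped
# rather than raising UnicodeDecodeError.
docs = [b"good bytes \xa2 bad byte", b"second document"]

vectorizer = CountVectorizer(decode_error='ignore')
vectorizer.fit_transform(docs)
print("Vocabulary:", vectorizer.vocabulary_)
```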