Python: UnicodeDecodeError: 'utf8' codec can't decode byte

Question

Zac*_*ach 13 python encoding utf-8 scikit-learn

I am reading a bunch of RTF files into Python strings. On some of the texts, I get this error:

Traceback (most recent call last):
  File "11.08.py", line 47, in <module>
    X = vectorizer.fit_transform(texts)
  File "C:\Python27\lib\site-packages\sklearn\feature_extraction\text.py", line
716, in fit_transform
    X = super(TfidfVectorizer, self).fit_transform(raw_documents)
  File "C:\Python27\lib\site-packages\sklearn\feature_extraction\text.py", line
398, in fit_transform
    term_count_current = Counter(analyze(doc))
  File "C:\Python27\lib\site-packages\sklearn\feature_extraction\text.py", line
313, in <lambda>
    tokenize(preprocess(self.decode(doc))), stop_words)
  File "C:\Python27\lib\site-packages\sklearn\feature_extraction\text.py", line
224, in decode
    doc = doc.decode(self.charset, self.charset_error)
  File "C:\Python27\lib\encodings\utf_8.py", line 16, in decode
    return codecs.utf_8_decode(input, errors, True)
UnicodeDecodeError: 'utf8' codec can't decode byte 0x92 in position 462: invalid
 start byte

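I have tried:

Copying and pasting the text of the files into new files
Saving the rtf files as txt files
Opening the txt files in Notepad++, choosing "convert to utf-8", and setting the encoding to utf-8
Opening the files with Microsoft Word and saving them as new files

Nothing works. Any ideas?

It is probably not related, but here is the code in case you are wondering: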

f = open(dir+location, "r")
doc = Rtf15Reader.read(f)
t = PlaintextWriter.write(doc).getvalue()
texts.append(t)
f.close()
vectorizer = TfidfVectorizer(sublinear_tf=True, max_df=0.5, stop_words='english')
X = vectorizer.fit_transform(texts)

Answer 1

Jos*_*era 10

This will solve your problem:

import codecs

f = codecs.open(dir+location, 'r', encoding='utf-8')
txt = f.read()

From that point on, txt is a unicode string and you can use it anywhere in your code.

If you want to generate UTF-8 files after processing, do the following:

# f must be a file opened for binary writing (e.g. mode 'wb'), not the read handle above
f.write(txt.encode('utf-8'))

The new open() returns: `Traceback (most recent call last): File "11.08.py", line 41, in <module> t = f.read() File "C:\Python27\lib\codecs.py", line 671, in read return self.reader.read(size) File "C:\Python27\lib\codecs.py", line 477, in read newchars, decodedbytes = self.decode(data, self.errors) UnicodeDecodeError: 'utf8' codec can't decode byte 0x92 in position 1266: invalid start byte` (3 upvotes)
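
Editorial sketch, not part of the original thread: byte 0x92 can never start a UTF-8 sequence, but in Windows-1252 (cp1252) it is the right single quotation mark (U+2019), the "smart" apostrophe that word processors commonly insert. That suggests, though nothing in the thread confirms it, that the text is cp1252 rather than UTF-8, which would explain why forcing encoding='utf-8' fails in the same way. The helper name decode_with_fallback and the sample bytes below are hypothetical.

```python
def decode_with_fallback(raw, encodings=("utf-8", "cp1252")):
    """Try each encoding in order and return (text, encoding_used)."""
    for enc in encodings:
        try:
            return raw.decode(enc), enc
        except UnicodeDecodeError:
            continue
    # Last resort: decode with the first encoding and mark bad bytes with U+FFFD.
    return raw.decode(encodings[0], "replace"), encodings[0]


# Hypothetical sample: "don't" written with a cp1252 curly apostrophe (byte 0x92).
sample = b"don\x92t"
text, used = decode_with_fallback(sample)
print(used)        # cp1252
print(repr(text))
```

If this guess holds for the real files, decoding each document as cp1252 before appending it to texts would hand the vectorizer unicode strings, so it should no longer need to decode anything itself.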

Answer 2

And*_*ler 6

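As I said on the mailing list, it is probably easiest to use the charset_error option and set it to ignore. If the file is actually utf-16, you can also set the charset to utf-16 in the Vectorizer. See the docs.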
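
Editorial sketch, not part of the original answer: the traceback in the question shows the vectorizer calling doc.decode(self.charset, self.charset_error), so this option only selects the error handler passed to the standard decode call. The snippet below shows what each handler does to a hypothetical byte string. As far as I know, later scikit-learn releases renamed these parameters to encoding and decode_error, so check the documentation for the installed version.

```python
raw = b"it\x92s fine"  # hypothetical sample containing the offending byte 0x92

# "strict" (the default) raises, as in the question.
try:
    raw.decode("utf-8", "strict")
    strict_failed = False
except UnicodeDecodeError:
    strict_failed = True

# "ignore" silently drops the undecodable byte.
ignored = raw.decode("utf-8", "ignore")

# "replace" substitutes U+FFFD so the loss stays visible.
replaced = raw.decode("utf-8", "replace")

print(strict_failed)
print(repr(ignored))
print(repr(replaced))
```

Dropping the byte turns "it's" into "its", which is usually harmless for TF-IDF features but does lose information. Decoding with the correct charset avoids that loss.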

Archived:	13 years, 7 months ago
Views:	40450
Last activity:	6 years, 10 months ago