UnicodeDecodeError in Python while reading a UTF-8 SQL file from the English Wikipedia

Question

Update: I have changed the encoding to

with open("../data/enwiki-20131202-pagelinks.sql", encoding="ISO-8859-1")

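...and the program is now chewing through the file without complaint. Perhaps the SQL dump is not UTF-8 and does not contain such text; that was my mistaken assumption.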
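
A minimal sketch of why this change silences the error: ISO-8859-1 assigns a character to every one of the 256 byte values, so decoding cannot fail, although bytes that were meant as something else are silently read as Latin-1 characters. The sample bytes and the helper `try_decode` are made up for illustration; only the byte 0xf8 comes from the traceback.

```python
# Hypothetical sample line: ASCII text plus the byte 0xf8 from the traceback.
sample = b"(1,0,'caf\xf8')"


def try_decode(data, encoding):
    """Return the decoded text, or None if the bytes are invalid in `encoding`."""
    try:
        return data.decode(encoding)
    except UnicodeDecodeError:
        return None


# 0xf8 is not a valid start byte in UTF-8, so strict decoding fails.
utf8_text = try_decode(sample, "utf-8")

# ISO-8859-1 maps 0xf8 to U+00F8, so decoding succeeds.
latin1_text = try_decode(sample, "iso-8859-1")

# Every possible byte value decodes under ISO-8859-1.
all_bytes_text = bytes(range(256)).decode("iso-8859-1")

print(utf8_text)
print(latin1_text)
```

Since the pattern being counted (",0,") is pure ASCII, the count is the same whichever of the two encodings is used, as long as decoding does not raise.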

Original:

I am trying to process a huge data set from Wikipedia, namely the pagelinks.sql file.

Unfortunately, I get the following error while reading the file:

(...)
File "c:\Program Files\Python 3.3\lib\codecs.py", line 301, in decode
(result, consumed) = self._buffer_decode(data, self.errors, final)
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xf8 in position 5095: invalid start byte

My code is as follows:

import re

reg1 = re.compile(",0,")
ref_count = 0
with open("../data/enwiki-20131202-pagelinks.sql", encoding="utf8") as infile:
    for line in infile:
        matches = re.findall(reg1, line)
        ref_count += len(matches)

print("found", ref_count, "references.")

Answer 1

ber*_*nie 5

This excerpt from the notes under the "Unicode" heading at http://meta.wikimedia.org/wiki/Data_dumps/Dump_format may be helpful:

"Because of lenient charset validation in the earlier MediaWiki versions, the dump may contain non-Unicode (UTF8) characters in older text revisions..."

Setting aside the conflation of Unicode and UTF8 for the moment, what you can do to avoid the error is pass the errors keyword argument to open(), e.g.:

filepath = "../data/enwiki-20131202-pagelinks.sql" 
with open(filepath, encoding="utf8", errors='replace') as infile:
    ...

This "causes a replacement marker (such as '?') to be inserted where there is malformed data." http://docs.python.org/3/library/functions.html#open

If you would rather ignore the non-UTF8 characters, you can use errors='ignore'.
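
To make the difference between the two handlers concrete, here is a small sketch on made-up bytes (not taken from the actual dump) containing the 0xf8 byte from the traceback. Note that when decoding, Python's built-in codecs insert U+FFFD (the Unicode replacement character) rather than a literal '?'.

```python
# Made-up sample bytes; 0xf8 is invalid as a UTF-8 start byte.
raw = b"caf\xf8 (12,0,'Link')"

# errors="replace": the bad byte becomes U+FFFD.
replaced = raw.decode("utf-8", errors="replace")

# errors="ignore": the bad byte is dropped silently.
ignored = raw.decode("utf-8", errors="ignore")

print(repr(replaced))
print(repr(ignored))

# The ASCII pattern from the question is unaffected either way.
print(replaced.count(",0,"), ignored.count(",0,"))
```

Either handler lets the loop in the question run to completion; 'replace' at least leaves a visible trace of where the malformed data was.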
