errors="surrogateescape" vs errors="replace"

Question

I'm trying to open a file like this:

with open("myfile.txt", encoding="utf-8") as f:

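But myfile.txt comes from the users of my application. 90% of the time the file is not UTF-8, which makes the application exit because it cannot read the file properly. The error looks like 'utf-8' codec can't decode byte 0x9c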

I Googled it and found some Stack Overflow answers that say to open my file like this:

with open("myfile.txt", encoding="utf-8", errors="surrogateescape") as f:

But other answers say to use:

with open("myfile.txt", encoding="utf-8", errors="replace") as f:

So what is the difference between errors="replace" and errors="surrogateescape", and which one will fix the non-UTF-8 bytes in the file?

Answer 1

Ser*_*sta 6

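The docs say:

'replace': Replace with a suitable replacement marker; Python will use the official U+FFFD REPLACEMENT CHARACTER for the built-in codecs on decoding, and '?' on encoding. Implemented in replace_errors().
...
'surrogateescape': On decoding, replace byte with individual surrogate code ranging from U+DC80 to U+DCFF. This code will then be turned back into the same byte when the 'surrogateescape' error handler is used when encoding the data. (See PEP 383 for more.)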

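This means that with replace, every offending byte is replaced with the same U+FFFD replacement character, while with surrogateescape each byte is replaced with a different value. For example, a '\xe9' would be replaced with a '\udce9' and a '\xe8' with a '\udce8'.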
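
To make the difference concrete, here is a minimal sketch that decodes the same bytes with both handlers. The sample byte string is made up for illustration; ascii() is used for printing because a string holding lone surrogates cannot always be written to a terminal.

```python
# A byte string that is plain ASCII except for two stray Latin-1 bytes.
raw = b"caf\xe9 cr\xe8me"

replaced = raw.decode("utf-8", errors="replace")
escaped = raw.decode("utf-8", errors="surrogateescape")

# errors="replace": both bad bytes collapse into the same U+FFFD character.
print(ascii(replaced))   # 'caf\ufffd cr\ufffdme'
# errors="surrogateescape": each bad byte gets its own lone surrogate.
print(ascii(escaped))    # 'caf\udce9 cr\udce8me'
```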

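So with replace you get valid unicode characters but lose the original content of the file, while with surrogateescape you still know the original bytes (and can even rebuild them exactly with .encode(errors='surrogateescape')), but your unicode string is not valid text because it contains raw surrogate codes.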
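
A short sketch of that trade-off, using a made-up byte string: the surrogateescape result encodes back to the original bytes, the replace result does not, and a strict encode of the escaped string is rejected.

```python
raw = b"caf\xe9"                       # not valid UTF-8

escaped = raw.decode("utf-8", errors="surrogateescape")
replaced = raw.decode("utf-8", errors="replace")

# surrogateescape round-trips: the lone surrogate turns back into byte 0xE9.
restored = escaped.encode("utf-8", errors="surrogateescape")

# replace does not: U+FFFD is encoded as its own three UTF-8 bytes.
lossy = replaced.encode("utf-8")

# The escaped string contains a lone surrogate, so a strict encode fails.
try:
    escaped.encode("utf-8")
    strict_ok = True
except UnicodeEncodeError:
    strict_ok = False

print(restored == raw, lossy, strict_ok)
```

In practice this means a string decoded with surrogateescape has to be encoded with the same handler (or cleaned up first) before it is written out or passed to code that expects valid text.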

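TL;DR: if the original offending bytes do not matter and you just want to get rid of the error, replace is a good choice; if you need to keep them for later processing, surrogateescape is the way to go.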
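
As one illustration of "later processing", the sketch below locates the escaped bytes in a decoded line and recovers their original values. The helper name find_undecodable and the sample line are hypothetical, not part of any library.

```python
import re

# Lone surrogates produced by surrogateescape always fall in U+DC80..U+DCFF.
ESCAPED = re.compile("[\udc80-\udcff]+")

def find_undecodable(text):
    """Return (offset, original_bytes) for each run of escaped bytes.

    Hypothetical helper, shown only to illustrate the idea.
    """
    return [(m.start(), m.group().encode("utf-8", errors="surrogateescape"))
            for m in ESCAPED.finditer(text)]

line = b"id=7;name=Ren\xe9e;city=K\xf6ln".decode("utf-8", errors="surrogateescape")
print(find_undecodable(line))   # [(13, b'\xe9'), (22, b'\xf6')]
```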

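surrogateescape has a very nice feature when your files mostly contain ASCII characters and a few (accented) non-ASCII characters, and some users occasionally modify the file with a non-UTF-8 editor (or fail to declare the UTF-8 encoding). In that case you end up with a file containing mostly UTF-8 data plus some bytes in a different encoding, often CP1252 for Windows users in non-English Western European languages such as French, Portuguese, or Spanish. It is then possible to build a translation table that maps the surrogate characters to their equivalents in the cp1252 charset: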

# first map all surrogates in the range 0xdc80-0xdcff to codes 0x80-0xff
# (str.join needs strings, so each code point goes through chr() first)
tab0 = str.maketrans(''.join(chr(i) for i in range(0xdc80, 0xdd00)),
                     ''.join(chr(i) for i in range(0x80, 0x100)))
# then decode all bytes in the range 0x80-0xff as cp1252, and map the undecoded ones
#  to latin1 (using previous transtable)
t = bytes(range(0x80, 0x100)).decode('cp1252', errors='surrogateescape').translate(tab0)
# finally use above string to build a transtable mapping surrogates in the range 0xdc80-0xdcff
#  to their cp1252 equivalent, or latin1 if byte has no value in cp1252 charset
tab = str.maketrans(''.join(chr(i) for i in range(0xdc80, 0xdd00)), t)

You can then decode a file containing a mojibake of utf8 and cp1252:

with open("myfile.txt", encoding="utf-8", errors="surrogateescape") as f:
    for line in f:                     # ok utf8 has been decoded here
        line = line.translate(tab)     # and cp1252 bytes are recovered here
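
The same recovery can be checked without a real file. This is a sketch on made-up data: io.BytesIO stands in for myfile.txt, and the table is rebuilt inline so the block runs on its own.

```python
import io

# Rebuild the surrogate -> cp1252 table so this block is self-contained.
surrogates = "".join(chr(i) for i in range(0xDC80, 0xDD00))
high_latin1 = "".join(chr(i) for i in range(0x80, 0x100))
tab0 = str.maketrans(surrogates, high_latin1)
t = bytes(range(0x80, 0x100)).decode("cp1252", errors="surrogateescape").translate(tab0)
tab = str.maketrans(surrogates, t)

# Hypothetical sample: the text "é,€" saved once as UTF-8 and once as cp1252.
sample = "\xe9,\u20ac\n"
mixed = sample.encode("utf-8") + sample.encode("cp1252")

# TextIOWrapper over BytesIO behaves like open() on a file with these bytes.
with io.TextIOWrapper(io.BytesIO(mixed), encoding="utf-8",
                      errors="surrogateescape") as f:
    lines = [line.translate(tab) for line in f]

print(lines[0] == lines[1] == sample)   # True
```

One limit to keep in mind: a cp1252 byte sequence that happens to form valid UTF-8 is decoded as UTF-8 and never reaches the table, so this only repairs bytes that the UTF-8 decoder rejected.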

I have successfully used this method several times to recover csv files that were produced as utf8 and then edited with Excel on Windows machines.

The same method can be used for other charsets derived from ascii.
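
As a sketch of that generalization, the table can be built for any single-byte, ASCII-compatible codec. The helper name make_fallback_table and the cp1250 sample are assumptions chosen for illustration; the Latin-1 fallback mirrors what the cp1252 table does for undefined bytes.

```python
def make_fallback_table(codec):
    """Map surrogates U+DC80..U+DCFF to the characters of `codec`.

    Hypothetical helper. Assumes `codec` is a single-byte, ASCII-compatible
    charset known to Python. Bytes the codec leaves undefined fall back to
    their Latin-1 code point.
    """
    table = {}
    for byte in range(0x80, 0x100):
        try:
            char = bytes([byte]).decode(codec)
        except UnicodeDecodeError:
            char = chr(byte)            # Latin-1 fallback
        table[0xDC00 + byte] = char
    return table

tab_1250 = make_fallback_table("cp1250")    # Central European Windows code page

expected = "\u0141\xf3d\u017a"              # the city name "Łódź"
raw = expected.encode("cp1250")             # bytes as a cp1250 editor would save them
text = raw.decode("utf-8", errors="surrogateescape").translate(tab_1250)
print(text == expected)                     # True
```

The returned dict can be passed to str.translate() directly, so it drops into the file-reading loop in place of tab.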
