如何允许解码 UTF-8 字节数组？

Question

如何允许解码 UTF-8 字节数组？

Joe*_*Joe 6 python error-handling decode utf-8 python-3.x

我需要将存储在字节数组中的 UTF-8 序列解码为字符串。

\n\n

UTF-8 序列可能包含错误部分。在这种情况下，我需要尽可能多地解码并（可选？）用“？”之类的内容替换无效部分。

\n\n

# First part decodes to "AB\xc3\x84C"\nb = bytearray([0x41, 0x42, 0xC3, 0x84, 0x43])\ns = str(b, "utf-8") \nprint(s)\n\n# Second part, invalid sequence, wanted to decode to something like "AB?C"\nb = bytearray([0x41, 0x42, 0xC3, 0x43])\ns = str(b, "utf-8")\nprint(s)\n

Run Code Online (Sandbox Code Playgroud)\n\n

在 Python 3 中实现这一目标的最佳方法是什么？

\n

Answer 1

Zer*_*eus 6

有几种内置的错误处理方案str用于编码和解码，bytes例如。例如：bytearraybytearray.decode()

\n\n\n\n

>>> b = bytearray([0x41, 0x42, 0xC3, 0x43])\n

Run Code Online (Sandbox Code Playgroud)\n\n

\n\n

>>> b.decode(\'utf8\', errors=\'ignore\')  # discard malformed bytes\n\'ABC\'\n

Run Code Online (Sandbox Code Playgroud)\n\n

\n\n

>>> b.decode(\'utf8\', errors=\'replace\')  # replace with U+FFFD\n\'AB\xef\xbf\xbdC\'\n

Run Code Online (Sandbox Code Playgroud)\n\n

\n\n

>>> b.decode(\'utf8\', errors=\'backslashreplace\')  # replace with backslash-escape\n\'AB\\\\xc3C\'\n

Run Code Online (Sandbox Code Playgroud)\n\n

此外，您可以编写自己的错误处理程序并注册它：

\n\n

import codecs\n\ndef my_handler(exception):\n    """Replace unexpected bytes with \'?\'."""\n    return \'?\', exception.end\n\ncodecs.register_error(\'my_handler\', my_handler)\n

Run Code Online (Sandbox Code Playgroud)\n\n

\n\n

>>> b.decode(\'utf8\', errors=\'my_handler\')\n\'AB?C\'\n

Run Code Online (Sandbox Code Playgroud)\n\n

所有这些错误处理方案也可以与str()构造函数一起使用，如您的问题所示：

\n\n

>>> str(b, \'utf8\', errors=\'my_handler\')\n\'AB?C\'\n

Run Code Online (Sandbox Code Playgroud)\n\n

...尽管明确使用它更惯用str.decode()。

\n\n

\n

归档时间：	9 年，1 月前
查看次数：	10997 次
最近记录：	9 年，1 月前