Why can I decode a UTF-8 byte string as ISO8859-1 and back again without any UnicodeEncodeError/UnicodeDecodeError?

Question


How does the following work in Python without any errors?

>>> '你好'.encode('UTF-8').decode('ISO8859-1')
'ä½\xa0å¥½'
>>> _.encode('ISO8859-1').decode('UTF-8')
'你好'


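I expected it to fail with a UnicodeEncodeError or a UnicodeDecodeError.

Is there some property of ISO8859-1 and UTF-8 such that I can take any UTF-8 encoded string, decode it as an ISO8859-1 string, and later reverse the process to get the original UTF-8 string back?

I am working with a legacy database that only supports the ISO8859-1 character set. It appears that developers have been able to store Chinese and other languages in this database by decoding UTF-8 encoded strings as ISO8859-1 and storing the resulting garbage strings in the database. Downstream systems that query this database have to encode the garbage strings as ISO8859-1 and then decode the result as UTF-8 to get the correct string.

I would have assumed that such a process would not work at all.

What am I missing?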

Answer 1

Mar*_*nen 5

The special property of ISO-8859-1 is that the 256 characters it represents correspond 1:1 with the first 256 Unicode code points, so byte 00h decodes to U+0000 and byte FFh decodes to U+00FF.
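The 1:1 correspondence can be checked directly. The following is a minimal sketch that relies on Python's built-in `iso-8859-1` codec; the variable names `all_bytes` and `text` are only illustrative.

```python
# Decode every possible byte value (00h..FFh) in one go.
all_bytes = bytes(range(256))
text = all_bytes.decode('iso-8859-1')

# Each resulting code point has the same numeric value as its source byte.
assert [ord(c) for c in text] == list(range(256))

# Encoding reverses the mapping and reproduces the original bytes exactly.
assert text.encode('iso-8859-1') == all_bytes
```

Because no byte value is left undefined and no two bytes share a character, decoding can never fail and encoding the result always restores the input.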

So if you encode to UTF-8 and decode as ISO-8859-1, you get a Unicode string made up of code points whose values match the encoded UTF-8 bytes:

>>> s = '你好'
>>> s.encode('utf8').hex()
'e4bda0e5a5bd'
>>> u = s.encode('utf8').decode('iso-8859-1')
>>> u
'ä½\xa0å¥½'
>>> for c in u:
...  print(f'{c} U+{ord(c):04X}')
...
ä U+00E4   # Unicode code points are the same as the bytes of UTF-8.
½ U+00BD
  U+00A0
å U+00E5
¥ U+00A5
½ U+00BD
>>> u.encode('iso-8859-1').hex()  # transform back to bytes.
'e4bda0e5a5bd'
>>> u.encode('iso-8859-1').decode('utf8')   # and decode as UTF-8 again.
'你好'


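Any 8-bit encoding that assigns a character to all 256 byte values would also work; it just would not be a 1:1 mapping between byte values and code points. Code page 1256 is one such encoding: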

>>> for c in s.encode('utf8').decode('cp1256'):
...  print(f'{c} U+{ord(c):04X}')
...
ن U+0646   # This would still .encode('cp1256') back to byte E4, for example
½ U+00BD
  U+00A0
ه U+0647
¥ U+00A5
½ U+00BD
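For contrast, an 8-bit encoding with unassigned byte values does not allow this trick for arbitrary input. The sketch below assumes Python's built-in `cp1252` codec, which leaves byte 90h undefined; the sample string and the variable names are illustrative only.

```python
# Cyrillic capital A (U+0410) encodes to the UTF-8 bytes D0 90.
data = 'А'.encode('utf-8')
assert data == b'\xd0\x90'

# cp1256 has a character for both bytes, so the round trip succeeds.
assert data.decode('cp1256').encode('cp1256') == data

# Python's cp1252 codec has no character for byte 90h, so decoding fails.
try:
    data.decode('cp1252')
    cp1252_failed = False
except UnicodeDecodeError:
    cp1252_failed = True

assert cp1252_failed
```

This is the kind of UnicodeDecodeError the question anticipated; it simply never arises with ISO-8859-1 because that codec covers every byte.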

