Converting bytes with embedded literal \xhh escapes to unicode

Question

I have: b'{"street":"Grossk\\xc3\\xb6lnstra\\xc3\\x9fe"}'

I need: '{"street": "Grosskölnstraße"}'

I tried:

s.decode('utf8'): # '{"street":"Grossk\\xc3\\xb6lnstra\\xc3\\x9fe"}'
s.decode('unicode_escape'): # '{"street":"GrosskÃ¶lnstraÃ\x9fe"}'

What is the correct way to do this?

Answer 1

Mar*_*ers 6

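That's... quite a mess you have there. It looks like UTF-8 bytes embedded as Python byte escape sequences.

No codec will produce bytes as output again; you need to decode with the unicode_escape codec, re-encode to Latin-1 to get back the UTF-8 bytes, and then decode those as UTF-8: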

s.decode('unicode_escape').encode('latin1').decode('utf8')

Demo:

>>> s = b'{"street":"Grossk\\xc3\\xb6lnstra\\xc3\\x9fe"}'
>>> s.decode('unicode_escape').encode('latin1').decode('utf8')
'{"street":"Grosskölnstraße"}'
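To see why the chain needs three steps, here is a minimal sketch that prints each intermediate value. The variable names step1, step2 and step3 are only illustrative; the input is the one from the question.

```python
s = b'{"street":"Grossk\\xc3\\xb6lnstra\\xc3\\x9fe"}'

# Step 1: interpret the literal \xhh escapes. Each escape becomes one
# code point in the range U+0000..U+00FF, so the text is still mojibake.
step1 = s.decode('unicode_escape')

# Step 2: Latin-1 maps code points 0..255 to the byte with the same value,
# which recovers the raw UTF-8 byte sequence.
step2 = step1.encode('latin1')

# Step 3: decode the recovered bytes as UTF-8 to get the intended text.
step3 = step2.decode('utf8')

print(repr(step1))
print(repr(step2))
print(repr(step3))
```

The Latin-1 round trip works only because every character produced in step 1 is below U+0100; input that already contains characters outside that range would raise UnicodeEncodeError in step 2.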

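Another option is to target only the \x[hexdigits]{2} pattern with a regular expression; if the data was not produced by a faulty Python script, this may be the more robust option: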

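import re
from functools import partial

# Match a literal backslash, 'x', and exactly two hex digits (lowercase a-f only)
escape = re.compile(rb'\\x([\da-f]{2})')
# Replace each match with the single byte that the two hex digits encode
repair = partial(escape.sub, lambda m: bytes.fromhex(m.group(1).decode()))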

repair() returns a bytes object:

>>> repair(s)
b'{"street":"Grossk\xc3\xb6lnstra\xc3\x9fe"}'
>>> repair(s).decode('utf8')
'{"street":"Grosskölnstraße"}'
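As a rough illustration of why the regex route can be more robust, consider a hypothetical input (not from the question) that contains ordinary JSON \" escapes in addition to the literal \xhh sequences. The names messy, via_codec and via_regex are made up for this sketch.

```python
import json
import re

# Hypothetical input: JSON with \" escapes plus literal \xhh sequences.
messy = rb'{"note":"a \"quoted\" caf\xc3\xa9"}'

# Codec chain: unicode_escape also turns \" into a bare ", which breaks the JSON.
via_codec = messy.decode('unicode_escape').encode('latin1').decode('utf8')

# Regex: only \xhh sequences are replaced; other backslash escapes stay untouched.
escape = re.compile(rb'\\x([\da-f]{2})')
via_regex = escape.sub(
    lambda m: bytes.fromhex(m.group(1).decode()), messy
).decode('utf8')

try:
    json.loads(via_codec)
    codec_ok = True
except json.JSONDecodeError:
    codec_ok = False

print(codec_ok)               # whether the codec-chain result still parses
print(json.loads(via_regex))  # parsed result of the regex-repaired text
```

Under these assumptions the codec chain yields text that no longer parses as JSON, while the regex-repaired text parses to a dict whose value keeps the quotes and the é. Neither approach distinguishes an escaped backslash followed by "xc3" from a real \xc3 escape, so this is a sketch rather than a complete repair tool.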
