Convert a unicode whose content is a UTF-8 string to str

Question

won*_*ng2 10 python utf-8 python-2.x mojibake pyquery

I'm using pyquery to parse a page:

dom = PyQuery('http://zh.wikipedia.org/w/index.php', {'title': 'CSS', 'printable': 'yes', 'variant': 'zh-cn'})
content = dom('#mw-content-text > p').eq(0).text()


But the content I get back is a unicode string whose code points are actually UTF-8 encoded bytes:

u'\xe5\xb1\x82\xe5\x8f\xa0\xe6\xa0\xb7\xe5\xbc\x8f\xe8\xa1\xa8...'


How can I convert it to a str without losing the content?

To be clear:

I want content == '\xe5\xb1\x82\xe5\x8f\xa0\xe6\xa0\xb7\xe5\xbc\x8f\xe8\xa1\xa8'

not content == u'\xe5\xb1\x82\xe5\x8f\xa0\xe6\xa0\xb7\xe5\xbc\x8f\xe8\xa1\xa8'

Answer 1

Mar*_*ers 26

If you have a unicode value that holds UTF-8 bytes, encode it to Latin-1 to preserve the 'bytes':

content = content.encode('latin1')


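This works because the Unicode code points U+0000 to U+00FF all map one-to-one onto the Latin-1 encoding; encoding to Latin-1 therefore turns each code point into the literal byte with the same value.

For your example, this gives me: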

>>> content = u'\xe5\xb1\x82\xe5\x8f\xa0\xe6\xa0\xb7\xe5\xbc\x8f\xe8\xa1\xa8'
>>> content.encode('latin1')
'\xe5\xb1\x82\xe5\x8f\xa0\xe6\xa0\xb7\xe5\xbc\x8f\xe8\xa1\xa8'
>>> content.encode('latin1').decode('utf8')
u'\u5c42\u53e0\u6837\u5f0f\u8868'
>>> print content.encode('latin1').decode('utf8')
层叠样式表

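For readers on Python 3, the same round trip can be sketched as follows. This is only an illustration under the assumption that the text really is UTF-8 that was decoded as Latin-1; `repair_mojibake` is a hypothetical helper name, not part of PyQuery or the standard library. In Python 3, `str` is the text type and `bytes` plays the role of the Python 2 `str`.

```python
def repair_mojibake(text):
    """Undo a UTF-8 payload that was wrongly decoded as Latin-1.

    Hypothetical helper for illustration; raises UnicodeEncodeError if
    `text` contains code points above U+00FF (i.e. it was not mojibake
    of this kind).
    """
    # Latin-1 maps U+0000..U+00FF to bytes 0x00..0xFF one-to-one,
    # so this recovers the original bytes without loss.
    raw = text.encode('latin-1')
    # Those bytes were UTF-8 all along, so decode them as such.
    return raw.decode('utf-8')


# Every code point below U+0100 encodes to the byte with the same value.
assert all(chr(i).encode('latin-1') == bytes([i]) for i in range(256))

broken = '\xe5\xb1\x82\xe5\x8f\xa0\xe6\xa0\xb7\xe5\xbc\x8f\xe8\xa1\xa8'
raw_bytes = broken.encode('latin-1')   # the bytes the question asks for
fixed = repair_mojibake(broken)        # the five intended characters
```

Here `raw_bytes` corresponds to the Python 2 `str` the question wants, and `fixed` is the properly decoded text.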

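PyQuery uses either requests or urllib to retrieve the HTML, and when requests is used, it reads the .text attribute of the response. That attribute auto-decodes the response data based solely on the encoding set in the Content-Type header, or, if that information is not available, falls back to latin-1 (this applies to text responses, and HTML is a text response). You can override this by passing in an encoding argument: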

dom = PyQuery('http://zh.wikipedia.org/w/index.php',
              {'title': 'CSS', 'printable': 'yes', 'variant': 'zh-cn'},
              encoding='utf8')  # keyword arguments must follow positional ones


At that point you don't need to re-encode at all.
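The effect of the fallback can be reproduced without any network access. The snippet below is a rough Python 3 sketch that simulates only the decoding step, assuming the server sends UTF-8 bytes and names no charset in its header; `payload`, `wrong`, and `right` are illustrative names, and no PyQuery or requests call is involved.

```python
# Bytes as the server would send them: the five characters encoded as UTF-8.
payload = '\u5c42\u53e0\u6837\u5f0f\u8868'.encode('utf-8')

# Fallback case: no charset is known, so the bytes are decoded as Latin-1.
# Each byte becomes one code point, which yields the mojibake from the question.
wrong = payload.decode('latin-1')

# Override case: decoding as UTF-8 gives the intended text straight away.
right = payload.decode('utf-8')

assert wrong == '\xe5\xb1\x82\xe5\x8f\xa0\xe6\xa0\xb7\xe5\xbc\x8f\xe8\xa1\xa8'
assert right == '\u5c42\u53e0\u6837\u5f0f\u8868'
```

Choosing the right encoding at decode time avoids the problem at its source, whereas the Latin-1 round trip repairs the text after the fact.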
