How to encode non-ASCII HTML data as UTF-8 in Python

Question

I tried the following and got this error:

>>> import re  
>>> x = 'Ingl\xeas'  
>>> x  
'Ingl\xeas'  
>>> print x  
Ingl?s  
>>> x.decode('utf8')  
Traceback (most recent call last):  
    File "<stdin>", line 1, in <module>  
    File "/usr/lib/python2.6/encodings/utf_8.py", line 16, in decode  
        return codecs.utf_8_decode(input, errors, True)  
UnicodeDecodeError: 'utf8' codec can't decode bytes in position 4-5: unexpected end of data  
>>> x.decode('utf8', 'ignore')  
u'Ingl'  
>>> x.decode('utf8', 'replace')  
u'Ingl\ufffd'  
>>> print x.decode('utf8', 'replace')  
Ingl?  
>>> print x.decode('utf8', 'xmlcharrefreplace')  
Traceback (most recent call last):  
    File "<stdin>", line 1, in <module>  
    File "/usr/lib/python2.6/encodings/utf_8.py", line 16, in decode  
        return codecs.utf_8_decode(input, errors, True)  
TypeError: don't know how to handle UnicodeDecodeError in error callback

When I use the print statement, I expect:

>>> print x  
u'Inglês'

Any help is welcome.

Answer 1

Dan*_*ach 7

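Before you can decode the input data, you need to know how it is encoded. In some of your attempts you tried to decode it as UTF-8, but Python raised an exception because the input is not valid UTF-8. It looks like it may be Latin-1. This works for me: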

>>> x = 'Ingl\xeas'
>>> print x.decode('latin1')
Inglês

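To get from there to the UTF-8 bytes the title asks about, the decoded text can be encoded again. This is only a minimal sketch, assuming the input really is Latin-1; the helper name latin1_to_utf8 is made up for illustration and is not part of any library:

```python
# -*- coding: utf-8 -*-

def latin1_to_utf8(raw):
    """Re-encode a byte string from Latin-1 to UTF-8 (hypothetical helper)."""
    text = raw.decode('latin-1')   # bytes -> Unicode text (assumes Latin-1 input)
    return text.encode('utf-8')    # Unicode text -> UTF-8 bytes

raw = b'Ingl\xeas'                 # the bytes from the question
utf8_bytes = latin1_to_utf8(raw)
```

The one-byte \xea becomes the two-byte sequence \xc3\xaa in UTF-8. If the source encoding is guessed wrong, the result will be silently wrong rather than raise an error, since any byte sequence decodes as Latin-1.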

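You mention "non-ASCII HTML". If you are writing a web server script and getting the data from an HTTP request, you should check the Content-Type header. In an ideal world, it tells you which encoding the client used for the data. Keep in mind that the client may be misbehaving.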
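One possible way to read the charset out of a Content-Type value is the standard library's email.message.Message. The sketch below is an assumption about how you might wire it up, not a definitive implementation: the function names and the Latin-1 fallback are my own choices, and how you obtain the header value depends on your framework.

```python
from email.message import Message

def charset_from_content_type(header_value, default='latin-1'):
    """Return the charset parameter of a Content-Type value, or a default.

    Hypothetical helper; the Latin-1 default is an assumption.
    """
    msg = Message()
    msg['Content-Type'] = header_value
    return msg.get_content_charset() or default

def decode_body(raw, header_value):
    """Decode raw bytes using the declared charset, falling back to Latin-1."""
    charset = charset_from_content_type(header_value)
    try:
        return raw.decode(charset)
    except (LookupError, UnicodeDecodeError):
        # The client declared an unknown or incorrect charset.
        return raw.decode('latin-1')
```

The fallback covers the misbehaving-client case: a body declared as UTF-8 that is actually Latin-1 still decodes, although the guess may be wrong for other encodings.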

Hope that helps!

Archived:	16 years, 3 months ago
Views:	9829
Last recorded:	16 years, 3 months ago