Python can't encode bad unicode to ascii

Question

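I have some Python code that receives strings containing bad unicode. When I try to ignore the bad characters, Python still chokes (version 2.6.1). Here's how to reproduce it: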

s = 'ad\xc2-ven\xc2-ture'
s.encode('utf8', 'ignore')

It throws

UnicodeDecodeError: 'ascii' codec can't decode byte 0xc2 in position 2: ordinal not in range(128)

What am I doing wrong?

Answer 1

Sve*_*ach 10

In Python 2.x, convert the string to a unicode instance using str.decode():

 >>> s.decode("ascii", "ignore")
 u'ad-ven-ture'
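
As for why an encode call reports a *decode* error: in Python 2, calling .encode() on a byte str first decodes it implicitly with the default ASCII codec, and the 'ignore' handler passed to encode() is not applied to that hidden step. Below is a minimal sketch of that step. It is written for Python 3 (an assumption, since the question uses 2.6.1), where the decode has to be spelled out. The names raw, error_position and text are illustrative only.

```python
# Python 3 sketch: the same bytes as the question's string s.
raw = b'ad\xc2-ven\xc2-ture'

# What Python 2 attempts implicitly before encoding: a strict ASCII decode.
try:
    raw.decode('ascii')
    error_position = None
except UnicodeDecodeError as exc:
    error_position = exc.start  # index of the first non-ASCII byte
    print(exc)

# Putting the error handler on an explicit decode step succeeds.
text = raw.decode('ascii', 'ignore')
print(text)
```

The position reported here should match the "position 2" in the traceback from the question.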

Answer 2

Tho*_*ers 8

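You're confusing "unicode" and "utf-8". Your string s is not unicode; it's a bytestring in a particular encoding (but not UTF-8, more likely iso-8859-1 or similar). Going from a bytestring to unicode is done by decoding the data, not encoding it. Going from unicode to a bytestring is encoding. Perhaps you meant to make s a unicode string: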

>>> s = u'ad\xc2-ven\xc2-ture'
>>> s.encode('utf8', 'ignore')
'ad\xc3\x82-ven\xc3\x82-ture'
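
To make the two directions concrete, here is a small sketch in Python 3, where bytes and text are distinct types. It assumes the incoming data really is ISO-8859-1, which is only a guess about the source. The variable names are illustrative.

```python
# Python 3 sketch: bytes and text are separate types.
raw = b'ad\xc2-ven\xc2-ture'

# bytes -> text is decoding. ISO-8859-1 maps every byte to a code point,
# so this cannot fail (byte 0xC2 becomes U+00C2).
text = raw.decode('iso-8859-1')

# text -> bytes is encoding. U+00C2 becomes the two bytes C3 82 in UTF-8.
utf8_bytes = text.encode('utf-8')

print(repr(text))
print(utf8_bytes)
```

The resulting bytes match the encoded output shown above. Only the first step (the decode) depends on knowing the source encoding.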

Or perhaps you want to treat the bytestring as UTF-8 but ignore invalid sequences, in which case you would decode the bytestring with 'ignore' as the error handler:

>>> s = 'ad\xc2-ven\xc2-ture'
>>> u = s.decode('utf-8', 'ignore')
>>> u
u'adventure'
>>> u.encode('utf-8')
'adventure'
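
For comparison, a rough Python 3 version of the same idea, also showing the 'replace' handler as an alternative to 'ignore'. This is a sketch, not a drop-in for the 2.6 session above, and the variable names are made up for illustration.

```python
# Python 3 sketch: decode possibly-invalid UTF-8 with two error handlers.
raw = b'ad\xc2-ven\xc2-ture'

# 'ignore' silently drops bytes that do not form valid UTF-8.
ignored = raw.decode('utf-8', 'ignore')

# 'replace' substitutes U+FFFD so the damage stays visible.
replaced = raw.decode('utf-8', 'replace')

print(ignored)
print(replaced)
```

On current Python 3 releases only the stray 0xC2 byte is dropped, so the hyphens survive. That differs from the 2.6-era output above, where the byte after each 0xC2 was swallowed as well.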
