How to use Unicode in Python

Question

PyN*_*bie 15 python string unicode replace unicode-string

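I am trying to strip all of the HTML out of a string so that the final output is a plain text file. I have done some research on the various "converters" and am leaning towards creating my own dictionary of entities and symbols and running replace over the string. I am considering this because I want to automate the process, and the quality of the underlying HTML varies a lot. To start comparing the speed of my solution against one of the alternatives, for example pyparsing, I decided to test replacing \xa0 with the string method replace. I got: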

UnicodeDecodeError: 'ascii' codec can't decode byte 0xa0 in position 0: ordinal not in range(128)

The actual line of code was:

s=unicodestring.replace('\xa0','')

Anyway, I decided that I needed to prefix the literal with r, so I ran this line of code:

s=unicodestring.replace(r'\xa0','')

It runs without an error, but when I look at a slice of s, I see that the \xa0 is still there.

Answer 1

z33*_*33m 25

Maybe you should be doing this instead:

s=unicodestring.replace(u'\xa0',u'')

Answer 2

dbr*_*dbr 6

s=unicodestring.replace('\xa0','')

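..passes the byte string '\xa0' (str, the default string type in Python until version 3.x) to a method of a unicode string. Python 2 then implicitly decodes that byte string with the ASCII codec, and byte 0xA0 is outside the ASCII range, hence the UnicodeDecodeError.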
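The error can be reproduced by spelling out the ASCII decode explicitly. This is only a minimal sketch, written for Python 3 (where bytes and text are separate types) rather than the Python 2 of the question; the variable names are illustrative.

```python
# Python 3 sketch: make explicit the ASCII decode that Python 2 performed
# silently when a byte string met a unicode string.
raw = b"\xa0"  # a single byte, 0xA0, like '\xa0' in Python 2

try:
    raw.decode("ascii")  # the step Python 2 attempted behind the scenes
    message = None
except UnicodeDecodeError as exc:
    message = str(exc)

print(message)

# A codec that covers 0xA0 decodes it to U+00A0 (no-break space).
nbsp = raw.decode("latin-1")
print(nbsp == "\u00a0")
```

The printed message matches the one in the question, which suggests the failure is a decoding problem in the argument, not in replace itself.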

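The reason r'\xa0' did not raise an error is that escape sequences have no effect in a raw string. Rather than turning \xa0 into a single character, Python reads the literal as "literal backslash", "literal x", and so on. That four-character sequence does not occur in the text, so replace finds nothing to remove.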

The following are the same:

>>> r'\xa0'
'\\xa0'
>>> '\\xa0'
'\\xa0'
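To make the difference concrete, here is a small sketch comparing the two literals and showing why the raw version left the string untouched. It is written for Python 3, and the sample text is made up for illustration.

```python
# The escape is processed in a normal literal but left alone in a raw literal.
normal = "\xa0"  # one character: U+00A0 (no-break space)
raw = r"\xa0"    # four characters: backslash, 'x', 'a', '0'

print(len(normal), len(raw))

text = "price:\xa0100"  # hypothetical sample containing a no-break space
# The raw pattern never occurs in the text, so replace() returns it unchanged.
unchanged = text.replace(raw, "")
print(unchanged == text)
```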

This is resolved in Python 3, where the default string type is unicode, so you can just do..

>>> '\xa0'
'\xa0'
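As a rough illustration of the Python 3 behaviour, the call from the question works without a u prefix, while mixing bytes and text is rejected outright instead of being decoded implicitly. The sample string is hypothetical.

```python
text = "10\xa0km"                   # str is Unicode text in Python 3
cleaned = text.replace("\xa0", "")  # same call as in the question
print(cleaned)

# Python 3 does not decode bytes implicitly: mixing the two types is a TypeError.
try:
    text.replace(b"\xa0", b"")
    mixed_types_rejected = False
except TypeError:
    mixed_types_rejected = True
print(mixed_types_rejected)
```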

"I am trying to strip all of the HTML out of a string so that the final output is a plain text file"

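I would strongly recommend BeautifulSoup for this. Writing an HTML cleaning tool is difficult (given how horrible most HTML is), and BeautifulSoup does a great job of both parsing HTML and dealing with Unicode.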

>>> from BeautifulSoup import BeautifulSoup
>>> soup = BeautifulSoup("<html><body><h1>Hi</h1></body></html>")
>>> print soup.prettify()
<html>
 <body>
  <h1>
   Hi
  </h1>
 </body>
</html>
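For the original goal of turning HTML into plain text, the sketch below uses the standard library's html.parser in Python 3 instead of BeautifulSoup, purely to stay dependency-free; it is not the BeautifulSoup API and is far less forgiving of bad markup. TextExtractor and html_to_text are hypothetical names, and turning the no-break space into an ordinary space is an assumption about the desired output.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect text nodes and drop the tags (hypothetical helper)."""

    def __init__(self):
        # convert_charrefs=True decodes entities such as &nbsp; into characters.
        super().__init__(convert_charrefs=True)
        self.parts = []

    def handle_data(self, data):
        self.parts.append(data)


def html_to_text(markup):
    parser = TextExtractor()
    parser.feed(markup)
    parser.close()
    # Assumption: no-break spaces should become ordinary spaces.
    return "".join(parser.parts).replace("\xa0", " ")


print(html_to_text("<html><body><h1>Hi&nbsp;there</h1></body></html>"))
```

The parser resolves the entities, so no hand-built dictionary of entities and symbols is needed for this step.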

Asked: 16 years, 4 months ago
Viewed: 18261 times
Last active: 11 years, 2 months ago