How to replace or remove HTML entities such as "&nbsp;" using BeautifulSoup 4

Question

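I'm processing HTML with Python and the BeautifulSoup 4 library, and I can't find an obvious way to replace &nbsp; with a space. Instead, it seems to be converted to a Unicode non-breaking space character.

Am I missing something obvious? What is the best way to replace &nbsp; with a normal space using BeautifulSoup?

Edit to add: I'm using the latest version, BeautifulSoup 4, so the convertEntities=BeautifulSoup.HTML_ENTITIES option from Beautiful Soup 3 is not available.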

Answer 1

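>>> from bs4 import BeautifulSoup
>>> soup = BeautifulSoup('<div>a&nbsp;b</div>')
>>> # The formatter is called on every string; swap U+00A0 for a plain space.
>>> soup.prettify(formatter=lambda s: s.replace(u'\xa0', ' '))
u'<html>\n <body>\n  <div>\n   a b\n  </div>\n </body>\n</html>'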

Answer 2

Mar*_*ers 14

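See Entities in the documentation. BeautifulSoup 4 produces proper Unicode for all entities:

An incoming HTML or XML entity is always converted into the corresponding Unicode character.

So yes, &nbsp; is turned into a non-breaking space character. If you really want those to be ordinary space characters instead, you'll have to do a Unicode replace.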
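To illustrate the Unicode replace mentioned here, this is a minimal sketch that rewrites each text node in place. It assumes the built-in 'html.parser' backend and a made-up sample string; neither comes from the answer itself.

```python
from bs4 import BeautifulSoup

# Sample markup is an assumption for illustration only.
html = '<div>a&nbsp;b <span>c&nbsp;d</span></div>'
soup = BeautifulSoup(html, 'html.parser')

# find_all returns a list, so replacing nodes while looping is safe.
for node in soup.find_all(string=True):
    if '\xa0' in node:
        # NavigableString is a str subclass, so str.replace works on it.
        node.replace_with(node.replace('\xa0', ' '))

text = soup.get_text()
print(repr(text))
```

If only the extracted text matters, calling soup.get_text().replace('\xa0', ' ') without touching the tree should be enough.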

Answer 3

小智 9

I would simply replace the non-breaking space by its Unicode value.

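# Here `soup` is assumed to be a plain str (the markup or extracted text),
# not a BeautifulSoup object.
nonBreakSpace = u'\xa0'
soup = soup.replace(nonBreakSpace, ' ')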

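A benefit is that this is an ordinary string operation: even though you are using BeautifulSoup, you do not need it for this step.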
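If the input may contain other non-breaking space characters besides U+00A0, the same plain-string idea can be extended with str.translate. This is only a sketch: SPACE_LIKE and to_plain_spaces are hypothetical names, and the choice of characters to map is an assumption about your data.

```python
# Hypothetical helper, not part of BeautifulSoup: map a few space-like
# characters to a regular space in a single pass over the string.
SPACE_LIKE = str.maketrans({
    '\xa0': ' ',    # NO-BREAK SPACE (what &nbsp; decodes to)
    '\u202f': ' ',  # NARROW NO-BREAK SPACE
    '\u2007': ' ',  # FIGURE SPACE
})


def to_plain_spaces(text: str) -> str:
    """Return text with the characters in SPACE_LIKE replaced by ' '."""
    return text.translate(SPACE_LIKE)


cleaned = to_plain_spaces('a\xa0b\u202fc')
print(repr(cleaned))
```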

Answer 4

als*_*str 8

Admittedly, this doesn't use BeautifulSoup, but a simpler solution today might be some combination of html.unescape and unicodedata.normalize, depending on your data and what you want to do.

>>> from html import unescape
>>> s = unescape('An enthusiastic member of the&nbsp;community')  # using the import here
>>> s  # echo the repr so the non-breaking space is visible
'An enthusiastic member of the\xa0community'
>>> import unicodedata
>>> s = unicodedata.normalize('NFKC', s)
>>> s
'An enthusiastic member of the community'
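One thing to keep in mind with the "depending on your data" caveat: NFKC folds every compatibility character, not just the non-breaking space. The sketch below uses a made-up sample string to compare it with a targeted replace.

```python
import unicodedata

# Sample contains the "fi" ligature (U+FB01), a no-break space (U+00A0)
# and a superscript two (U+00B2); it is an invented example.
sample = '\ufb01ve\xa0x\u00b2'

# NFKC rewrites all three characters to their compatibility equivalents.
nfkc = unicodedata.normalize('NFKC', sample)

# A targeted replace only touches the no-break space.
targeted = sample.replace('\xa0', ' ')

print(repr(nfkc))
print(repr(targeted))
```

If ligatures, superscripts, or similar characters must survive, the targeted replace is probably the safer choice.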

Answer 5

Mor*_*enB 5

I ran into a JSON issue that soup.prettify() did not fix, so I used unicodedata.normalize() instead:

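import json
import unicodedata
from bs4 import BeautifulSoup

# `r` is assumed to be an HTTP response object fetched earlier (not shown).
soup = BeautifulSoup(r.text, 'html.parser')
dat = soup.find('span', attrs={'class': 'date'})
print(f"date prints fine:'{dat.text}'")
print(f"json:{json.dumps(dat.text)}")
# NFKD maps the non-breaking space (U+00A0) to a regular space.
mydate = unicodedata.normalize("NFKD", dat.text)
print(f"json after normalizing:'{json.dumps(mydate)}'")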

date prints fine:'03 Nov 19 17:51'
json:"03\u00a0Nov\u00a019\u00a017:51"
json after normalizing:'"03 Nov 19 17:51"'
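For context on the output above, the \u00a0 sequences come from json.dumps escaping non-ASCII characters by default; the data itself is intact and round-trips. A small standalone sketch, with the date string hard-coded as an assumption in place of the scraped value:

```python
import json
import unicodedata

# Hard-coded stand-in for dat.text, with non-breaking spaces.
raw = '03\xa0Nov\xa019\xa017:51'

# Default json.dumps (ensure_ascii=True) escapes U+00A0 as \u00a0.
escaped = json.dumps(raw)

# Decoding restores the original string, non-breaking spaces included.
round_tripped = json.loads(escaped)

# Normalizing first yields plain spaces in the JSON text.
normalized = json.dumps(unicodedata.normalize('NFKD', raw))

print(escaped)
print(normalized)
```

So normalizing is only needed when the consumer of the JSON actually requires ordinary spaces.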
