Remove all HTML tags and their contents from text

Question

Ada*_*ver 9 html python beautifulsoup html-parsing

I would like to know how to remove all HTML tags, together with their contents, using BeautifulSoup.

Input:

... text <strong>ha</strong> ... text

Output:

... text ... text

Answer 1

ale*_*cxe 18

Use replace_with() (or its older alias replaceWith()):

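from bs4 import BeautifulSoup

text = "text <strong>ha</strong> ... text"

# Name the parser explicitly so the result does not depend on which parsers are installed
soup = BeautifulSoup(text, "html.parser")

# Replace each <strong> element, including its contents, with an empty string
for tag in soup.find_all('strong'):
    tag.replace_with('')

print(soup.get_text())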

Prints:

text  ... text

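Alternatively, as @mata suggested, you can call tag.decompose() instead of tag.replace_with(''). The output is the same, but decompose() states the intent more directly: it removes the tag and its contents from the tree and destroys them.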
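The answer above only targets <strong>. As a rough sketch of how decompose() could cover every tag, which is what the question asks for, something like the following may work. The helper name strip_tags_with_content is made up for this illustration, and the code assumes the built-in "html.parser", which does not wrap a fragment in <html>/<body>.

```python
from bs4 import BeautifulSoup


def strip_tags_with_content(html):
    """Return the text of `html` with every element and its contents removed.

    Hypothetical helper for illustration; not part of BeautifulSoup.
    """
    # Assumption: html.parser leaves fragments unwrapped, so top-level tags
    # are direct children of the soup object.
    soup = BeautifulSoup(html, "html.parser")
    # recursive=False selects only top-level tags; decomposing each one also
    # discards everything nested inside it.
    for tag in soup.find_all(True, recursive=False):
        tag.decompose()
    return soup.get_text()


print(strip_tags_with_content("... text <strong>ha</strong> ... text"))
```

Whitespace on either side of a removed tag is left untouched, so the two spaces that surrounded <strong>ha</strong> remain in the result.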

[`decompose`](http://www.crummy.com/software/BeautifulSoup/bs4/doc/#decompose) is probably the more appropriate choice. (7 upvotes)

Asked:	12 years, 5 months ago
Viewed:	13253 times
Last active:	11 years, 5 months ago