Stripping HTML tags from a string with Beautiful Soup

Question

Does anyone have sample code showing how to use Python's Beautiful Soup to strip all HTML tags from a string of text, except for a few?

I want to remove all JavaScript and HTML tags except:

<a></a>
<b></b>
<i></i>

And also:

<a onclick=""></a>

Thanks for your help. I couldn't find anything on the internet for this purpose.

Answer 1

unu*_*tbu 8

# Legacy BeautifulSoup 3 import; with bs4 use: from bs4 import BeautifulSoup, Tag
import BeautifulSoup

doc = '''<html><head><title>Page title</title></head><body><p id="firstpara" align="center">This is <i>paragraph</i> <a onclick="">one</a>.<p id="secondpara" align="blah">This is <i>paragraph</i> <b>two</b>.</html>'''
soup = BeautifulSoup.BeautifulSoup(doc)

# Walk every node in the tree and print only the <a>, <b> and <i> tags
for tag in soup.recursiveChildGenerator():
    if isinstance(tag, BeautifulSoup.Tag) and tag.name in ('a', 'b', 'i'):
        print(tag)

yields

<i>paragraph</i>
<a onclick="">one</a>
<i>paragraph</i>
<b>two</b>

If you only want the text content, you can change print(tag) to print(tag.string).
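
One caveat worth illustrating: in Beautiful Soup 4, .string is None when a tag has more than one child, whereas get_text() joins all the nested text. The sketch below assumes the bs4 package (not the BeautifulSoup 3 module used above); tag_texts is a hypothetical helper name made up for this example.

```python
from bs4 import BeautifulSoup


def tag_texts(html, names=("a", "b", "i")):
    """Return (name, .string, get_text()) for each tag whose name is in `names`.

    Hypothetical helper for illustration; assumes Beautiful Soup 4 (bs4).
    """
    soup = BeautifulSoup(html, "html.parser")
    # find_all() accepts a list of tag names and returns matches in document order
    return [(t.name, t.string, t.get_text()) for t in soup.find_all(list(names))]


sample = '<p>This is <i>paragraph</i> <b>two <i>nested</i></b>.</p>'
for name, string, text in tag_texts(sample):
    # For the <b> tag, .string is None because it has two children
    print(name, repr(string), repr(text))
```

If the tags you keep may contain nested markup, get_text() is probably the safer choice.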

If you want to remove the onclick="" attribute from the a tag, you can do the following:

if isinstance(tag,BeautifulSoup.Tag) and tag.name in ('a','b','i'):
    if tag.name=='a':
        del tag['onclick']
    print(tag)
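
The loop above prints the allowed tags but does not return the original text with everything else stripped, which is what the question asks for. A minimal sketch of that, assuming Beautiful Soup 4 (the bs4 package) rather than the BeautifulSoup 3 module used above: unwrap every tag that is not on the whitelist and drop attributes from the ones that are. strip_tags, ALLOWED_TAGS and keep_attrs are names invented for this example, and removing script/style together with their contents is my assumption about what "remove all javascript" means.

```python
from bs4 import BeautifulSoup

ALLOWED_TAGS = {"a", "b", "i"}            # tags to keep, per the question
DROP_WITH_CONTENT = ["script", "style"]   # assumption: remove these and their text


def strip_tags(html, allowed=ALLOWED_TAGS, keep_attrs=()):
    """Return `html` with every tag removed except those in `allowed`.

    Hypothetical helper for illustration. Attributes on kept tags are
    dropped unless their name is listed in `keep_attrs`.
    """
    soup = BeautifulSoup(html, "html.parser")

    # Remove script/style elements entirely, including their contents.
    for tag in soup.find_all(DROP_WITH_CONTENT):
        tag.decompose()

    # find_all(True) returns a list of all tags, so it is safe to
    # modify the tree while looping over it.
    for tag in soup.find_all(True):
        if tag.name in allowed:
            # Keep the tag but filter its attributes (e.g. drop onclick).
            tag.attrs = {k: v for k, v in tag.attrs.items() if k in keep_attrs}
        else:
            # Replace the tag with its children, keeping the text.
            tag.unwrap()

    return str(soup)


sample = ('<div><p>Hello <b>bold</b> '
          '<a href="/x" onclick="evil()">link</a>'
          '<script>alert(1)</script></p></div>')
print(strip_tags(sample))
# Hello <b>bold</b> <a>link</a>
print(strip_tags(sample, keep_attrs=("href",)))
# Hello <b>bold</b> <a href="/x">link</a>
```

This is only a sketch, not a security-grade sanitizer; for untrusted input a dedicated sanitizing library is likely the better tool.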
