BeautifulSoup: use Python to remove all HTML tags except the whitelisted "img" and "a" tags

Question

f12*_*6ck 5 html python parsing beautifulsoup html-parsing

Given some HTML, how can I remove all tags while keeping the text and the img and a tags? For example, I have

<div><script bla bla></script><p>Hello all <a href ="xx"></a> <img rscr="xx"></img></p></div>

and I want to keep

Hello all <a href ="xx"></a> <img rscr="xx"></img>

Is there something in BeautifulSoup or Python that does this?

Thanks.

Answer 1

Jos*_*ier 3

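You can select all descendant nodes by accessing the .descendants attribute.

From there, you can iterate over all the descendants and filter them by their name attribute. If a node has no name, it is most likely a text node, which you want to keep. If the name is a or img, you keep it as well.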

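from bs4 import BeautifulSoup

# The HTML from the question
html = '<div><script bla bla></script><p>Hello all <a href ="xx"></a> <img rscr="xx"></img></p></div>'
soup = BeautifulSoup(html, 'html.parser')

# This should be the wrapper that you are targeting
container = soup.find('div')
keep = []

for node in container.descendants:
    # Keep text nodes (no tag name) and the whitelisted tags
    if not node.name or node.name == 'a' or node.name == 'img':
        keep.append(node)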

Here is an alternative in which a list comprehension builds the filtered list directly:

# This should be the wrapper that you are targeting
container = soup.find('div')

keep = [node for node in container.descendants
        if not node.name or node.name == 'a' or node.name == 'img']
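
The name check above can also be written in terms of node types, which makes the intent explicit and allows HTML comments to be skipped (comments are string-like nodes as well, so a test that only looks at the tag name does not separate them from ordinary text). This is a minimal sketch, not part of the original answer; filter_nodes, WHITELIST, and the sample HTML are hypothetical names and data chosen for illustration:

```python
from bs4 import BeautifulSoup
from bs4.element import Comment, NavigableString, Tag

WHITELIST = ("a", "img")  # hypothetical constant: tag names to keep


def filter_nodes(container, whitelist=WHITELIST):
    """Return text nodes and whitelisted tags found under `container`."""
    kept = []
    for node in container.descendants:
        if isinstance(node, Tag):
            # Element node: keep it only if its name is whitelisted.
            if node.name in whitelist:
                kept.append(node)
        elif isinstance(node, NavigableString) and not isinstance(node, Comment):
            # Plain text node: keep it. Comments are skipped.
            kept.append(node)
    return kept


# Made-up sample input, used only to exercise the function.
sample = '<div><p>Hi <b>there</b><!-- note --> <a href="xx"></a></p></div>'
sample_soup = BeautifulSoup(sample, "html.parser")
nodes = filter_nodes(sample_soup.find("div"))
```

Note that text inside a kept tag (for example, link text inside an a element) is also a descendant, so it appears in the list both as part of the tag and as a separate text node.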

Also, if you don't want empty strings to be returned, you can strip the whitespace and check what is left:

keep = [node for node in container.descendants
        if (not node.name and len(node.strip())) or
           (node.name == 'a' or node.name == 'img')]

Given the HTML you provided, this returns:

> ['Hello all ', <a href="xx"></a>, <img rscr="xx"/>]
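
The list above holds node objects rather than a cleaned-up HTML string. If a string like the one asked for in the question is wanted, one possible approach (a different technique from the code in this answer) is to edit the tree in place: decompose() deletes a tag together with its contents, and unwrap() removes a tag but leaves its children where they were. This is a minimal sketch, assuming the html.parser backend; the strip_tags name, the WHITELIST and DROP_ENTIRELY constants, and the decision to drop script and style along with their contents are illustrative assumptions:

```python
from bs4 import BeautifulSoup

WHITELIST = ("a", "img")             # hypothetical constant: tags kept as markup
DROP_ENTIRELY = ["script", "style"]  # assumption: removed along with their contents


def strip_tags(html, whitelist=WHITELIST):
    """Return `html` with every tag outside `whitelist` removed, text kept."""
    soup = BeautifulSoup(html, "html.parser")

    # Delete script/style elements, including whatever is inside them.
    for tag in soup.find_all(DROP_ENTIRELY):
        tag.decompose()

    # find_all(True) matches every remaining tag and returns a list up front,
    # so modifying the tree inside the loop is safe.
    for tag in soup.find_all(True):
        if tag.name not in whitelist:
            tag.unwrap()  # drop the tag itself, keep its children

    return str(soup)


html = '<div><script bla bla></script><p>Hello all <a href ="xx"></a> <img rscr="xx"></img></p></div>'
cleaned = strip_tags(html)
print(cleaned)
```

Unlike the descendant list, this keeps text inside whitelisted tags in place, so link text is not duplicated. Attributes on kept tags are left untouched; filtering them would need an extra step.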

Archived: 8 years, 11 months ago
Views: 3145
Last recorded: 8 years, 11 months ago