Remove a tag using BeautifulSoup but keep its contents

Question

Currently my code does something like this:

soup = BeautifulSoup(value)

for tag in soup.findAll(True):
    if tag.name not in VALID_TAGS:
        tag.extract()
soup.renderContents()

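The problem is that I don't want to throw away the contents inside the invalid tags. How do I get rid of a tag but keep its contents when calling soup.renderContents()?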

Answer 1

sla*_*acy 68

Current versions of the BeautifulSoup library have an undocumented method on Tag objects called replaceWithChildren(). So you could do something like this:

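from BeautifulSoup import BeautifulSoup  # BeautifulSoup 3 (Python 2)

html = "<p>Good, <b>bad</b>, and <i>ug<b>l</b><u>y</u></i></p>"
invalid_tags = ['b', 'i', 'u']
soup = BeautifulSoup(html)
for tag in invalid_tags:
    for match in soup.findAll(tag):
        # Replace the tag with its own children, keeping the contents
        match.replaceWithChildren()
print soup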

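It looks like it behaves the way you want and is fairly straightforward code (it does make several passes over the DOM, one per tag name, but that could easily be optimized).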

I like simple. Note that the replaceWithChildren() method has been replaced by unwrap() in BS4. (17 upvotes)
This should be the answer. (7 upvotes)
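
For readers on BS4 and Python 3, here is a minimal sketch of the same idea using unwrap(), as the comment above mentions. The function name unwrap_tags and the choice of the "html.parser" parser are assumptions made for illustration and are not part of the original answer.

```python
from bs4 import BeautifulSoup


def unwrap_tags(html, invalid_tags):
    """Remove every tag named in invalid_tags but keep what is inside it."""
    # "html.parser" is an assumed parser choice; it does not add <html>/<body>.
    soup = BeautifulSoup(html, "html.parser")
    # Passing a list to find_all() matches any of the names in a single pass.
    for match in soup.find_all(invalid_tags):
        match.unwrap()  # BS4 name for replaceWithChildren()
    return str(soup)


html = "<p>Good, <b>bad</b>, and <i>ug<b>l</b><u>y</u></i></p>"
print(unwrap_tags(html, ["b", "i", "u"]))
# <p>Good, bad, and ugly</p>
```

Because find_all() returns a complete result list before the loop starts, unwrapping tags while iterating is safe here, and nested invalid tags are handled because their children are re-parented as each wrapper is removed.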

Answer 2

Jes*_*lon 56

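The strategy I used is to replace a tag with its contents if they are of type NavigableString; if they aren't, I recurse into them and replace their contents with NavigableStrings, and so on. Try this: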

from BeautifulSoup import BeautifulSoup, NavigableString

def strip_tags(html, invalid_tags):
    soup = BeautifulSoup(html)

    for tag in soup.findAll(True):
        if tag.name in invalid_tags:
            s = ""

            for c in tag.contents:
                if not isinstance(c, NavigableString):
                    c = strip_tags(unicode(c), invalid_tags)
                s += unicode(c)

            tag.replaceWith(s)

    return soup

html = "<p>Good, <b>bad</b>, and <i>ug<b>l</b><u>y</u></i></p>"
invalid_tags = ['b', 'i', 'u']
print strip_tags(html, invalid_tags)

The result is:

<p>Good, bad, and ugly</p>

I gave this same answer on another question. It seems to come up a lot.

Answer 3

cor*_*ord 17

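Although others have already mentioned this in the comments, I thought I'd post a full answer showing how to do it with Mozilla's Bleach. Personally, I think this is a lot nicer than using BeautifulSoup for this.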

import bleach
html = "<b>Bad</b> <strong>Ugly</strong> <script>Evil()</script>"
clean = bleach.clean(html, tags=[], strip=True)
print clean # Should print: "Bad Ugly Evil()"

Answer 4

Eti*_*nne 10

I have a simpler solution, but I don't know whether it has a drawback.

UPDATE: there is a drawback, see Jesse Dhillon's comment. Also, another solution is to use Mozilla's Bleach instead of BeautifulSoup.

from BeautifulSoup import BeautifulSoup

VALID_TAGS = ['div', 'p']

value = '<div><p>Hello <b>there</b> my friend!</p></div>'

soup = BeautifulSoup(value)

for tag in soup.findAll(True):
    if tag.name not in VALID_TAGS:
        tag.replaceWith(tag.renderContents())

print soup.renderContents()

This also prints <div><p>Hello there my friend!</p></div> as desired.
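
The snippet above swaps a tag for its rendered contents as a plain string, so any markup inside that string is no longer part of the parse tree. A rough BS4 sketch of the same allowlist idea that keeps the children as real nodes might look like the following. The function name keep_only and the "html.parser" parser choice are assumptions for illustration, not something from this answer.

```python
from bs4 import BeautifulSoup

VALID_TAGS = {"div", "p"}


def keep_only(html, valid_tags=VALID_TAGS):
    """Drop every tag that is not in valid_tags, keeping its children in place."""
    soup = BeautifulSoup(html, "html.parser")  # assumed parser choice
    for tag in soup.find_all(True):  # True matches every tag
        if tag.name not in valid_tags:
            # Children stay in the tree as nodes, so nested tags are still
            # visited and checked later in this same loop.
            tag.unwrap()
    return str(soup)


print(keep_only("<div><p>Hello <b>there</b> my friend!</p></div>"))
# <div><p>Hello there my friend!</p></div>
print(keep_only("<div><b>Hello <i>there</i> <p>x</p></b></div>"))
# <div>Hello there <p>x</p></div>
```

The second call shows a nested case: the invalid <i> inside the invalid <b> is also removed, while the valid <p> survives as a tag.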

Answer 5

小智 8

You can use soup.text

.text removes all tags and concatenates all the text.
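
As a small sketch of what that looks like in BS4 (the helper name text_only and the "html.parser" parser are assumptions for illustration), note that this strips every tag, including ones you might have wanted to keep, so it only fits when no tags at all should survive:

```python
from bs4 import BeautifulSoup


def text_only(html):
    """Return the concatenated text of the document with all tags removed."""
    soup = BeautifulSoup(html, "html.parser")  # assumed parser choice
    return soup.get_text()  # soup.text is a shorthand property for this


print(text_only("<div><p>Hello <b>there</b> my friend!</p></div>"))
# Hello there my friend!
```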

Answer 6

Ale*_*lli 7

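Before removing the tag, you'll presumably have to move the tag's children so that they become children of the tag's parent. Is that what you mean?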

If so, then, while inserting the contents in the right place is tricky, something like this should work:

from BeautifulSoup import BeautifulSoup

VALID_TAGS = 'div', 'p'

value = '<div><p>Hello <b>there</b> my friend!</p></div>'

soup = BeautifulSoup(value)

for tag in soup.findAll(True):
    if tag.name not in VALID_TAGS:
        for i, x in enumerate(tag.parent.contents):
          if x == tag: break
        else:
          print "Can't find", tag, "in", tag.parent
          continue
        for r in reversed(tag.contents):
          tag.parent.insert(i, r)
        tag.extract()
print soup.renderContents()

With the example value, this prints <div><p>Hello there my friend!</p></div> as desired.
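
For comparison, the same "move the children up, then drop the tag" steps can be sketched against BS4 on Python 3. This is only an illustration under assumptions: the function name hoist_children and the "html.parser" parser are not from the answer, and in BS4 the built-in unwrap() performs these steps for you.

```python
from bs4 import BeautifulSoup

VALID_TAGS = ("div", "p")


def hoist_children(html, valid_tags=VALID_TAGS):
    """Move each invalid tag's children into its parent, then remove the tag."""
    soup = BeautifulSoup(html, "html.parser")  # assumed parser choice
    for tag in soup.find_all(True):
        if tag.name in valid_tags:
            continue
        parent = tag.parent
        position = parent.index(tag)  # index() locates the tag by identity
        # Insert in reverse so the children end up in their original order;
        # copy the list because insert() detaches each child from the tag.
        for child in reversed(list(tag.contents)):
            parent.insert(position, child)
        tag.extract()  # the now-empty tag is dropped
    return str(soup)


print(hoist_children("<div><p>Hello <b>there</b> my friend!</p></div>"))
# <div><p>Hello there my friend!</p></div>
```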

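@Jason, apart from needing an outermost tag, the string you give is perfectly valid and comes out unchanged from the code I give, so I have absolutely no idea what your comment is **about**! (3 upvotes)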

Archived: 15 years, 11 months ago
Views: 52,887
Last activity: 6 years, 4 months ago