How to ignore empty lines when using .next_sibling in BeautifulSoup4 in Python

Question

sve*_*ann 6 python beautifulsoup html-parsing

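I want to remove duplicated placeholders from an HTML page, so I use BeautifulSoup's .next_sibling attribute. As long as the duplicates are on the same line, this works fine (see data). But sometimes there is an empty line between them, so I want .next_sibling to ignore those lines (see data2).

Here is the code: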

from bs4 import BeautifulSoup, Tag
data = "<p>method-removed-here</p><p>method-removed-here</p><p>method-removed-here</p>"
data2 = """<p>method-removed-here</p>

<p>method-removed-here</p>

<p>method-removed-here</p>

<p>method-removed-here</p>

<p>method-removed-here</p>
"""
soup = BeautifulSoup(data)
string = 'method-removed-here'
for p in soup.find_all("p"):
    while isinstance(p.next_sibling, Tag) and p.next_sibling.name== 'p' and p.text==string:
        p.next_sibling.decompose()
print(soup)

The output for data is as expected:

<html><head></head><body><p>method-removed-here</p></body></html>

The output for data2 (this is what needs fixing):

<html><head></head><body><p>method-removed-here</p>

<p>method-removed-here</p>

<p>method-removed-here</p>

<p>method-removed-here</p>

<p>method-removed-here</p>
</body></html>

I could not find anything useful in the BeautifulSoup4 documentation, and .next_element is not what I am looking for either.

Answer 1

les*_*ana 7

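Use find_next_sibling() instead of next_sibling, and likewise find_previous_sibling() instead of previous_sibling.

Reason: next_sibling does not necessarily return the next HTML tag; it returns the next "soup element". Usually that is the whitespace between tags, but it can be other things as well. find_next_sibling(), on the other hand, returns the next HTML tag and ignores whitespace and other content between the tags.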

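I restructured your code slightly for this demonstration. I hope it is semantically the same.

Code with next_sibling, which shows the same behavior you describe (works for data but not for data2):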

from bs4 import BeautifulSoup, Tag
data = "<p>method-removed-here</p><p>method-removed-here</p><p>method-removed-here</p>"
data2 = """<p>method-removed-here</p>

<p>method-removed-here</p>

<p>method-removed-here</p>

<p>method-removed-here</p>

<p>method-removed-here</p>
"""
soup = BeautifulSoup(data, 'html.parser')
string = 'method-removed-here'
for p in soup.find_all("p"):
    while True:
        ns = p.next_sibling
        if isinstance(ns, Tag) and ns.name== 'p' and p.text==string:
            ns.decompose()
        else:
            break
print(soup)

Code with find_next_sibling(), which works for both data and data2:

soup = BeautifulSoup(data, 'html.parser')
string = 'method-removed-here'
for p in soup.find_all("p"):
    while True:
        ns = p.find_next_sibling()
        if isinstance(ns, Tag) and ns.name== 'p' and p.text==string:
            ns.decompose()
        else:
            break
print(soup)

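One thing to be aware of: the loop only checks the text of the current `<p>`, so it removes the following `<p>` even when that paragraph contains something else. If that is not intended, a variant that compares both paragraphs could look like the sketch below. `drop_repeated_placeholders` is a hypothetical helper name, the sample markup is made up for illustration, and `html.parser` is assumed as the parser:

```python
from bs4 import BeautifulSoup

def drop_repeated_placeholders(soup, placeholder):
    """Collapse runs of adjacent <p> tags whose text equals `placeholder`."""
    p = soup.find("p")
    while p is not None:
        nxt = p.find_next_sibling()  # next Tag sibling; whitespace is skipped
        if (
            nxt is not None
            and nxt.name == "p"
            and p.get_text() == placeholder
            and nxt.get_text() == placeholder
        ):
            nxt.decompose()          # stay on p and inspect its new neighbour
        else:
            p = p.find_next("p")     # move on to the next <p> in document order
    return soup

placeholder = "method-removed-here"
html = "<p>{0}</p>\n\n<p>{0}</p>\n<p>keep</p>\n\n<p>{0}</p>\n<p>{0}</p>".format(placeholder)
soup = drop_repeated_placeholders(BeautifulSoup(html, "html.parser"), placeholder)
remaining = [p.get_text() for p in soup.find_all("p")]
print(remaining)  # ['method-removed-here', 'keep', 'method-removed-here']
```

The whitespace text nodes between the removed tags are left in place; only the duplicate tags are dropped.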

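Bonus:

.children and .contents also return the whitespace between tags. Use .find_all(True) instead to get only the tags.

See here for more information: BeautifulSoup .children or .content without whitespace between tags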
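
A short sketch of the bonus point, with made-up markup and `html.parser` assumed as the parser. Note that `find_all(True)` searches all descendants by default, so `recursive=False` is needed to get only the direct child tags:

```python
from bs4 import BeautifulSoup

html = "<div>\n<p>one</p>\n\n<p>two <b>bold</b></p>\n</div>"
div = BeautifulSoup(html, "html.parser").div

# .contents (and the .children iterator) include the whitespace text nodes.
print(len(div.contents))   # 5: "\n", <p>, "\n\n", <p>, "\n"

# find_all(True) returns only Tag objects, but it also descends into children.
all_tags = [t.name for t in div.find_all(True)]
print(all_tags)            # ['p', 'p', 'b']

# recursive=False restricts the search to direct children.
child_tags = [t.name for t in div.find_all(True, recursive=False)]
print(child_tags)          # ['p', 'p']
```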

Answer 2

sve*_*ann 6

I was able to solve this with a workaround. The problem is described in the BeautifulSoup Google group, where they suggest running the HTML through a preprocessor first:

import re

def bs_preprocess(html):
    """Remove distracting whitespace and newline characters."""
    pat = re.compile(r'(^[\s]+)|([\s]+$)', re.MULTILINE)
    html = re.sub(pat, '', html)        # remove leading and trailing whitespace on each line
    html = re.sub(r'\n', ' ', html)     # convert newlines to spaces
                                        # this preserves newline delimiters
    html = re.sub(r'[\s]+<', '<', html) # remove whitespace before opening tags
    html = re.sub(r'>[\s]+', '>', html) # remove whitespace after closing tags
    return html

It is not the best solution, but it is one.
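
To check what the preprocessor does to input like data2, here is a self-contained sketch. It repeats the function in condensed form so that it runs on its own; `raw` and `cleaned` are illustrative names, and `html.parser` is assumed as the parser:

```python
import re
from bs4 import BeautifulSoup, Tag

def bs_preprocess(html):
    """Condensed copy of the preprocessor: strip whitespace around tags."""
    html = re.sub(r"(^\s+)|(\s+$)", "", html, flags=re.MULTILINE)
    html = re.sub(r"\n", " ", html)
    html = re.sub(r"\s+<", "<", html)
    html = re.sub(r">\s+", ">", html)
    return html

raw = "<p>method-removed-here</p>\n\n<p>method-removed-here</p>\n"
cleaned = bs_preprocess(raw)
print(cleaned)  # <p>method-removed-here</p><p>method-removed-here</p>

# With the whitespace gone, .next_sibling is the next tag again.
soup = BeautifulSoup(cleaned, "html.parser")
print(isinstance(soup.p.next_sibling, Tag))  # True
```

Keep in mind that this rewrites whitespace everywhere in the document, including inside `<pre>` blocks and in text that starts right after a tag, so it is only suitable when that whitespace does not matter.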

Answer 3

neu*_*nap 5

Not a great solution either, but this worked for me:

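from bs4 import NavigableString

def get_sibling(element):
    sibling = element.next_sibling
    # Skip any whitespace-only text node ("\n", "\n\n", ...); comparing
    # against a single "\n" would miss the empty lines in data2.
    if isinstance(sibling, NavigableString) and not sibling.strip():
        return get_sibling(sibling)
    else:
        return sibling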
