BeautifulSoup sibling structure with br tags

Question

I'm trying to parse an HTML document with the BeautifulSoup Python library, but the structure gets distorted by <br> tags. Let me give an example.

Input HTML:

<div>
  some text <br>
  <span> some more text </span> <br>
  <span> and more text </span>
</div>

The HTML as BeautifulSoup interprets it:

<div>
  some text
  <br>
    <span> some more text </span>
    <br>
      <span> and more text </span>
    </br>
  </br>
</div>

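In the source, the spans can be considered siblings. After parsing (with the default parser), the spans are suddenly no longer siblings, because the br tags have become part of the structure.

The only solution I can think of is to strip the <br> tags entirely before feeding the HTML to BeautifulSoup, but that doesn't seem elegant, since it requires me to change the input. What's a better way to solve this?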

Answer 1

Ter*_*ryA 8

Your best bet is to extract() the line breaks. It's easier than you think :).

>>> from bs4 import BeautifulSoup as BS
>>> html = """<div>
...   some text <br>
...   <span> some more text </span> <br>
...   <span> and more text </span>
... </div>"""
>>> soup = BS(html, 'lxml')  # name the parser explicitly; the <html>/<body> wrapper in the output below matches lxml
>>> for linebreak in soup.find_all('br'):
...     linebreak.extract()
... 
<br/>
<br/>
>>> print(soup.prettify())
<html>
 <body>
  <div>
   some text
   <span>
    some more text
   </span>
   <span>
    and more text
   </span>
  </div>
 </body>
</html>
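
To check that the spans really are siblings again once the line breaks are gone, something like the following sketch could be used. It assumes Python 3 and the stdlib-backed "html.parser" (the answer above does not name a parser); the variable names are only illustrative.

```python
from bs4 import BeautifulSoup

html = """<div>
  some text <br>
  <span> some more text </span> <br>
  <span> and more text </span>
</div>"""

# Assumption: "html.parser" is chosen here so the sketch needs no extra install.
soup = BeautifulSoup(html, "html.parser")

# Remove every <br> from the tree, as in the answer above.
for br in soup.find_all("br"):
    br.extract()

first, second = soup.find_all("span")

# Both spans should now hang directly off the same <div>.
same_parent = first.parent is second.parent

# find_next_sibling() skips the whitespace text nodes between the tags.
next_span = first.find_next_sibling("span")

print(same_parent, next_span is second)
```

If the second value printed is False, the parser in use has nested the spans and a different parser may be worth trying.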

Answer 2

jns*_*jns 5

You can also do it like this:

str(soup).replace("</br>", "")

Answer 3

red*_*Fur 5

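This is a very old question, but I ran into a similar problem because my document contained closing </br> tags. As a result, BeautifulSoup simply ignored a large part of the document (I guess bs was trying to handle a closing tag). soup.find_all('br') actually found nothing, because there were no opening br tags, so I couldn't use the extract() method.

After struggling for an hour, I found that using the lxml parser instead of the default html parser solves the problem.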

soup = BeautifulSoup(page, 'lxml')
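
Since the result depends on which parser is installed, a quick way to compare them on your own markup is a small helper like the one below. This is only a sketch: count_br_by_parser is a hypothetical name, the sample markup is made up, and parsers that are not installed are simply skipped.

```python
from bs4 import BeautifulSoup, FeatureNotFound


def count_br_by_parser(markup, parsers=("html.parser", "lxml", "html5lib")):
    """Return {parser name: number of <br> tags found} for each usable parser."""
    counts = {}
    for name in parsers:
        try:
            soup = BeautifulSoup(markup, name)
        except FeatureNotFound:
            # This parser is not installed in the current environment; skip it.
            continue
        counts[name] = len(soup.find_all("br"))
    return counts


# Hypothetical sample containing a stray closing </br> tag.
sample = "<div>some text</br><span>more text</span></div>"
counts = count_br_by_parser(sample)
print(counts)
```

Parsers that report zero br tags for a document like this are the ones where extract() has nothing to work on.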
