How can I prevent BeautifulSoup (Python) from closing tags in malformed HTML?

Question


paw*_*wel 5 python parsing beautifulsoup html-parsing

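I automatically translate the content of HTML pages into different languages, so I have to extract all text nodes from various HTML pages that are sometimes badly written (I cannot edit these HTML files).

With BeautifulSoup I can easily extract these texts and replace them with translations, but when I output the HTML after these operations (html = BeautifulSoup(source_html)), it is sometimes broken because BeautifulSoup automatically closes tags (for example, a table tag gets closed in the wrong place).

Is there a way to stop BeautifulSoup from closing these tags?

For example, this is my input:

html = "<table><tr><td>some text</td></table>" (the closing </tr> is missing)

After soup = BeautifulSoup(html) I get "<table><tr><td>some text</td></tr></table>"

I would like to get exactly the same HTML as the input...

Is that possible?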

Answer 1

Sha*_*hin 4

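BeautifulSoup is good at parsing and extracting data from malformed HTML/XML, but when the broken HTML is ambiguous it applies a set of rules to interpret the tags (which may not give you what you want). See the section on parsing HTML in the documentation, which ends with an example that sounds very similar to your situation.

If you know what is wrong with the tags and understand the rules BeautifulSoup uses, you may be able to massage the HTML a little (perhaps by removing or moving some tags) so that BeautifulSoup returns the output you want.

If you can post a short example, someone may be able to give you more specific help.

Update (some examples)

For example, consider the example given in the documentation (linked above):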
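To find out what is wrong with the tags before deciding what to remove or move, a small stack-based check can help. The sketch below is not part of BeautifulSoup: it uses the standard library's html.parser (Python 3), and the names TagAudit and audit are made up for illustration. Keep in mind that HTML allows some end tags (such as </tr> and </td>) to be omitted, so an "unclosed" report is only a hint about where a parser has to guess, not necessarily an error.

```python
from html.parser import HTMLParser

# Elements that never take an end tag, so they are not pushed on the stack.
VOID = {"area", "base", "br", "col", "embed", "hr", "img", "input",
        "link", "meta", "source", "track", "wbr"}


class TagAudit(HTMLParser):
    """Hypothetical helper: record start tags that are never closed."""

    def __init__(self):
        super().__init__()
        self.stack = []      # currently open, non-void tags
        self.problems = []   # human-readable findings, in document order

    def handle_starttag(self, tag, attrs):
        if tag not in VOID:
            self.stack.append(tag)

    def handle_startendtag(self, tag, attrs):
        # Self-closing syntax such as <br /> opens and closes nothing.
        pass

    def handle_endtag(self, tag):
        if tag in self.stack:
            # Everything opened after the nearest matching start tag
            # was left unclosed.
            while self.stack[-1] != tag:
                self.problems.append("unclosed <%s>" % self.stack.pop())
            self.stack.pop()
        else:
            self.problems.append("stray </%s>" % tag)


def audit(html):
    """Return a list of findings for the given HTML string."""
    parser = TagAudit()
    parser.feed(html)
    parser.close()
    # Tags still open at the end of the input were never closed either.
    for tag in reversed(parser.stack):
        parser.problems.append("unclosed <%s>" % tag)
    return parser.problems


print(audit("<table><tr><td>some text</td></table>"))
```

For the input from the question this reports the missing </tr>, which is exactly the spot where BeautifulSoup inserts a closing tag of its own.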

from BeautifulSoup import BeautifulSoup
html = """
<html>
<form>
 <table>
 <td><input name="input1">Row 1 cell 1
 <tr><td>Row 2 cell 1
 </form> 
 <td>Row 2 cell 2<br>This</br> sure is a long cell
</body> 
</html>"""
print BeautifulSoup(html).prettify()


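The <table> tag will be closed before </form> so that the table is properly nested inside the form, leaving the last <td> dangling.

Once we understand the problem, we can get the closing tag (</table>) in the right place by removing "<form>" before parsing: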

>>> html = html.replace("<form>", "")
>>> soup = BeautifulSoup(html)
>>> print soup.prettify()
<html>
 <table>
  <td>
   <input name="input1" />
   Row 1 cell 1
  </td>
  <tr>
   <td>
    Row 2 cell 1
   </td>
   <td>
    Row 2 cell 2
    <br />
    This
    sure is a long cell
   </td>
  </tr>
 </table>
</html>


If the <form> tag is important, you can still add it back after parsing. For example:

>>> from BeautifulSoup import Tag  # Tag was not imported in the earlier snippet
>>> new_form = Tag(soup, "form")  # create form element
>>> soup.html.insert(0, new_form)  # insert form as child of html
>>> new_form.insert(0, soup.table.extract()) # move table into form
>>> print soup.prettify()
<html>
 <form>
  <table>
   <td>
    <input name="input1" />
    Row 1 cell 1
   </td>
   <tr>
    <td>
     Row 2 cell 1
    </td>
    <td>
     Row 2 cell 2
     <br />
     This
     sure is a long cell
    </td>
   </tr>
  </table>
 </form>
</html>
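
If the goal is to get back exactly the input markup with only the text changed, one alternative is to skip building a tree altogether. This is not what the examples above do and it does not use BeautifulSoup: it is a minimal sketch that tokenises the page with the standard library's html.parser (Python 3), echoes every tag as it was written, and passes only the text nodes through a callback. The names TextReplacer and replace_text are hypothetical, and str.upper stands in for a real translation function.

```python
from html.parser import HTMLParser


class TextReplacer(HTMLParser):
    """Hypothetical helper: echo markup as-is, transform text nodes only."""

    def __init__(self, translate):
        # convert_charrefs=False reports entities such as &amp; as separate
        # events, so they can be written back unchanged.
        super().__init__(convert_charrefs=False)
        self.translate = translate
        self.raw = False   # True while inside <script> or <style>
        self.out = []

    def handle_starttag(self, tag, attrs):
        # get_starttag_text() returns the start tag exactly as written.
        self.out.append(self.get_starttag_text())
        if tag in ("script", "style"):
            self.raw = True

    def handle_startendtag(self, tag, attrs):
        self.out.append(self.get_starttag_text())

    def handle_endtag(self, tag):
        # Limitation: the end tag is rebuilt from the lowercased name,
        # so </TABLE> comes back as </table>.
        self.out.append("</%s>" % tag)
        if tag in ("script", "style"):
            self.raw = False

    def handle_data(self, data):
        # Leave whitespace-only nodes and script/style bodies untouched.
        if self.raw or not data.strip():
            self.out.append(data)
        else:
            self.out.append(self.translate(data))

    def handle_entityref(self, name):
        self.out.append("&%s;" % name)

    def handle_charref(self, name):
        self.out.append("&#%s;" % name)

    def handle_comment(self, data):
        self.out.append("<!--%s-->" % data)

    def handle_decl(self, decl):
        self.out.append("<!%s>" % decl)


def replace_text(html, translate):
    """Return html with every text node passed through translate()."""
    parser = TextReplacer(translate)
    parser.feed(html)
    parser.close()
    return "".join(parser.out)


print(replace_text("<table><tr><td>some text</td></table>", str.upper))
```

Because no tree is built, no tag is ever added, closed, or moved; the missing </tr> from the question simply stays missing. The trade-off is that you lose BeautifulSoup's navigation and search features, and a text node interrupted by an entity arrives at the callback in several pieces.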

