Missing special characters and tags while parsing HTML using BeautifulSoup

Question

Dav*_*ale 3 python parsing beautifulsoup html-parsing python-3.x

I am trying to parse an HTML document using BeautifulSoup with Python.

But it stops parsing at special characters, as in this example:

    from bs4 import BeautifulSoup
    doc = '''
    <html>
        <body>
            <div>And I said «What the %&#@???»</div>
            <div>some other text</div>
        </body>
    </html>'''
    soup = BeautifulSoup(doc, 'html.parser')
    print(soup)

This code should output the whole document. Instead, it prints only

    <html>
    <body>
    <div>And I said «What the %</div></body></html>

The rest of the document is apparently lost. The parser was stopped by the combination '&#'.

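The question is: how can I either set up BS or preprocess the document to avoid such problems, while losing as little (possibly informative) text as possible?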

I use bs4 version 4.6.0 with Python 3.6.1 on Windows 10.

Update. Calling soup.prettify() does not help, because soup is already broken by that point.

Answer 1

Moi*_*dri 5

You need to pass 'html5lib' as the parser instead of 'html.parser' when creating the BeautifulSoup object. For example:

    from bs4 import BeautifulSoup
    doc = '''
    <html>
        <body>
            <div>And I said «What the %&#@???»</div>
            <div>some other text</div>
        </body>
    </html>'''

    soup = BeautifulSoup(doc,  'html5lib')
    #          different parser  ^

Now, if you print soup, it will display the string you want:

    >>> print(soup)
    <html><head></head><body>
            <div>And I said «What the %&amp;#@???»</div>
            <div>some other text</div>

    </body></html>
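
To check which parsers keep the whole document on a given setup, one option is to count the <div> tags each parser finds. This is only a rough sketch: the helper name count_divs is made up for illustration, and it assumes bs4 is installed. lxml and html5lib are separate packages, so any parser that is not installed is simply skipped. The result for html.parser depends on the Python and bs4 versions in use, so no output is shown here.

```python
from bs4 import BeautifulSoup, FeatureNotFound

doc = """
<html>
    <body>
        <div>And I said «What the %&#@???»</div>
        <div>some other text</div>
    </body>
</html>"""


def count_divs(markup, parser_names=("html.parser", "lxml", "html5lib")):
    """Return {parser name: number of <div> tags found}.

    Hypothetical helper for illustration. Parsers that are not
    installed raise FeatureNotFound and are left out of the result.
    """
    counts = {}
    for name in parser_names:
        try:
            soup = BeautifulSoup(markup, name)
        except FeatureNotFound:
            continue  # this parser library is not available here
        counts[name] = len(soup.find_all("div"))
    return counts


for name, n in count_divs(doc).items():
    print(f"{name}: {n} <div> tag(s)")
```

A parser that kept the full document should report 2 here. A count of 1 means the second <div> was dropped, as in the question.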

From the "Differences between parsers" section of the documentation:

Unlike html5lib, html.parser makes no attempt to create a well-formed HTML document by adding a <body> tag. Unlike lxml, it doesn't even bother to add an <html> tag.
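
The question also asks about preprocessing. As an alternative that is not part of the answer above, a stray '&' could be escaped before the text reaches the parser. The sketch below uses only the standard library. The function name escape_stray_ampersands and the regular expression are assumptions made for illustration, and the pattern treats a '&' as a character reference only when it is followed by a name or number and a closing ';'.

```python
import re

# A "&" is left alone only when it starts something shaped like a
# character reference: &name;  &#123;  or  &#x1F;
_STRAY_AMP = re.compile(
    r"&(?!(?:[A-Za-z][A-Za-z0-9]*|#[0-9]+|#[xX][0-9A-Fa-f]+);)"
)


def escape_stray_ampersands(html_text):
    """Replace every '&' that does not start a character reference with '&amp;'.

    Hypothetical helper for illustration, not part of bs4.
    """
    return _STRAY_AMP.sub("&amp;", html_text)


doc = "<div>And I said «What the %&#@???»</div><div>caf&eacute; R&D</div>"
print(escape_stray_ampersands(doc))
# <div>And I said «What the %&amp;#@???»</div><div>caf&eacute; R&amp;D</div>
```

One known limitation: HTML also accepts a few legacy references without the trailing ';' (for example '&amp' on its own), and this pattern would escape those as well, so it may change text that a browser would have decoded.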

Archived:	8 years ago
Views:	962
Last recorded:	6 years, 11 months ago