Overcoming the maximum parsing depth limit of lxml.etree.HTML.xpath

Question

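The HTML parser behind lxml.etree's xpath appears to have a maximum depth limit. Once the nesting depth exceeds 254, it does not traverse any deeper to extract text. The following Python snippet demonstrates this: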

import lxml.etree as etree

# Opening and closing HTML tags used to build the nesting
x = "<span>"
x_ = "</span>"

# Set the nesting depth to 255
depth = 255

# Construct and parse using lxml.etree.HTML
# This gives an empty list []
print(etree.HTML(x * depth + "<p>text to be extracted</p>" + x_ * depth).xpath("//p//text()"))

# Set the nesting depth to 254
depth = 254

# This gives the correct result ['text to be extracted']
print(etree.HTML(x * depth + "<p>text to be extracted</p>" + x_ * depth).xpath("//p//text()"))


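In some use cases we encounter large text files nested more than 254 levels deep, and the lxml parser then fails to return the required text. How can we get past this limit so that it parses documents nested deeper than 254 levels?

The XSLT documentation provides a static method named set_global_max_depth that lets users customize the maximum depth it can traverse. Is a similar method available for lxml.etree.HTML?

This mailing list post discusses XSLT traversal depth and may be helpful.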

Answer 1

nwe*_*hof 0

Try parsing with a custom HTMLParser instance that has huge_tree set to True. See also this question for the XML case. (By the way, this has nothing to do with XPath.)
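
A minimal sketch of that suggestion, reusing the markup from the question. The helper names build_nested_html and extract_p_text are hypothetical and exist only for illustration; HTMLParser(huge_tree=True) and the parser argument of etree.HTML are the lxml pieces involved. Whether the deeper document parses may also depend on the libxml2 version that lxml was built against.

```python
import lxml.etree as etree


def build_nested_html(depth):
    """Hypothetical helper: wrap a <p> element in `depth` nested <span> elements."""
    return "<span>" * depth + "<p>text to be extracted</p>" + "</span>" * depth


def extract_p_text(html, huge_tree=True):
    """Hypothetical helper: parse `html` with a dedicated HTMLParser
    and return the text nodes found under every <p> element."""
    # huge_tree=True asks libxml2 to lift its default safety limits,
    # including the restriction on how deeply elements may be nested.
    parser = etree.HTMLParser(huge_tree=huge_tree)
    root = etree.HTML(html, parser=parser)
    return root.xpath("//p//text()")


if __name__ == "__main__":
    # Depth 255 is the case that returned [] with the default parser in the question.
    print(extract_p_text(build_nested_html(255)))
```

Because huge_tree disables protections against very deep or very large documents, it is probably best reserved for input from a trusted source.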
