Consider the following Python script:
from lxml import etree
html = '''
<html xmlns="http://www.w3.org/1999/xhtml">
<head></head>
<body>
<p>This is some text followed with 2 citations.<span class="footnote">1</span>
<span class="footnote">2</span>This is some more text.</p>
</body>
</html>'''
tree = etree.fromstring(html)
for element in tree.findall(".//{*}span"):
    if element.get("class") == 'footnote':
        print(etree.tostring(element, encoding="unicode", pretty_print=True))
The desired output would be just the two span elements, but instead I get:
<span xmlns="http://www.w3.org/1999/xhtml" class="footnote">1</span>
<span xmlns="http://www.w3.org/1999/xhtml" class="footnote">2</span>This is some more text.
Why does the output include the text that follows the element, all the way to the end of the parent element?
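I am trying to link the footnotes using lxml. When I use a.insert() to put the span element into the a element I create for it, the text that follows the span comes along with it, so a lot of text that I don't want linked ends up inside the link.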