Strange lxml behavior

Flu*_*ffy 1 python lxml

Consider the following snippet:

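import lxml.html

html = '<div><br />Hello text</div>'
doc = lxml.html.fromstring(html)
# the first text node in the tree is 'Hello text'
text = doc.xpath('//text()')[0]
# encoding='unicode' makes tostring() return str rather than bytes
print(lxml.html.tostring(text.getparent(), encoding='unicode'))
# prints <br>Hello text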

I expected to see '<div><br />Hello text</div>', because br cannot contain nested text and is "self-closing" (by which I mean the />). How does lxml handle this?

nos*_*klo 8

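HTML has no self-closing tags; that is an XML thing. In HTML, br is a void element, so the HTML serializer writes it as <br>. Here is the same snippet parsed as XML with lxml.etree: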

import lxml.etree

html = '<div><br />Hello text</div>'
doc = lxml.etree.fromstring(html)
text = doc.xpath('//text()')[0]
# encoding='unicode' makes tostring() return str rather than bytes
print(lxml.etree.tostring(text.getparent(), encoding='unicode'))

This prints:

<br/>Hello text

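Note that the text is not inside the tag. lxml has a concept of a "tail": text that directly follows an element is stored on that element as its tail, and tostring() includes the tail by default. That is why getparent() returns the br element rather than the div.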

>>> br = text.getparent()
>>> print(br.text)
None
>>> print(br.tail)
Hello text