请考虑以下代码段:
import lxml.html
html = '<div><br />Hello text</div>'
doc = lxml.html.fromstring(html)
text = doc.xpath('//text()')[0]
print lxml.html.tostring(text.getparent())
#prints <br>Hello text
Run Code Online (Sandbox Code Playgroud)
我期待看到'<div><br />Hello text</div>',因为br不能嵌套文本并且是"自我封闭"(我的意思是/>).如何lxml处理它?
HTML没有自动关闭标签.这是一个xml的东西.
import lxml.etree
html = '<div><br />Hello text</div>'
doc = lxml.etree.fromstring(html)
text = doc.xpath('//text()')[0]
print lxml.etree.tostring(text.getparent())
Run Code Online (Sandbox Code Playgroud)
版画
<br/>Hello text
Run Code Online (Sandbox Code Playgroud)
请注意,文本不在标记内.lxml有一个" tail"的概念.
>>> print text.text
None
>>> print text.tail
Hello text
Run Code Online (Sandbox Code Playgroud)