use*_*034 14 python xml lxml elementtree xml.etree
I scraped some HTML via XPath and then converted it into an etree. Something similar to this:
<td> text1 <a> link </a> text2 </td>
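But when I call element.text, I only get text1. The rest must be there: when I check my query in FireBug, the text of the element is highlighted, both the text before and the text after the embedded anchor element.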
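A minimal sketch of where each piece of that markup ends up, assuming the fragment is parsed with `etree.fromstring`. The variable names are illustrative only, and the import falls back to the standard library's ElementTree, which uses the same text/tail model as lxml. Text that follows a child's closing tag is stored on that child as `.tail`, not on the parent.

```python
try:
    from lxml import etree
except ImportError:
    # Assumption: the stdlib module is an acceptable stand-in here,
    # since it shares the .text / .tail model.
    import xml.etree.ElementTree as etree

td = etree.fromstring("<td> text1 <a> link </a> text2 </td>")
a = td[0]            # the embedded <a> element

leading = td.text    # text before the first child
inner = a.text       # text inside <a>
trailing = a.tail    # text after </a>, stored on the child

print(repr(leading), repr(inner), repr(trailing))
```

So `element.text` alone only ever covers the part up to the first child element.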
As a public service to people out there who may be as lazy as I am, here is some of the code from above that you can run.
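from lxml import etree

def get_text1(node):
    # Direct text only: node.text plus the tail of each child.
    result = node.text or ""
    for child in node:
        if child.tail is not None:
            result += child.tail
    return result

def get_text2(node):
    # Recursive: text of the node and all its descendants, plus the node's own tail.
    return ((node.text or '') +
            ''.join(map(get_text2, node)) +
            (node.tail or ''))

def get_text3(node):
    # Leading text plus each child serialized as markup (each child's tail included).
    return (node.text or "") + "".join(
        [etree.tostring(child, encoding="unicode") for child in node.iterchildren()])

root = etree.fromstring(u"<td> text1 <a> link </a> text2 </td>")

# On Python 3, encoding="unicode" makes tostring return str rather than bytes.
print(root.xpath("text()"))
print(get_text1(root))
print(get_text2(root))
print(root.xpath("string()"))
print(etree.tostring(root, method="text", encoding="unicode"))
print(etree.tostring(root, method="xml", encoding="unicode"))
print(get_text3(root))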
The output is:
snowy:rpg$ python test.py
[' text1 ', ' text2 ']
text1 text2
text1 link text2
text1 link text2
text1 link text2
<td> text1 <a> link </a> text2 </td>
text1 <a> link </a> text2
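For the "all text, including descendants" case, another option is the element's `itertext()` method. The sketch below is only an illustration: the helper name `all_text` is made up here, and the import falls back to the standard library's ElementTree if lxml is not installed.

```python
try:
    from lxml import etree
except ImportError:
    # Assumption: the stdlib module is an acceptable stand-in here.
    import xml.etree.ElementTree as etree

def all_text(node):
    """Join the text of node and all its descendants in document order."""
    return "".join(node.itertext())

root = etree.fromstring("<td> text1 <a> link </a> text2 </td>")
flat = all_text(root)

print(repr(flat))
print(flat.split())  # whitespace-separated words, if spacing does not matter
```

Unlike the recursive version that appends `node.tail`, this does not pick up the tail of the starting element itself, which matters when the function is called on a node in the middle of a larger tree.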
Looks like an lxml bug to me, but it is by design if you read the documentation. I solved it like this:
def node_text(node):
    if node.text:
        result = node.text
    else:
        result = ''
    for child in node:
        if child.tail is not None:
            result += child.tail
    return result
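One thing worth keeping in mind with this approach is that it collects only the element's own text, not text inside its children. A small sketch of that behaviour follows; `direct_text` is a hypothetical compact equivalent of the function above, the sample markup is invented for illustration, and the import falls back to the standard library's ElementTree if lxml is not installed.

```python
try:
    from lxml import etree
except ImportError:
    # Assumption: the stdlib module is an acceptable stand-in here.
    import xml.etree.ElementTree as etree

def direct_text(node):
    """Text directly inside node: its .text plus the .tail of each child."""
    parts = [node.text or ""]
    parts.extend(child.tail or "" for child in node)
    return "".join(parts)

td = etree.fromstring("<td>a<b>skip<i>deep</i>x</b>c</td>")

own = direct_text(td)          # text belonging to <td> itself
own_of_b = direct_text(td[0])  # text belonging to <b> itself

print(own, own_of_b)
```

If the text inside nested elements is wanted as well, `string()` in XPath or `tostring(..., method="text")` from the earlier answer is the better fit.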