Ali*_*cia 8 python lxml elementtree
Suppose I have the following XML document:
<species>
Mammals: <dog/> <cat/>
Reptiles: <snake/> <turtle/>
Birds: <seagull/> <owl/>
</species>
Then I get the species element like this:
import lxml.etree
doc = lxml.etree.fromstring(xml)
species = doc.xpath('/species')[0]
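Now I want to print a list of animals grouped by species. How can I do that using the ElementTree API?
If you enumerate all of the child nodes, you will see a text node containing the class name, followed by the element nodes for the species in that class: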
>>> for node in species.xpath("child::node()"):
...     print type(node), node
...
<class 'lxml.etree._ElementStringResult'>
Mammals:
<type 'lxml.etree._Element'> <Element dog at 0xe0b3c0>
<class 'lxml.etree._ElementStringResult'>
<type 'lxml.etree._Element'> <Element cat at 0xe0b410>
<class 'lxml.etree._ElementStringResult'>
Reptiles:
<type 'lxml.etree._Element'> <Element snake at 0xe0b460>
<class 'lxml.etree._ElementStringResult'>
<type 'lxml.etree._Element'> <Element turtle at 0xe0b4b0>
<class 'lxml.etree._ElementStringResult'>
Birds:
<type 'lxml.etree._Element'> <Element seagull at 0xe0b500>
<class 'lxml.etree._ElementStringResult'>
<type 'lxml.etree._Element'> <Element owl at 0xe0b550>
<class 'lxml.etree._ElementStringResult'>
So you can build it from there:
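my_species = {}
current_class = None
for node in species.xpath("child::node()"):
    # Written for Python 2; under Python 3 lxml returns text results as
    # lxml.etree._ElementUnicodeResult (a str subclass) instead.
    if isinstance(node, lxml.etree._ElementStringResult):
        text = node.strip(' \n\t:')
        if text:
            # A non-empty text node starts a new class such as "Mammals"
            current_class = my_species.setdefault(text, [])
    elif isinstance(node, lxml.etree._Element):
        if current_class is not None:
            current_class.append(node.tag)
print(my_species)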
The result is
{'Mammals': ['dog', 'cat'], 'Reptiles': ['snake', 'turtle'], 'Birds': ['seagull', 'owl']}
This is all fragile, though: small changes in how the text nodes are arranged could break the parse.
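@tdelaney's answer is basically right, but I want to point out one nuance of the Python ElementTree API. Here is a quote from the lxml tutorial:
Elements can contain text:
<root>TEXT</root>
In many XML documents (data-centric documents), this is the only place where text can be found. It is encapsulated by a leaf tag at the very bottom of the tree hierarchy.
However, if XML is used for tagged text documents such as (X)HTML, text can also appear between different elements, right in the middle of the tree:
<html><body>Hello<br/>World</body></html>
Here, the <br/> tag is surrounded by text. This is often referred to as document-style or mixed-content XML. Elements support this through their tail property. It contains the text that directly follows the element, up to the next element in the XML tree. The two properties text and tail are enough to represent any text content in an XML document. This way, the ElementTree API does not require any special text nodes in addition to the Element class, which tend to get in the way fairly often (as you might know from classic DOM APIs).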
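As a quick illustration of the quoted passage, here is a minimal sketch that uses the standard library's xml.etree.ElementTree; lxml.etree exposes the same text and tail attributes. The variable names body and br are chosen only for this example.

```python
import xml.etree.ElementTree as ET

# The mixed-content document from the tutorial quote
root = ET.fromstring("<html><body>Hello<br/>World</body></html>")
body = root.find("body")
br = body.find("br")

print(repr(body.text))  # text before the first child of <body>
print(repr(br.text))    # <br/> has no text of its own
print(repr(br.tail))    # text that directly follows <br/>
```

Note that "World" is not stored on body at all; it is reachable only through the tail of the br element.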
With these properties in mind, it is possible to retrieve the document text without forcing the tree to output text nodes.
#!/usr/bin/env python3
import itertools
from pprint import pprint

try:
    from lxml import etree
except ImportError:
    # cElementTree was removed in Python 3.9; ElementTree offers the same API
    from xml.etree import ElementTree as etree


def textAndElement(node):
    '''In py33+ recursive generators are easy'''
    yield node
    text = node.text.strip() if node.text else None
    if text:
        yield text
    for child in node:
        yield from textAndElement(child)
    tail = node.tail.strip() if node.tail else None
    if tail:
        yield tail


if __name__ == '__main__':
    xml = '''
<species>
Mammals: <dog/> <cat/>
Reptiles: <snake/> <turtle/>
Birds: <seagull/> <owl/>
</species>
'''
    doc = etree.fromstring(xml)
    pprint(list(textAndElement(doc)))
    #[<Element species at 0x7f2c538727d0>,
    #'Mammals:',
    #<Element dog at 0x7f2c538728c0>,
    #<Element cat at 0x7f2c53872910>,
    #'Reptiles:',
    #<Element snake at 0x7f2c53872960>,
    #<Element turtle at 0x7f2c538729b0>,
    #'Birds:',
    #<Element seagull at 0x7f2c53872a00>,
    #<Element owl at 0x7f2c53872a50>]
    gen = textAndElement(doc)
    next(gen)  # skip root
    groups = []
    # groupby(type) splits the stream into alternating runs of strings and elements
    for _, g in itertools.groupby(gen, type):
        groups.append(tuple(g))
    # zip(*[iter(groups)] * 2) pairs each run of labels with the run of elements after it
    pprint(dict(zip(*[iter(groups)] * 2)))
    #{('Birds:',): (<Element seagull at 0x7fc37f38aaa0>,
    # <Element owl at 0x7fc37f38a820>),
    #('Mammals:',): (<Element dog at 0x7fc37f38a960>,
    # <Element cat at 0x7fc37f38a9b0>),
    #('Reptiles:',): (<Element snake at 0x7fc37f38aa00>,
    # <Element turtle at 0x7fc37f38aa50>)}
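If only the tag names are needed, as in the dictionary shown in the first answer, the same text/tail idea can be applied without a generator. The following is a minimal sketch under some assumptions: it uses the standard library's xml.etree.ElementTree, the helper name group_by_label is made up for this example, every label is assumed to sit in the parent's text or in a child's tail and to end with a colon, and Python 3.7+ is assumed so that the dict keeps insertion order.

```python
import xml.etree.ElementTree as ET

XML = """<species>
  Mammals: <dog/> <cat/>
  Reptiles: <snake/> <turtle/>
  Birds: <seagull/> <owl/>
</species>"""


def group_by_label(parent):
    """Map each text label to the tags of the child elements that follow it.

    Hypothetical helper: labels come from parent.text and from each child's
    tail; whitespace-only text between siblings is ignored.
    """
    groups = {}
    label = (parent.text or "").strip(" \n\t:")
    current = groups.setdefault(label, []) if label else None
    for child in parent:
        if current is not None:
            current.append(child.tag)
        # Non-empty text after this child starts the next group
        label = (child.tail or "").strip(" \n\t:")
        if label:
            current = groups.setdefault(label, [])
    return groups


species = ET.fromstring(XML)
print(group_by_label(species))
# {'Mammals': ['dog', 'cat'], 'Reptiles': ['snake', 'turtle'], 'Birds': ['seagull', 'owl']}
```

Like the approach in the first answer, this still depends on where the labels sit in the document, so it is only as robust as that layout.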