sgp*_*sgp 9 python xml parsing elementtree python-2.7
I want to parse a fairly large XML-like file that has no root element. The file has the following format:
<tag1>
<tag2>
</tag2>
</tag1>
<tag1>
<tag3/>
</tag1>
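I tried using ElementTree, but it returned a "no root" error. Is there another Python library that can parse this file? Thanks in advance! :)
PS: I tried adding an extra tag to wrap the whole file and then parsing it with ElementTree. However, I would like a more efficient approach that does not require altering the original XML file.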
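To make the problem concrete, here is a minimal sketch that reproduces it with an in-memory copy of the sample above. The names `SAMPLE` and `parses_as_is` are made up for illustration, and the exact error text varies by Python version, so the sketch only checks whether a `ParseError` is raised.

```python
import xml.etree.ElementTree as ET

# Two top-level <tag1> elements and no single root, as in the question.
SAMPLE = "<tag1>\n<tag2>\n</tag2>\n</tag1>\n<tag1>\n<tag3/>\n</tag1>\n"


def parses_as_is(text):
    """Return True if ElementTree accepts the text unchanged (hypothetical helper)."""
    try:
        ET.fromstring(text)
    except ET.ParseError:
        return False
    return True


rootless_ok = parses_as_is(SAMPLE)                         # no single root
wrapped_ok = parses_as_is("<root>" + SAMPLE + "</root>")   # wrapped in memory
print(rootless_ok, wrapped_ok)
```

The wrapping here happens on the string in memory only; nothing is written back to a file.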
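ElementTree.fromstringlist accepts an iterable that yields strings.
Use it with itertools.chain:
import itertools
import xml.etree.ElementTree as ET
# import xml.etree.cElementTree as ET
with open('xml-like-file.xml') as f:
    # Wrap the file's lines in a synthetic root without touching the file.
    it = itertools.chain('<root>', f, '</root>')
    # Parse inside the `with` block: `it` reads from `f` lazily.
    root = ET.fromstringlist(it)
# Do something with `root`
root.find('.//tag3')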
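As a self-contained variant of the same idea, the sketch below uses an `io.StringIO` object in place of the real file so it can be run as is. The helper name `parse_rootless` and the `wrapper` parameter are assumptions introduced for illustration, not part of ElementTree.

```python
import io
import itertools
import xml.etree.ElementTree as ET

SAMPLE = u"<tag1>\n<tag2>\n</tag2>\n</tag1>\n<tag1>\n<tag3/>\n</tag1>\n"


def parse_rootless(lines, wrapper="root"):
    """Parse an iterable of text chunks as children of a synthetic wrapper element."""
    chunks = itertools.chain(["<%s>" % wrapper], lines, ["</%s>" % wrapper])
    return ET.fromstringlist(chunks)


# io.StringIO stands in for an open file; iterating it yields lines.
root = parse_rootless(io.StringIO(SAMPLE))
top_level_tags = [child.tag for child in root]
print(root.tag, top_level_tags)
```

Passing the wrapper tags as one-item lists feeds each tag to the parser in a single piece; passing bare strings to `itertools.chain` also works, but feeds them one character at a time.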
lxml.html can parse fragments:
from lxml import html
s = """<tag1>
<tag2>
</tag2>
</tag1>
<tag1>
<tag3/>
</tag1>"""
doc = html.fromstring(s)
# Print each top-level element, then its direct children.
for thing in doc:
    print(thing)
    for other in thing:
        print(other)
"""
>>>
<Element tag1 at 0x3411a80>
<Element tag2 at 0x3428990>
<Element tag1 at 0x3428930>
<Element tag3 at 0x3411a80>
>>>
"""
Courtesy of this answer.
If there are multiple levels of nesting:
def flatten(nested):
    """Recursively flatten nested elements.

    Yields individual elements.
    """
    for thing in nested:
        yield thing
        for other in flatten(thing):
            yield other
doc = html.fromstring(s)
for thing in flatten(doc):
    print(thing)
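The recursive generator above visits descendants in document order, which is also what the built-in `Element.iter()` does (except that `iter()` yields the starting element first). A small sketch using the standard library's ElementTree rather than lxml, with a made-up `<deep/>` element added to show a third level of nesting:

```python
import xml.etree.ElementTree as ET


def flatten(nested):
    """Recursively yield every descendant of `nested` in document order."""
    for thing in nested:
        yield thing
        for other in flatten(thing):
            yield other


# <deep/> is a hypothetical extra element, added to get three nesting levels.
root = ET.fromstring(
    "<root><tag1><tag2><deep/></tag2></tag1><tag1><tag3/></tag1></root>"
)

via_flatten = [el.tag for el in flatten(root)]
via_iter = [el.tag for el in root.iter()][1:]  # drop the root itself
print(via_flatten)
print(via_iter)
```

If the starting element itself is not needed, slicing it off as shown makes the two traversals interchangeable for this input.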
Similarly, lxml.etree.HTML will parse this. It adds html and body tags:
from lxml import etree
d = etree.HTML(s)
# iter() walks the whole tree, including the added html and body elements.
for thing in d.iter():
    print(thing)
"""
<Element html at 0x3233198>
<Element body at 0x322fcb0>
<Element tag1 at 0x3233260>
<Element tag2 at 0x32332b0>
<Element tag1 at 0x322fcb0>
<Element tag3 at 0x3233148>
"""
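Instead of editing the file, how about doing something like this?
import xml.etree.ElementTree as ET
# open() works on both Python 2 and 3; file() exists only on Python 2.
with open("xml-file.xml") as f:
    xml_object = ET.fromstringlist(["<root>", f.read(), "</root>"])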
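Reading the whole file with `f.read()` and building one tree holds everything in memory at once, which may matter for the large file mentioned in the question. The following is a rough sketch of an incremental alternative, assuming Python 3.4 or later (where `xml.etree.ElementTree.XMLPullParser` is available; it does not exist on Python 2.7). The function name `iter_top_level` and the `wrapper` parameter are hypothetical.

```python
import io
import itertools
import xml.etree.ElementTree as ET

SAMPLE = "<tag1>\n<tag2>\n</tag2>\n</tag1>\n<tag1>\n<tag3/>\n</tag1>\n"


def iter_top_level(lines, wrapper="root"):
    """Yield each top-level element of a rootless document, one at a time."""
    parser = ET.XMLPullParser(events=("start", "end"))
    chunks = itertools.chain(["<%s>" % wrapper], lines, ["</%s>" % wrapper])
    depth = 0
    done = False
    while not done:
        chunk = next(chunks, None)
        if chunk is None:
            parser.close()  # no more input; pick up any remaining events
            done = True
        else:
            parser.feed(chunk)
        for event, elem in parser.read_events():
            if event == "start":
                depth += 1
            else:
                depth -= 1
                if depth == 1:  # a direct child of the wrapper just closed
                    yield elem
                    elem.clear()  # drop its contents once the caller is done


# io.StringIO stands in for an open file; iterating it yields lines.
summaries = [
    (elem.tag, [child.tag for child in elem])
    for elem in iter_top_level(io.StringIO(SAMPLE))
]
print(summaries)
```

Each element is cleared right after the caller has handled it, so it must be processed inside the loop rather than stored for later. The synthetic wrapper still keeps a reference to each emptied element, so memory use is reduced rather than constant.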