Tags: python, xml, validation, xpath, xsd
I need to read a large XML file (65 MB), validate it against an XSD, and then run XPath queries on it. Below is an lxml version. Running the query takes a long time (over 5 minutes), but validation seems very fast.

I have a couple of questions. How would a performance-conscious Python programmer write this program with lxml? And if lxml is not the right tool for the job, what is? Could you give a code snippet?
import sys
from datetime import datetime
from lxml import etree

start = datetime.now()
with open("library.xsd") as schema_file:
    schema = etree.XMLSchema(file=schema_file)
parser = etree.XMLParser(schema=schema)
with open(sys.argv[1], "r") as data_file:
    tree = etree.parse(data_file, parser)
root = tree.getroot()
end = datetime.now()
print("Parsing time =", end - start)

start = datetime.now()
name_list = root.xpath("book/author/name/text()")
print("Size of list =", len(name_list))
end = datetime.now()
print("Query time =", end - start)
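One lxml tuning worth trying before switching libraries: compile the expression once with etree.XPath and pass smart_strings=False, which skips the heavier "smart string" wrappers that text() results normally carry. A minimal sketch, using a small hypothetical document with the same book/author/name structure as the question:

```python
from lxml import etree

# Hypothetical sample data mirroring the question's library layout.
xml = b"""<library>
  <book><author><name>Alice</name></author></book>
  <book><author><name>Bob</name></author></book>
</library>"""
root = etree.fromstring(xml)

# Compile the XPath once; smart_strings=False returns plain strings
# instead of parent-tracking smart strings, which is cheaper on
# large result sets.
find_names = etree.XPath("book/author/name/text()", smart_strings=False)
names = find_names(root)
print(names)  # ['Alice', 'Bob']
```

The compiled object can be reused across many calls, so the expression is parsed only once rather than on every query.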
The lxml benchmarks were very useful. In my experience, extracting element nodes with XPath is fast, but extracting text can be slow. Below, I give three very fast solutions.
def bench_lxml_xpath_direct(root):  # Very slow, but very fast if text() is removed.
    name_list = root.xpath("book/author/name/text()")
    print("Size of list = " + str(len(name_list)))

def bench_lxml_xpath_loop(root):  # Fast
    name_list = root.xpath("book/author/name")
    result = []
    for n in name_list:
        result.append(n.text)
    print("Size of list = " + str(len(name_list)))

def bench_lxml_getiterator(tree):  # Very fast
    result = []
    for name in tree.getiterator("name"):
        result.append(name.text)
    print("Size of list = " + str(len(result)))

def bench_lxml_findall(tree):  # Superfast
    result = []
    for name in tree.findall("//name"):
        result.append(name.text)
    print("Size of list = " + str(len(result)))
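If memory rather than raw speed becomes the constraint on a 65 MB file, a streaming pass with etree.iterparse avoids building the full tree at all. A sketch under the same assumed document structure (reading from an in-memory buffer here in place of the real file):

```python
import io
from lxml import etree

# Hypothetical sample data standing in for the large library file.
xml = b"""<library>
  <book><author><name>Alice</name></author></book>
  <book><author><name>Bob</name></author></book>
</library>"""

result = []
# iterparse yields each <name> element as its end tag is seen;
# clearing the element afterwards keeps memory usage flat even
# on very large inputs.
for event, elem in etree.iterparse(io.BytesIO(xml), tag="name"):
    result.append(elem.text)
    elem.clear()
print("Size of list =", len(result))
```

Note that schema validation can still be combined with this approach, since iterparse also accepts a schema argument in lxml.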