Python SAX to lxml for an 80+ GB XML file

Nic*_*ick 11 python lxml sax

How do you read an XML file using SAX and convert it to an lxml etree.iterparse element?

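To give an overview of the problem: I built an XML ingestion tool using lxml for an XML feed that ranges from 25 to 500 MB and needs to be ingested daily, but it also needs to perform a one-time ingestion of a file that is 60 to 100 GB.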

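I chose lxml based on a specification stating that a node would not exceed 4-8 GB in size, which I thought would allow each node to be read into memory and cleared when finished.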

An overview of the code is below:

elements = etree.iterparse(
    self._source, events = ('end',)
)
for event, element in elements:
    finished = True
    if element.tag == 'Artist-Types':
        self.artist_types(element)

def artist_types(self, element):
    """
    Imports artist types

    :param list element: etree.Element
    :returns boolean:
    """
    self._log.info("Importing Artist types")
    count = 0
    for child in element:
        failed = False
        fields = self._getElementFields(child, (
            ('id', 'Id'),
            ('type_code', 'Type-Code'),
            ('created_date', 'Created-Date')
        ))
        if self._type is IMPORT_INC and has_artist_type(fields['id']):
            if update_artist_type(fields['id'], fields['type_code']):
                count = count + 1
            else:
                failed = True
        else:
            if create_artist_type(fields['type_code'],
                fields['created_date'], fields['id']):
                count = count + 1
            else:
                failed = True
        if failed:
            self._log.error("Failed to import artist type %s %s" %
                (fields['id'], fields['type_code'])
            )
    self._log.info("Imported %d Artist Types Records" % count)
    self._artist_type_count = count
    self._cleanup(element)
    del element

Please let me know if I can add any kind of clarification.

Fra*_*ila 25

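iterparse is an iterative parser. It emits Element objects and events, and it incrementally builds the entire Element tree as it parses, so eventually it holds the whole tree in memory.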
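To make that growth concrete, here is a minimal sketch (not part of the original answer) that parses a small in-memory document with no cleanup and then counts the children still attached to the root. The helper name count_retained_records and the sample document are made up for illustration.

```python
import io
from lxml import etree

def count_retained_records(xml_bytes):
    """Parse with iterparse, free nothing, and return the number of
    children still attached to the root element afterwards."""
    root = None
    context = etree.iterparse(io.BytesIO(xml_bytes), events=("end",))
    for _event, elem in context:
        # No cleanup on purpose. The last 'end' event belongs to the root.
        root = elem
    return len(root)

# Hypothetical sample: one root with 1000 small records.
sample = b"<records>" + b"<record><name>x</name></record>" * 1000 + b"</records>"
retained = count_retained_records(sample)
print(retained)
```

Every record is still reachable from the root when parsing finishes, which is the behavior that becomes a problem once the file is tens of gigabytes.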

However, it is easy to get bounded memory behavior: delete elements you no longer need as you parse them.

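The typical "giant XML" workload is a single root element with a huge number of child elements that represent records. I assume this is the kind of XML structure you are working with?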

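Usually it is enough to call clear() to empty the element you are processing. Your memory usage will grow a little, but not by much. If you have a really huge file, then even the empty Element objects will consume too much memory, and in that case you must also delete previously seen Element objects. Note that you cannot safely delete the current element. The lxml.etree.iterparse documentation describes this technique.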

In this case, you process a record every time a </record> is found, and then you delete all previous record elements.

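Below is an example using an infinitely long XML document. It prints the process's memory usage as it parses. Note that the memory usage stays stable and does not keep growing.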

from lxml import etree
import resource

class InfiniteXML(object):
    """File-like object that produces an endless stream of <record> elements."""
    def __init__(self):
        self._root = True

    def read(self, size=None):
        # Return bytes so the parser applies the declared encoding itself.
        if self._root:
            self._root = False
            return b"<?xml version='1.0' encoding='US-ASCII'?><records>\n"
        return b"""<record>\n\t<ancestor attribute="value">text value</ancestor>\n</record>\n"""

def parse(fp):
    context = etree.iterparse(fp, events=('end',))
    for action, elem in context:
        if elem.tag != 'record':
            # leave child elements intact until their <record> has been processed
            continue

        # processing goes here

        # memory usage (peak RSS; kilobytes on Linux, bytes on macOS)
        print(resource.getrusage(resource.RUSAGE_SELF).ru_maxrss)

        # cleanup
        # first, empty the current element; this is not absolutely necessary
        # if you are also deleting siblings, but it frees memory earlier
        elem.clear()
        # second, delete previous siblings (records)
        while elem.getprevious() is not None:
            del elem.getparent()[0]
        # make sure you keep no references to Element objects outside the loop

parse(InfiniteXML())
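For a finite, checkable variant of the same cleanup pattern, the sketch below wraps it in a generator. It is an illustration under assumptions, not part of the original answer: iter_records and the sample data are hypothetical, and it relies on the tag argument of lxml's iterparse so that only <record> end events are delivered, which keeps each record's children intact until the caller has seen them.

```python
import io
from lxml import etree

def iter_records(source, tag="record"):
    """Yield each matching element, then free it and its earlier siblings."""
    for _event, elem in etree.iterparse(source, events=("end",), tag=tag):
        yield elem  # the caller processes the record here
        # Cleanup runs when the caller asks for the next record.
        elem.clear()
        while elem.getprevious() is not None:
            del elem.getparent()[0]

# Hypothetical sample: 1000 records with sequential ids.
sample = (
    b"<records>"
    + b"".join(b"<record><id>%d</id></record>" % i for i in range(1000))
    + b"</records>"
)

ids = []
parent = None
for record in iter_records(io.BytesIO(sample)):
    ids.append(int(record.findtext("id")))
    parent = record.getparent()

# Only the last (already cleared) record is still attached to the root.
remaining = len(parent)
print(len(ids), remaining)
```

The caller should not keep references to yielded elements, since each one is cleared as soon as the next record is requested. Because iterparse reads its source in chunks, the parser may already have built some elements beyond the current record, so the tree stays small but is not necessarily limited to a single record at every instant.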