Using lxml and iterparse() to parse a big (+- 1Gb) XML file

Question


mvi*_*ime 14 python xml parsing lxml iterparse

I have to parse a 1Gb XML file with a structure like the one below and extract the text within the "Author" and "Content" tags:

<Database>
    <BlogPost>
        <Date>MM/DD/YY</Date>
        <Author>Last Name, Name</Author>
        <Content>Lorem ipsum dolor sit amet, consectetur adipiscing elit. Maecenas dictum dictum vehicula.</Content>
    </BlogPost>

    <BlogPost>
        <Date>MM/DD/YY</Date>
        <Author>Last Name, Name</Author>
        <Content>Lorem ipsum dolor sit amet, consectetur adipiscing elit. Maecenas dictum dictum vehicula.</Content>
    </BlogPost>

    [...]

    <BlogPost>
        <Date>MM/DD/YY</Date>
        <Author>Last Name, Name</Author>
        <Content>Lorem ipsum dolor sit amet, consectetur adipiscing elit. Maecenas dictum dictum vehicula.</Content>
    </BlogPost>
</Database>


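So far I have tried two things: i) reading the whole file and going through it with .find(xmltag), and ii) parsing the XML file with lxml and iterparse(). The first option works, but it is very slow. The second option I haven't managed to get working at all.

Here is part of what I have: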

for event, element in etree.iterparse(path_to_file, tag="BlogPost"):
    if element.tag == "BlogPost":
        print element.text
    else:
        print 'Finished'


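The result is only blank spaces, with no text in them.

I must be doing something wrong, but I can't figure out what. Also, in case it wasn't obvious enough, I am quite new to Python and this is the first time I am using lxml. Please help!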

Answer 1

and*_*oke 24

from lxml import etree

for event, element in etree.iterparse(path_to_file, tag="BlogPost"):
    for child in element:
        print(child.tag, child.text)
    element.clear()  # drop the element's children and text once processed


The final clear() will stop you from using too much memory.
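For context, the loop in the question printed only whitespace because `element.text` on a `<BlogPost>` is just the newline and indentation before its first child; the actual text lives in the `<Author>` and `<Content>` children. Below is a minimal sketch of pulling out just those two fields. To keep it runnable on its own, it uses the standard library's `xml.etree.ElementTree` and a tiny in-memory sample rather than lxml and the real file; the `iter_posts` name and the sample data are illustrative assumptions. With lxml, the `tag="BlogPost"` argument shown above would take the place of the explicit tag check.

```python
import io
import xml.etree.ElementTree as ET

# Tiny in-memory stand-in for the 1 GB file (illustrative data only).
SAMPLE = b"""<Database>
    <BlogPost>
        <Date>01/02/03</Date>
        <Author>Doe, Jane</Author>
        <Content>First post.</Content>
    </BlogPost>
    <BlogPost>
        <Date>04/05/06</Date>
        <Author>Roe, Richard</Author>
        <Content>Second post.</Content>
    </BlogPost>
</Database>"""


def iter_posts(source):
    """Yield (author, content) for each <BlogPost>, one at a time."""
    for event, element in ET.iterparse(source, events=("end",)):
        if element.tag != "BlogPost":
            continue  # children such as <Author> end before their parent
        # findtext() returns the text of the first matching child, or None.
        yield element.findtext("Author"), element.findtext("Content")
        element.clear()  # discard the children we have already read


posts = list(iter_posts(io.BytesIO(SAMPLE)))
for author, content in posts:
    print(author, "|", content)
```

Because `iter_posts` is a generator, only one record needs to be held at a time when the caller consumes it lazily instead of building a list.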

[Update:] To get "everything in between as a string", I guess you want one of:

for event, element in etree.iterparse(path_to_file, tag="BlogPost"):
    # encoding="unicode" makes tostring() return str rather than bytes
    print(etree.tostring(element, encoding="unicode"))
    element.clear()


or

for event, element in etree.iterparse(path_to_file, tag="BlogPost"):
    print(''.join(etree.tostring(child, encoding="unicode") for child in element))
    element.clear()


or maybe even:

for event, element in etree.iterparse(path_to_file, tag="BlogPost"):
    # child.text is None for empty elements, so fall back to ''
    print(''.join(child.text or '' for child in element))
    element.clear()


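I also had to parse a 1.8 GB XML file and used the same clear() call to clear the elements, but clear() does not actually remove the element from memory, so you end up with the root holding all the emptied elements in memory as well. So I used the "del" statement to delete each element after parsing it, which helped me free memory. Read http://effbot.org/zone/element-iterparse.htm#incremental-parsing to understand what exactly is happening. (3 upvotes)
Should `element.close()` in the later snippets be `element.clear()`? It has been so long since I wrote this that I no longer remember, but it looks wrong to me. (2 upvotes)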
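The idea in the first comment can be sketched roughly as follows: besides clearing each processed element, detach it from the root so that the root does not accumulate empty children. This is not the commenter's actual code. It is a sketch that uses the standard library's `xml.etree.ElementTree`, assumes every record is a direct child of the root, and uses `root.remove(element)` to the same effect as a `del` on the root's child list; the `count_posts` helper and the sample data are made up for illustration.

```python
import io
import xml.etree.ElementTree as ET

# Three minimal records (illustrative data only).
SAMPLE = b"<Database>" + b"<BlogPost><Author>A</Author></BlogPost>" * 3 + b"</Database>"


def count_posts(source):
    """Process each <BlogPost>, then detach it from the root element."""
    context = ET.iterparse(source, events=("start", "end"))
    _, root = next(context)  # the first event is the start of the root element
    seen = 0
    for event, element in context:
        if event == "end" and element.tag == "BlogPost":
            seen += 1             # real processing would go here
            element.clear()       # empty the element itself
            root.remove(element)  # and drop it from the root's child list
    return seen, len(root)


seen, remaining = count_posts(io.BytesIO(SAMPLE))
print(seen, remaining)
```

The second value returned is the number of children still attached to the root once parsing has finished, which is what keeps growing when only `clear()` is used.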

Answer 2

dav*_*ing 14

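For future searchers: the top answer here suggests clearing the element on every iteration, but that still leaves you with an ever-growing set of empty elements that will slowly build up in memory: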

for event, element in etree.iterparse(path_to_file, tag="BlogPost"):
    for child in element:
        print(child.tag, child.text)
    element.clear()


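^ This is not a scalable solution, especially as your source file gets larger and larger. The better solution is to get the root element and clear it every time you have loaded a complete record. This will keep memory usage pretty stable (below 20MB, I would say).

Here is a solution that doesn't require looking for a specific tag. This function returns a generator that yields all first-level child nodes (e.g. <BlogPost> elements) underneath the root node (e.g. <Database>). It does this by recording the start of the first tag after the root node, then waiting for the corresponding end tag, yielding the entire element, and then clearing the root node.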

from lxml import etree

xmlfile = '/path/to/xml/file.xml'

def iterate_xml(xmlfile):
    doc = etree.iterparse(xmlfile, events=('start', 'end'))
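    _, root = next(doc)  # the first event is the start of the root element
    start_tag = None
    for event, element in doc:
        if event == 'start' and start_tag is None:
            start_tag = element.tag  # remember the tag that opens this record
        if event == 'end' and element.tag == start_tag:
            yield element  # the record is complete; hand it to the caller
            start_tag = None
            root.clear()  # drop everything parsed so far from the root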

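One caveat with matching on the tag name, as the function above does: a record that contains a nested element with the same tag as itself would be treated as finished when the inner element closes. A possible variant, sketched here under assumptions, tracks nesting depth instead of the tag name. It uses the standard library's `xml.etree.ElementTree` so that it runs on its own (the same events should be available from lxml's `iterparse`), and the `iter_records` name and the sample data are hypothetical.

```python
import io
import xml.etree.ElementTree as ET


def iter_records(source):
    """Yield each direct child of the root, clearing the root after each one."""
    context = ET.iterparse(source, events=("start", "end"))
    _, root = next(context)   # consume the root's start event
    depth = 0                 # nesting level below the root
    for event, element in context:
        if event == "start":
            depth += 1
        else:
            depth -= 1
            if depth == 0:    # a direct child of the root just closed
                yield element
                root.clear()  # detach everything parsed so far from the root


# Illustrative data: the second record nests an element with its own tag.
SAMPLE = b"""<Database>
    <BlogPost><Author>Doe, Jane</Author><Content>Hi</Content></BlogPost>
    <Note><Note>nested, same tag</Note></Note>
</Database>"""

tags = [record.tag for record in iter_records(io.BytesIO(SAMPLE))]
print(tags)
```

Each yielded element should be used before the next one is requested, since the root is cleared as soon as the generator resumes.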

Answer 3

Lev*_*sky 5

I prefer XPath for such things:

In [1]: from lxml.etree import parse

In [2]: tree = parse('/tmp/database.xml')

In [3]: for post in tree.xpath('/Database/BlogPost'):
   ...:     print('Author:', post.xpath('Author')[0].text)
   ...:     print('Content:', post.xpath('Content')[0].text)
   ...: 
Author: Last Name, Name
Content: Lorem ipsum dolor sit amet, consectetur adipiscing elit. Maecenas dictum dictum vehicula.
Author: Last Name, Name
Content: Lorem ipsum dolor sit amet, consectetur adipiscing elit. Maecenas dictum dictum vehicula.
Author: Last Name, Name
Content: Lorem ipsum dolor sit amet, consectetur adipiscing elit. Maecenas dictum dictum vehicula.


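I'm not sure, though, whether it makes any difference when it comes to processing large files. Comments on this would be appreciated.

Doing it your way,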

for event, element in etree.iterparse(path_to_file, tag="BlogPost"):
    for info in element.iter():
        if info.tag in ('Author', 'Content'):
            print(info.tag, ':', info.text)


I recently ran a comparison, and `iterparse` with `clear()` consumed **less memory** than just using `XPath`. (7 upvotes)

Archived: 13 years, 11 months ago
Views: 18033
Last activity: 7 years, 3 months ago