Tags: python, xml, gzip, elementtree
I'm trying to parse the following feed into an ElementTree in Python: "http://smarkets.s3.amazonaws.com/oddsfeed.xml" (warning: large file)
Here is what I have tried so far:
import urllib
import StringIO
import gzip
import xml.etree.ElementTree as ET

feed = urllib.urlopen("http://smarkets.s3.amazonaws.com/oddsfeed.xml")
# feed is compressed
compressed_data = feed.read()
compressedstream = StringIO.StringIO(compressed_data)
gzipper = gzip.GzipFile(fileobj=compressedstream)
data = gzipper.read()
# Parse XML
tree = ET.parse(data)
but it seems to just hang on compressed_data = feed.read(), possibly forever? (I know it's a large file, but it takes far longer than the other, uncompressed feeds I've parsed, and a delay that long would wipe out any bandwidth gain from the gzip compression anyway.)
Next I tried requests:
url = "http://smarkets.s3.amazonaws.com/oddsfeed.xml"
headers = {'accept-encoding': 'gzip, deflate'}
r = requests.get(url, headers=headers, stream=True)
but now
tree=ET.parse(r.content)
or
tree=ET.parse(r.text)
both raise exceptions.
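For what it's worth, ET.parse() expects a filename or a file-like object, not raw bytes or a string of XML, which is likely why those calls raise. A minimal sketch (the sample XML below is made up, standing in for a decoded r.content):

```python
import io
import xml.etree.ElementTree as ET

# Hypothetical decoded response body, standing in for r.content
content = b"<odds><event id='1'/><event id='2'/></odds>"

# ET.parse() wants a filename or file object, so wrap the bytes:
tree = ET.parse(io.BytesIO(content))
root = tree.getroot()

# Alternatively, ET.fromstring() parses a string/bytes directly:
root2 = ET.fromstring(content)

print(root.tag, len(root.findall('event')))  # odds 2
```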
What is the correct way to do this?
You can pass the value returned by urlopen() directly to GzipFile(), and in turn pass that to an ElementTree method such as iterparse():
#!/usr/bin/env python3
import xml.etree.ElementTree as etree
from gzip import GzipFile
from urllib.request import urlopen, Request

with urlopen(Request("http://smarkets.s3.amazonaws.com/oddsfeed.xml",
                     headers={"Accept-Encoding": "gzip"})) as response, \
     GzipFile(fileobj=response) as xml_file:
    for elem in getelements(xml_file, 'interesting_tag'):
        process(elem)
where getelements() allows parsing files that do not fit in memory:
def getelements(filename_or_file, tag):
    """Yield *tag* elements from *filename_or_file* xml incrementally."""
    context = iter(etree.iterparse(filename_or_file, events=('start', 'end')))
    _, root = next(context)  # get root element
    for event, elem in context:
        if event == 'end' and elem.tag == tag:
            yield elem
            root.clear()  # free memory
To keep memory usage bounded, the constructed xml tree is cleared after each *tag* element.
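As a self-contained check of the approach, the generator can be exercised against a small gzipped feed built in memory instead of the live URL (the sample XML and the 'event' tag here are made up; the definition is repeated so the snippet runs on its own):

```python
import gzip
import io
import xml.etree.ElementTree as etree

def getelements(filename_or_file, tag):
    """Yield *tag* elements from *filename_or_file* xml incrementally."""
    context = iter(etree.iterparse(filename_or_file, events=('start', 'end')))
    _, root = next(context)  # get root element
    for event, elem in context:
        if event == 'end' and elem.tag == tag:
            yield elem
            root.clear()  # free memory

# A tiny gzipped feed built in memory, standing in for the real download:
xml_bytes = b"<feed><event id='1'/><event id='2'/><other/></feed>"
buf = io.BytesIO(gzip.compress(xml_bytes))

# GzipFile decompresses on the fly, exactly as with the urlopen() response:
with gzip.GzipFile(fileobj=buf) as xml_file:
    ids = [elem.get('id') for elem in getelements(xml_file, 'event')]

print(ids)  # ['1', '2']
```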