Vin*_*iri 17 python sql out-of-memory pickle
The problem I'm running into is that I have a very large pickle file (2.6 GB) that I'm trying to open, but every time I do I get a memory error. I realize now that I should have used a database to store all the information, but it's too late for that now. The pickle file contains dates and text from the US congressional record, scraped from the internet (the scrape took about 2 weeks to run). Is there any way to access the information I incrementally dumped into the pickle file, or to convert the pickle file into an SQL database or something else, so that I don't have to re-enter all the data? I really don't want to spend another two weeks re-crawling the congressional record and feeding the data into a database.
Thanks very much for the help.
Edit:
Code showing how the objects were pickled:
def save_objects(objects):
    with open('objects.pkl', 'wb') as output:
        pickle.dump(objects, output, pickle.HIGHEST_PROTOCOL)

def Main():
    Links()
    file = open("datafile.txt", "w")
    objects = []
    with open('links2.txt', 'rb') as infile:
        for link in infile:
            print link
            title, text, date = Get_full_text(link)
            article = Doccument(title, date, text)
            if text != None:
                write_to_text(date, text)
                objects.append(article)
            save_objects(objects)
Here is the program that produces the error:
def Main():
    file = open('objects1.pkl', 'rb')
    object = pickle.load(file)
vge*_*gel 32
Looks like you're in a bit of a pickle! ;-). Hopefully after this, you'll NEVER USE PICKLE EVER again. It's just not a very good data storage format.
Anyway, for this answer I'm assuming your Document class looks a bit like the one below. If not, comment with your actual Document class:
class Document(object): # <-- The object part is very important! If it's not there, the format is different!
    def __init__(self, title, date, text): # assuming all strings
        self.title = title
        self.date = date
        self.text = text
Anyway, I made some simple test data with this class:
d = [Document(title='foo', text='foo is good', date='1/1/1'),
     Document(title='bar', text='bar is better', date='2/2/2'),
     Document(title='baz', text='no one likes baz :(', date='3/3/3')]
Pickled it with protocol 2 (pickle.HIGHEST_PROTOCOL for Python 2.x):
>>> s = pickle.dumps(d, 2)
>>> s
'\x80\x02]q\x00(c__main__\nDocument\nq\x01)\x81q\x02}q\x03(U\x04dateq\x04U\x051/1/1q\x05U\x04textq\x06U\x0bfoo is goodq\x07U\x05titleq\x08U\x03fooq\tubh\x01)\x81q\n}q\x0b(h\x04U\x052/2/2q\x0ch\x06U\rbar is betterq\rh\x08U\x03barq\x0eubh\x01)\x81q\x0f}q\x10(h\x04U\x053/3/3q\x11h\x06U\x13no one likes baz :(q\x12h\x08U\x03bazq\x13ube.'
And disassembled it with pickletools:
>>> pickletools.dis(s)
0: \x80 PROTO 2
2: ] EMPTY_LIST
3: q BINPUT 0
5: ( MARK
6: c GLOBAL '__main__ Document'
25: q BINPUT 1
27: ) EMPTY_TUPLE
28: \x81 NEWOBJ
29: q BINPUT 2
31: } EMPTY_DICT
32: q BINPUT 3
34: ( MARK
35: U SHORT_BINSTRING 'date'
41: q BINPUT 4
43: U SHORT_BINSTRING '1/1/1'
50: q BINPUT 5
52: U SHORT_BINSTRING 'text'
58: q BINPUT 6
60: U SHORT_BINSTRING 'foo is good'
73: q BINPUT 7
75: U SHORT_BINSTRING 'title'
82: q BINPUT 8
84: U SHORT_BINSTRING 'foo'
89: q BINPUT 9
91: u SETITEMS (MARK at 34)
92: b BUILD
93: h BINGET 1
95: ) EMPTY_TUPLE
96: \x81 NEWOBJ
97: q BINPUT 10
99: } EMPTY_DICT
100: q BINPUT 11
102: ( MARK
103: h BINGET 4
105: U SHORT_BINSTRING '2/2/2'
112: q BINPUT 12
114: h BINGET 6
116: U SHORT_BINSTRING 'bar is better'
131: q BINPUT 13
133: h BINGET 8
135: U SHORT_BINSTRING 'bar'
140: q BINPUT 14
142: u SETITEMS (MARK at 102)
143: b BUILD
144: h BINGET 1
146: ) EMPTY_TUPLE
147: \x81 NEWOBJ
148: q BINPUT 15
150: } EMPTY_DICT
151: q BINPUT 16
153: ( MARK
154: h BINGET 4
156: U SHORT_BINSTRING '3/3/3'
163: q BINPUT 17
165: h BINGET 6
167: U SHORT_BINSTRING 'no one likes baz :('
188: q BINPUT 18
190: h BINGET 8
192: U SHORT_BINSTRING 'baz'
197: q BINPUT 19
199: u SETITEMS (MARK at 153)
200: b BUILD
201: e APPENDS (MARK at 5)
202: . STOP
Looks complicated! But it's really not that bad. pickle is basically a stack machine: each of the ALL_CAPS identifiers you see is an opcode, which manipulates an internal "stack" in some way to decode the data. This would matter more if we were trying to parse some complex structure, but luckily we're just making a simple list of essentially-tuples. All this "code" is doing is constructing a bunch of objects on the stack, and then pushing the entire stack into a list.
The one thing we do need to care about is the 'BINPUT'/'BINGET' opcodes you see scattered around. Basically, these are for "memoization", to reduce the data footprint: pickle saves a string with BINPUT <id>, and then if it comes up again, instead of re-dumping it, simply puts a BINGET <id> to retrieve it from the cache.
Also, another complication! There's more than just SHORT_BINSTRING - there's the regular BINSTRING for strings > 256 bytes, and also some fun unicode variants. I'll just assume that you're using Python 2 with all ASCII strings. Again, comment if this is not a correct assumption.
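As an aside, memoization is easy to observe for yourself. A tiny sketch (Python 3 syntax for convenience; the BINPUT/BINGET mechanism is the same) showing that a repeated string is dumped once and then fetched from the memo with BINGET:

```python
import io
import pickle
import pickletools

# The same string object appears twice in the list, so pickle
# stores it once (with a BINPUT) and references it the second
# time (with a BINGET) instead of re-dumping the bytes.
s = 'congressional record'
data = pickle.dumps([s, s], protocol=2)

out = io.StringIO()
pickletools.dis(data, out)
listing = out.getvalue()
print(listing)
```

In the disassembly, the string literal appears only once; the second occurrence in the list shows up as a BINGET referencing the memo slot.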
OK, so we need to stream the file until we hit a '\x81' byte (NEWOBJ). Then, we need to scan forward until we hit a '(' (MARK) character. Then, until we hit a 'u' (SETITEMS), we read key/value string pairs - there should be 3 pairs total, one for each field.
So, let's do that. Here's my script for reading the pickle data in a streaming fashion. It's far from perfect, since I just hacked it together for this answer, and you'll need to modify it a lot, but it's a good start.
pickledata = '\x80\x02]q\x00(c__main__\nDocument\nq\x01)\x81q\x02}q\x03(U\x04dateq\x04U\x051/1/1q\x05U\x04textq\x06U\x0bfoo is goodq\x07U\x05titleq\x08U\x03fooq\tubh\x01)\x81q\n}q\x0b(h\x04U\x052/2/2q\x0ch\x06T\x14\x05\x00\x00bar is betterbar is betterbar is betterbar is betterbar is betterbar is betterbar is betterbar is betterbar is betterbar is betterbar is betterbar is betterbar is betterbar is betterbar is betterbar is betterbar is betterbar is betterbar is betterbar is betterbar is betterbar is betterbar is betterbar is betterbar is betterbar is betterbar is betterbar is betterbar is betterbar is betterbar is betterbar is betterbar is betterbar is betterbar is betterbar is betterbar is betterbar is betterbar is betterbar is betterbar is betterbar is betterbar is betterbar is betterbar is betterbar is betterbar is betterbar is betterbar is betterbar is betterbar is betterbar is betterbar is betterbar is betterbar is betterbar is betterbar is betterbar is betterbar is betterbar is betterbar is betterbar is betterbar is betterbar is betterbar is betterbar is betterbar is betterbar is betterbar is betterbar is betterbar is betterbar is betterbar is betterbar is betterbar is betterbar is betterbar is betterbar is betterbar is betterbar is betterbar is betterbar is betterbar is betterbar is betterbar is betterbar is betterbar is betterbar is betterbar is betterbar is betterbar is betterbar is betterbar is betterbar is betterbar is betterbar is betterbar is betterbar is betterbar is betterbar is betterq\rh\x08U\x03barq\x0eubh\x01)\x81q\x0f}q\x10(h\x04U\x053/3/3q\x11h\x06U\x13no one likes baz :(q\x12h\x08U\x03bazq\x13ube.'
# simulate a file here
import StringIO
picklefile = StringIO.StringIO(pickledata)

import pickle  # just for opcode names
import struct  # binary unpacking

def try_memo(f, v, cache):
    opcode = f.read(1)
    if opcode == pickle.BINPUT:
        cache[f.read(1)] = v
    elif opcode == pickle.LONG_BINPUT:
        print 'skipping LONG_BINPUT to save memory, LONG_BINGET will probably not be used'
        f.read(4)
    else:
        f.seek(f.tell() - 1)  # rewind

def try_read_string(f, opcode, cache):
    if opcode in [pickle.SHORT_BINSTRING, pickle.BINSTRING]:
        length_type = 'b' if opcode == pickle.SHORT_BINSTRING else 'i'
        str_length = struct.unpack(length_type, f.read(struct.calcsize(length_type)))[0]
        value = f.read(str_length)
        try_memo(f, value, cache)
        return value
    elif opcode == pickle.BINGET:
        return cache[f.read(1)]
    elif opcode == pickle.LONG_BINGET:
        raise Exception('Unexpected LONG_BINGET? Key ' + f.read(4))
    else:
        raise Exception('Invalid opcode ' + opcode + ' at pos ' + str(f.tell()))

memo_cache = {}
while True:
    c = picklefile.read(1)
    if c == pickle.NEWOBJ:
        while picklefile.read(1) != pickle.MARK:
            pass  # scan forward to field instantiation
        fields = {}
        while True:
            opcode = picklefile.read(1)
            if opcode == pickle.SETITEMS:
                break
            key = try_read_string(picklefile, opcode, memo_cache)
            value = try_read_string(picklefile, picklefile.read(1), memo_cache)
            fields[key] = value
        print 'Document', fields
        # insert into sqlite here
    elif c == pickle.STOP:
        break
This correctly reads my test data in pickle format 2 (modified to have a long string):
$ python picklereader.py
Document {'date': '1/1/1', 'text': 'foo is good', 'title': 'foo'}
Document {'date': '2/2/2', 'text': 'bar is betterbar is betterbar is betterbar is betterbar is betterbar is betterbar is betterbar is betterbar is betterbar is betterbar is betterbar is betterbar is betterbar is betterbar is betterbar is betterbar is betterbar is betterbar is betterbar is betterbar is betterbar is betterbar is betterbar is betterbar is betterbar is betterbar is betterbar is betterbar is betterbar is betterbar is betterbar is betterbar is betterbar is betterbar is betterbar is betterbar is betterbar is betterbar is betterbar is betterbar is betterbar is betterbar is betterbar is betterbar is betterbar is betterbar is betterbar is betterbar is betterbar is betterbar is betterbar is betterbar is betterbar is betterbar is betterbar is betterbar is betterbar is betterbar is betterbar is betterbar is betterbar is betterbar is betterbar is betterbar is betterbar is betterbar is betterbar is betterbar is betterbar is betterbar is betterbar is betterbar is betterbar is betterbar is betterbar is betterbar is betterbar is betterbar is betterbar is betterbar is betterbar is betterbar is betterbar is betterbar is betterbar is betterbar is betterbar is betterbar is betterbar is betterbar is betterbar is betterbar is betterbar is betterbar is betterbar is betterbar is betterbar is betterbar is betterbar is better', 'title': 'bar'}
Document {'date': '3/3/3', 'text': 'no one likes baz :(', 'title': 'baz'}
Good luck!
You did not pickle your data incrementally. You pickled your data monolithically and repeatedly. Each time around the loop, you destroyed whatever output data you had (open(..., 'wb') truncates the output file) and re-wrote all of the data again. In addition, if your program ever stopped and then restarted with new input data, the old output data was lost.
I do not know why objects didn't cause an out-of-memory error while you were pickling, since it grew to the same size as the object that pickle.load() wants to create.
Here is how you could have created the pickle file incrementally:
def save_objects(objects):
    with open('objects.pkl', 'ab') as output:  # Note: `ab` appends the data
        pickle.dump(objects, output, pickle.HIGHEST_PROTOCOL)

def Main():
    ...
    #objects=[] <-- lose the objects list
    with open('links2.txt', 'rb') as infile:
        for link in infile:
            ...
            save_objects(article)
You could then have read the pickle file incrementally like this:
import pickle
with open('objects.pkl', 'rb') as pickle_file:
    try:
        while True:
            article = pickle.load(pickle_file)
            print article
    except EOFError:
        pass
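Since the end goal is a database anyway, the incremental read loop pairs naturally with sqlite3. Here's a minimal sketch (Python 3 syntax; the Document class, table name, and column layout are my assumptions, and an in-memory BytesIO stands in for the pickle file) that reads one record per pickle.load() and inserts it into SQLite as it goes, so only one Document is ever held in memory:

```python
import io
import pickle
import sqlite3

class Document(object):  # assumed to match the asker's class
    def __init__(self, title, date, text):
        self.title = title
        self.date = date
        self.text = text

# Simulate an incrementally-written pickle file: one dump() per record.
buf = io.BytesIO()
for doc in [Document('foo', '1/1/1', 'foo is good'),
            Document('bar', '2/2/2', 'bar is better')]:
    pickle.dump(doc, buf, pickle.HIGHEST_PROTOCOL)
buf.seek(0)

conn = sqlite3.connect(':memory:')  # use a real file path in practice
conn.execute('CREATE TABLE documents (title TEXT, date TEXT, text TEXT)')

try:
    while True:
        article = pickle.load(buf)  # loads exactly one record per call
        conn.execute('INSERT INTO documents VALUES (?, ?, ?)',
                     (article.title, article.date, article.text))
except EOFError:
    pass
conn.commit()

rows = conn.execute('SELECT title, date FROM documents ORDER BY title').fetchall()
print(rows)
```

Once the records are in SQLite, you can query the 2.6 GB of congressional text with SQL instead of ever loading it all into memory again.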
The options I can think of are: