Pickle file too large to load

Vin*_*iri 17 python sql out-of-memory pickle

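My problem is that I have a very large pickle file (2.6 GB) that I am trying to open, but every time I do I get a memory error. I realize now that I should have used a database to store all the information, but it is too late for that. The pickle file contains dates and text from the U.S. Congressional Record, crawled from the internet (the crawl took about 2 weeks to run). Is there any way to access the information I dumped into the pickle file incrementally, or to convert the pickle file into a SQL database or something else I can open without re-entering all the data? I really don't want to spend another 2 weeks re-crawling the Congressional Record and loading the data into a database.

Thanks a lot for your help.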

EDIT

Code for how the objects get pickled:

def save_objects(objects): 
    with open('objects.pkl', 'wb') as output: 
        pickle.dump(objects, output, pickle.HIGHEST_PROTOCOL)

def Main():   
    Links()
    file = open("datafile.txt", "w")
    objects=[]
    with open('links2.txt', 'rb') as infile:
        for link in infile: 
            print link
            title,text,date=Get_full_text(link)
            article=Doccument(title,date,text)
            if text != None:
                write_to_text(date,text)
                objects.append(article)
                save_objects(objects)

This is the program that fails with the error:

def Main():
    file= open('objects1.pkl', 'rb') 
    object = pickle.load(file)

vge*_*gel 32

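Looks like you're in a bit of a pickle! ;-). Hopefully after this, you'll NEVER USE PICKLE EVER. It's not a very good data storage format.

Anyway, for this answer I'm assuming your Document class looks a bit like this. If not, please comment with your actual Document class: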

class Document(object): # <-- object part is very important! If it's not there, the format is different!
    def __init__(self, title, date, text): # assuming all strings
        self.title = title
        self.date = date
        self.text = text

Next, I made some simple test data with this class:

d = [Document(title='foo', text='foo is good', date='1/1/1'), Document(title='bar', text='bar is better', date='2/2/2'), Document(title='baz', text='no one likes baz :(', date='3/3/3')]

I pickled it with format 2 (which is what pickle.HIGHEST_PROTOCOL means on Python 2.x):

>>> s = pickle.dumps(d, 2)
>>> s
'\x80\x02]q\x00(c__main__\nDocument\nq\x01)\x81q\x02}q\x03(U\x04dateq\x04U\x051/1/1q\x05U\x04textq\x06U\x0bfoo is goodq\x07U\x05titleq\x08U\x03fooq\tubh\x01)\x81q\n}q\x0b(h\x04U\x052/2/2q\x0ch\x06U\rbar is betterq\rh\x08U\x03barq\x0eubh\x01)\x81q\x0f}q\x10(h\x04U\x053/3/3q\x11h\x06U\x13no one likes baz :(q\x12h\x08U\x03bazq\x13ube.'

And disassembled it with pickletools:

>>> pickletools.dis(s)
    0: \x80 PROTO      2
    2: ]    EMPTY_LIST
    3: q    BINPUT     0
    5: (    MARK
    6: c        GLOBAL     '__main__ Document'
   25: q        BINPUT     1
   27: )        EMPTY_TUPLE
   28: \x81     NEWOBJ
   29: q        BINPUT     2
   31: }        EMPTY_DICT
   32: q        BINPUT     3
   34: (        MARK
   35: U            SHORT_BINSTRING 'date'
   41: q            BINPUT     4
   43: U            SHORT_BINSTRING '1/1/1'
   50: q            BINPUT     5
   52: U            SHORT_BINSTRING 'text'
   58: q            BINPUT     6
   60: U            SHORT_BINSTRING 'foo is good'
   73: q            BINPUT     7
   75: U            SHORT_BINSTRING 'title'
   82: q            BINPUT     8
   84: U            SHORT_BINSTRING 'foo'
   89: q            BINPUT     9
   91: u            SETITEMS   (MARK at 34)
   92: b        BUILD
   93: h        BINGET     1
   95: )        EMPTY_TUPLE
   96: \x81     NEWOBJ
   97: q        BINPUT     10
   99: }        EMPTY_DICT
  100: q        BINPUT     11
  102: (        MARK
  103: h            BINGET     4
  105: U            SHORT_BINSTRING '2/2/2'
  112: q            BINPUT     12
  114: h            BINGET     6
  116: U            SHORT_BINSTRING 'bar is better'
  131: q            BINPUT     13
  133: h            BINGET     8
  135: U            SHORT_BINSTRING 'bar'
  140: q            BINPUT     14
  142: u            SETITEMS   (MARK at 102)
  143: b        BUILD
  144: h        BINGET     1
  146: )        EMPTY_TUPLE
  147: \x81     NEWOBJ
  148: q        BINPUT     15
  150: }        EMPTY_DICT
  151: q        BINPUT     16
  153: (        MARK
  154: h            BINGET     4
  156: U            SHORT_BINSTRING '3/3/3'
  163: q            BINPUT     17
  165: h            BINGET     6
  167: U            SHORT_BINSTRING 'no one likes baz :('
  188: q            BINPUT     18
  190: h            BINGET     8
  192: U            SHORT_BINSTRING 'baz'
  197: q            BINPUT     19
  199: u            SETITEMS   (MARK at 153)
  200: b        BUILD
  201: e        APPENDS    (MARK at 5)
  202: .    STOP

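Looks complex! But really, it's not so bad. pickle is basically a stack machine: each ALL_CAPS identifier you see is an opcode that manipulates an internal "stack" in some way during decoding. If we were trying to parse some complex structure this would matter more, but luckily we're just building a simple list of what are essentially tuples. All this "code" does is construct a bunch of objects on the stack and then push the entire stack into a list.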

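The one thing we DO need to care about is the BINPUT / BINGET opcodes you see scattered around. These are for "memoization": to reduce the data footprint, pickle saves each string with a BINPUT <id>, and if the same string comes up again, instead of dumping it a second time it simply emits a BINGET <id> to retrieve it from the cache.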
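
The memoization is easy to observe on a small pickle. The following is a minimal sketch, written for Python 3 (the thread itself uses Python 2), that pickles a list holding the same string object three times and counts the opcodes with `pickletools.genops`. The variable names are illustrative only; note that Python 3 writes `str` values as BINUNICODE rather than SHORT_BINSTRING.

```python
import pickle
import pickletools

text = "congress"
# The same string object appears three times in the list.
payload = pickle.dumps([text, text, text], 2)  # protocol 2, as in the thread

# genops yields (opcode, argument, position) without building any objects.
names = [opcode.name for opcode, arg, pos in pickletools.genops(payload)]
print(names)

string_writes = names.count("BINUNICODE")  # full copies of the string data
memo_reads = names.count("BINGET")         # back-references into the memo
print(string_writes, memo_reads)
```

The string data is written once and the two later occurrences become BINGET back-references, which is why a streaming reader has to keep at least part of the memo around.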

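Also, another complication! There is more than just SHORT_BINSTRING: there is the normal BINSTRING for strings of 256 bytes or more, plus some fun unicode variants. I'm just going to assume you're using Python 2 with all-ASCII strings. Again, comment if that isn't a correct assumption.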
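
The two string opcodes differ only in how the length is encoded: SHORT_BINSTRING (`U`) is followed by a 1-byte unsigned length, BINSTRING (`T`) by a 4-byte little-endian signed length. Here is a small Python 3 sketch of that decoding step; `read_string` is a hypothetical helper written for illustration, not part of the `pickle` module.

```python
import struct

def read_string(buf, pos):
    """Decode one SHORT_BINSTRING or BINSTRING that starts at buf[pos].

    Returns (string bytes, position just past the string).
    """
    opcode = buf[pos:pos + 1]
    if opcode == b"U":    # SHORT_BINSTRING: 1-byte unsigned length
        (length,) = struct.unpack_from("<B", buf, pos + 1)
        start = pos + 2
    elif opcode == b"T":  # BINSTRING: 4-byte little-endian signed length
        (length,) = struct.unpack_from("<i", buf, pos + 1)
        start = pos + 5
    else:
        raise ValueError("not a string opcode: %r" % opcode)
    return buf[start:start + length], start + length

short_value, short_end = read_string(b"U\x0bfoo is good", 0)
long_value, long_end = read_string(b"T\x14\x05\x00\x00" + b"bar is better" * 100, 0)
print(short_value, len(long_value))
```

The 1-byte length has to be read as unsigned; reading it as a signed byte would turn lengths of 128 to 255 into negative numbers.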

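OK, so we need to stream the file until we hit a '\x81' byte (NEWOBJ). Then we scan forward until we hit a '(' (MARK) character. From there, until we hit a 'u' (SETITEMS), we read pairs of key/value strings. There should be 3 pairs in total, one for each field.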
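
As an alternative way to see these steps in action, the same scan can be sketched in Python 3 on top of the standard library's `pickletools.genops`, which yields opcodes one at a time from a file-like object without constructing the pickled objects. This is a swapped-in technique, not the hand-rolled reader this answer builds; `iter_documents` and `PICKLED` are illustrative names, and the sketch assumes every field value is a string. Like the approach described here, it only memoizes 1-byte BINPUT ids so the memo stays small.

```python
import io
import pickletools

# The protocol-2 test pickle from this answer (three Document objects).
PICKLED = (
    b'\x80\x02]q\x00(c__main__\nDocument\nq\x01)\x81q\x02}q\x03('
    b'U\x04dateq\x04U\x051/1/1q\x05U\x04textq\x06U\x0bfoo is goodq\x07'
    b'U\x05titleq\x08U\x03fooq\tubh\x01)\x81q\n}q\x0b('
    b'h\x04U\x052/2/2q\x0ch\x06U\rbar is betterq\rh\x08U\x03barq\x0eub'
    b'h\x01)\x81q\x0f}q\x10(h\x04U\x053/3/3q\x11h\x06'
    b'U\x13no one likes baz :(q\x12h\x08U\x03bazq\x13ube.'
)

STRING_OPS = {"SHORT_BINSTRING", "BINSTRING", "BINUNICODE", "SHORT_BINUNICODE"}

def iter_documents(stream):
    """Yield one dict of fields per pickled object, reading opcode by opcode."""
    memo = {}
    state = "idle"      # "idle" -> "new" (saw NEWOBJ) -> "fields" (saw MARK)
    items = []          # alternating key, value strings for the current object
    last_string = None  # string produced by the immediately preceding opcode
    for opcode, arg, _pos in pickletools.genops(stream):
        name = opcode.name
        if name in STRING_OPS:
            # Python 3 returns Python 2 byte strings decoded as latin-1 text.
            last_string = arg
            if state == "fields":
                items.append(arg)
            continue
        if name == "BINPUT":
            if last_string is not None:
                memo[arg] = last_string
        elif name in ("BINGET", "LONG_BINGET"):
            if state == "fields":
                items.append(memo[arg])  # KeyError if the id was never memoized
        elif name == "NEWOBJ":
            state = "new"
        elif name == "MARK" and state == "new":
            state, items = "fields", []
        elif name == "SETITEMS" and state == "fields":
            yield dict(zip(items[0::2], items[1::2]))
            state = "idle"
        last_string = None

for doc in iter_documents(io.BytesIO(PICKLED)):
    print(doc)
```

For the real file, `io.BytesIO(PICKLED)` would be replaced by the file opened in `'rb'` mode; because nothing is unpickled, the `Document` class does not need to be importable.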

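So, let's do this. Here's my script for reading pickle data in a streaming fashion. It's far from perfect, since I just hacked it together for this answer, and you'll need to modify it a lot, but it's a good start.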

pickledata = '\x80\x02]q\x00(c__main__\nDocument\nq\x01)\x81q\x02}q\x03(U\x04dateq\x04U\x051/1/1q\x05U\x04textq\x06U\x0bfoo is goodq\x07U\x05titleq\x08U\x03fooq\tubh\x01)\x81q\n}q\x0b(h\x04U\x052/2/2q\x0ch\x06T\x14\x05\x00\x00bar is betterbar is betterbar is betterbar is betterbar is betterbar is betterbar is betterbar is betterbar is betterbar is betterbar is betterbar is betterbar is betterbar is betterbar is betterbar is betterbar is betterbar is betterbar is betterbar is betterbar is betterbar is betterbar is betterbar is betterbar is betterbar is betterbar is betterbar is betterbar is betterbar is betterbar is betterbar is betterbar is betterbar is betterbar is betterbar is betterbar is betterbar is betterbar is betterbar is betterbar is betterbar is betterbar is betterbar is betterbar is betterbar is betterbar is betterbar is betterbar is betterbar is betterbar is betterbar is betterbar is betterbar is betterbar is betterbar is betterbar is betterbar is betterbar is betterbar is betterbar is betterbar is betterbar is betterbar is betterbar is betterbar is betterbar is betterbar is betterbar is betterbar is betterbar is betterbar is betterbar is betterbar is betterbar is betterbar is betterbar is betterbar is betterbar is betterbar is betterbar is betterbar is betterbar is betterbar is betterbar is betterbar is betterbar is betterbar is betterbar is betterbar is betterbar is betterbar is betterbar is betterbar is betterbar is betterbar is betterbar is betterbar is betterbar is betterbar is betterq\rh\x08U\x03barq\x0eubh\x01)\x81q\x0f}q\x10(h\x04U\x053/3/3q\x11h\x06U\x13no one likes baz :(q\x12h\x08U\x03bazq\x13ube.'

# simulate a file here
import StringIO
picklefile = StringIO.StringIO(pickledata)

import pickle # just for opcode names
import struct # binary unpacking

def try_memo(f, v, cache):
    opcode = f.read(1)
    if opcode == pickle.BINPUT:
        cache[f.read(1)] = v
    elif opcode == pickle.LONG_BINPUT:
        print 'skipping LONG_BINPUT to save memory, LONG_BINGET will probably not be used'
        f.read(4)
    else:
        f.seek(f.tell() - 1) # rewind

def try_read_string(f, opcode, cache):
    if opcode in [ pickle.SHORT_BINSTRING, pickle.BINSTRING ]:
        # SHORT_BINSTRING: 1-byte unsigned length; BINSTRING: 4-byte little-endian signed length
        length_type = '<B' if opcode == pickle.SHORT_BINSTRING else '<i'
        str_length = struct.unpack(length_type, f.read(struct.calcsize(length_type)))[0]
        value = f.read(str_length)
        try_memo(f, value, cache)
        return value
    elif opcode == pickle.BINGET:
        return cache[f.read(1)]
    elif opcode == pickle.LONG_BINGET:
        raise Exception('Unexpected LONG_BINGET? Key ' + repr(f.read(4)))
    else:
        raise Exception('Invalid opcode ' + repr(opcode) + ' at pos ' + str(f.tell()))

memo_cache = {}
while True:
    c = picklefile.read(1)
    if c == pickle.NEWOBJ:
        while picklefile.read(1) != pickle.MARK:
            pass # scan forward to field instantiation
        fields = {}
        while True:
            opcode = picklefile.read(1)
            if opcode == pickle.SETITEMS:
                break
            key = try_read_string(picklefile, opcode, memo_cache)
            value = try_read_string(picklefile, picklefile.read(1), memo_cache)
            fields[key] = value
        print 'Document', fields
        # insert into SQLite here
    elif c == pickle.STOP or c == '':  # STOP opcode, or EOF on a truncated file
        break

This correctly reads my test data in pickle format 2 (modified to contain a long string):

$ python picklereader.py
Document {'date': '1/1/1', 'text': 'foo is good', 'title': 'foo'}
Document {'date': '2/2/2', 'text': 'bar is betterbar is betterbar is betterbar is betterbar is betterbar is betterbar is betterbar is betterbar is betterbar is betterbar is betterbar is betterbar is betterbar is betterbar is betterbar is betterbar is betterbar is betterbar is betterbar is betterbar is betterbar is betterbar is betterbar is betterbar is betterbar is betterbar is betterbar is betterbar is betterbar is betterbar is betterbar is betterbar is betterbar is betterbar is betterbar is betterbar is betterbar is betterbar is betterbar is betterbar is betterbar is betterbar is betterbar is betterbar is betterbar is betterbar is betterbar is betterbar is betterbar is betterbar is betterbar is betterbar is betterbar is betterbar is betterbar is betterbar is betterbar is betterbar is betterbar is betterbar is betterbar is betterbar is betterbar is betterbar is betterbar is betterbar is betterbar is betterbar is betterbar is betterbar is betterbar is betterbar is betterbar is betterbar is betterbar is betterbar is betterbar is betterbar is betterbar is betterbar is betterbar is betterbar is betterbar is betterbar is betterbar is betterbar is betterbar is betterbar is betterbar is betterbar is betterbar is betterbar is betterbar is betterbar is betterbar is betterbar is betterbar is betterbar is betterbar is better', 'title': 'bar'}
Document {'date': '3/3/3', 'text': 'no one likes baz :(', 'title': 'baz'}
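
Once each document is available as a dict of fields like the ones printed above, writing it to SQLite takes only a few lines. The following Python 3 sketch uses the standard `sqlite3` module; the table name, the column names and the `store_documents` helper are assumptions made for illustration, and an in-memory database stands in for a real file.

```python
import sqlite3

def store_documents(conn, documents):
    """Insert an iterable of {'title', 'date', 'text'} dicts into SQLite."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS documents (title TEXT, doc_date TEXT, body TEXT)"
    )
    # executemany consumes the iterable lazily, so a generator of parsed
    # documents can be passed in without holding them all in memory.
    conn.executemany(
        "INSERT INTO documents (title, doc_date, body) VALUES (:title, :date, :text)",
        documents,
    )
    conn.commit()

conn = sqlite3.connect(":memory:")  # use a file path for a persistent database
store_documents(conn, [
    {"date": "1/1/1", "text": "foo is good", "title": "foo"},
    {"date": "2/2/2", "text": "bar is better", "title": "bar"},
])
rows = conn.execute("SELECT title, doc_date FROM documents ORDER BY title").fetchall()
print(rows)
```

For a 2.6 GB input it is probably worth committing in batches rather than once at the end, so that an interrupted run does not lose everything.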

Good luck!


Rob*_*obᵩ 9

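You did not pickle your data incrementally. You pickled your data monolithically and repeatedly. Each time around the loop, you destroyed whatever output data you had (open(...,'wb') truncates the output file) and re-wrote all of the data again. Additionally, if your program ever stopped and was then restarted with new input data, the old output data was lost.

I do not know why objects did not cause an out-of-memory error while you were pickling, because it grew to the same size as the object that pickle.load() now wants to create.

Here is how you could have created the pickle file incrementally: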

def save_objects(objects): 
    with open('objects.pkl', 'ab') as output:  # Note: `ab` appends the data
        pickle.dump(objects, output, pickle.HIGHEST_PROTOCOL)

def Main():
    ...
    #objects=[] <-- lose the objects list
    with open('links2.txt', 'rb') as infile:
        for link in infile: 
            ... 
            save_objects(article)

Then you could have read the pickle file incrementally, like so:

import pickle
with open('objects.pkl', 'rb') as pickle_file:
    try:
        while True:
            article = pickle.load(pickle_file)
            print article
    except EOFError:
        pass
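
For reference, here is a self-contained round trip of the append-then-read pattern, as a Python 3 sketch. The helper names `append_record` and `iter_records` are illustrative, plain dicts stand in for the `Document` objects, and a temporary directory stands in for the real output location.

```python
import os
import pickle
import tempfile

def append_record(path, record):
    """Append one pickled record to the end of the file."""
    with open(path, "ab") as output:
        pickle.dump(record, output, pickle.HIGHEST_PROTOCOL)

def iter_records(path):
    """Yield records one at a time until the file is exhausted."""
    with open(path, "rb") as pickle_file:
        while True:
            try:
                yield pickle.load(pickle_file)
            except EOFError:
                return

path = os.path.join(tempfile.mkdtemp(), "objects.pkl")
for i in range(3):
    append_record(path, {"title": "doc %d" % i, "date": "1/1/1", "text": "..."})

loaded = list(iter_records(path))
print(len(loaded), loaded[-1]["title"])
```

Each `pickle.dump` call writes a complete, independent pickle, so only one record needs to fit in memory at a time when reading the file back.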

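The options that I can think of are:

  • Try cPickle. It might help.
  • Try streaming-pickle.
  • Read your pickle file in a 64-bit environment with lots and lots of RAM.
  • Re-crawl the original data, this time actually storing the data incrementally, or storing it in a database. Without the inefficiency of constantly re-writing your pickle output file, your crawl might go significantly faster this time.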

  • Oh, then the answer is obvious: extract the data from the pickle file on the original computer. (5 upvotes)