Free memory during a loop

Ali*_*MAR 10 python optimization performance json out-of-memory

I am running into a memory error in my code. My parser can be summarized as follows:

#!/usr/bin/env python
# coding=utf-8
import sys
import json
from collections import defaultdict


class MyParserIter(object):

    def _parse_line(self, line):
        for couple in line.split(","):
            key, value = couple.split(':')[0], couple.split(':')[1]
            self.__hash[key].append(value)

    def __init__(self, line):
        # Not the real parsing, just an example that parses each
        # line into a dict-like object
        self.__hash = defaultdict(list)
        self._parse_line(line)

    def __iter__(self):
        return iter(self.__hash.values())

    def to_dict(self):
        return self.__hash

    def __getitem__(self, item):
        return self.__hash[item]

    def free(self, item):
        self.__hash[item] = None

    def free_all(self):
        for k in self.__hash:
            self.free(k)

    def to_json(self):
        return json.dumps(self.to_dict())


def parse_file(file_path):
    list_result = []
    with open(file_path) as fin:
        for line in fin:
            parsed_line_obj = MyParserIter(line)
            list_result.append(parsed_line_obj)
    return list_result


def write_to_file(list_obj):
    with open("out.out", "w") as fout:
        for obj in list_obj:
            json_out = obj.to_json()
            fout.write(json_out + "\n")
            obj.free_all()
            obj = None

if __name__ == '__main__':
    result_list = parse_file('test.in')
    print(sys.getsizeof(result_list))
    write_to_file(result_list)
    print(sys.getsizeof(result_list))
    # the same result for memory usage result_list
    print(sys.getsizeof([None] * len(result_list)))
    # the result is not the same :(

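The aim is to parse a (large) file, converting each line into a JSON object that is written back to a file.

My goal is to reduce the memory footprint, because in some cases this code raises a memory error. After each fout.write I would like to delete the obj reference (free its memory).

I tried setting obj to None and calling obj.free_all(), but neither frees the memory. I also used simplejson instead of json, which reduced the footprint, but it is still too large in some cases.

test.in looks like: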

test1:OK,test3:OK,...
test1:OK,test3:OK,...
test1:OK,test3:OK,test4:test_again...
....
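A note on the measurement in the question, offered as a sketch rather than a diagnosis: sys.getsizeof reports only the list object's own size (its header plus its array of pointers), not the size of the objects it points to. It therefore cannot show whether the elements were released. The snippet below illustrates this with hypothetical names (items, before, after, exact). It also suggests why the third print in the question differs: in CPython, a list grown by append is usually over-allocated, while [None] * n is sized exactly.

```python
import sys

# Build a list by appending, as parse_file does (a comprehension appends too).
items = [str(i) * 1000 for i in range(1000)]
before = sys.getsizeof(items)

# Replace every element with None; the list itself keeps the same capacity.
for i in range(len(items)):
    items[i] = None
after = sys.getsizeof(items)

# A list created in one step is allocated with exactly the slots it needs.
exact = sys.getsizeof([None] * len(items))

print(before, after, exact)
```

Here before and after are equal even though the large strings are gone, so a tool that measures process memory (or a deep-size helper) is needed to see the effect of freeing elements.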

Tom*_*zes 2

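For obj to be freed, all references to it must be removed. Your loop does not do that, because the reference held in list_obj still exists. The following fixes the problem: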

def write_to_file(list_obj):
    with open("out.out", "w") as fout:
        for ix in range(len(list_obj)):
            obj = list_obj[ix]
            list_obj[ix] = None
            json_out = obj.to_json()
            fout.write(json_out + "\n")
            obj.free_all()
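The difference between rebinding the loop variable and clearing the list slot can be checked with weak references. This is a minimal sketch, assuming CPython's reference counting (other interpreters may release objects later); Record and count_alive are hypothetical stand-ins, not part of the original code.

```python
import weakref


class Record(object):
    """Hypothetical stand-in for MyParserIter."""

    def __init__(self, payload):
        self.payload = payload


def count_alive(refs):
    # A weak reference returns None once its target has been collected.
    return sum(1 for ref in refs if ref() is not None)


records = [Record(i) for i in range(3)]
refs = [weakref.ref(record) for record in records]

# Rebinding only the loop variable leaves the list slots untouched,
# so every object is still referenced by the list.
for obj in records:
    obj = None
alive_after_rebinding = count_alive(refs)

# Clearing the slot and dropping the local name removes the last reference.
for ix in range(len(records)):
    obj = records[ix]
    records[ix] = None
    del obj
alive_after_clearing = count_alive(refs)

print(alive_after_rebinding, alive_after_clearing)
```

Under those assumptions, all objects survive the first loop and none survive the second.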

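Alternatively, you could destructively pop each element off the front of list_obj, although this may cause performance problems if list_obj has to be reallocated too many times. I have not tried this, so I am not sure. That version would look like this: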

def write_to_file(list_obj):
    with open("out.out", "w") as fout:
        while len(list_obj) > 0:
            obj = list_obj.pop(0)
            json_out = obj.to_json()
            fout.write(json_out + "\n")
            obj.free_all()
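If popping from the front of a large list does turn out to be slow (list.pop(0) has to shift every remaining element), one possible variant is to drain the items through a collections.deque, whose popleft is constant time. This is only a sketch: Record and drain_to_file are hypothetical names, and the code assumes each object exposes to_json() as in the question. Copying into the deque costs one extra array of pointers, not a copy of the objects.

```python
import io
import json
from collections import deque


class Record(object):
    """Hypothetical stand-in for MyParserIter, exposing to_json()."""

    def __init__(self, data):
        self.data = data

    def to_json(self):
        return json.dumps(self.data)


def drain_to_file(list_obj, fout):
    # Move the references into a deque, then empty the caller's list in place
    # so the deque holds the only remaining container references.
    queue = deque(list_obj)
    del list_obj[:]
    while queue:
        obj = queue.popleft()  # O(1), unlike list.pop(0)
        fout.write(obj.to_json() + "\n")
        # obj is rebound on the next iteration, releasing the previous object.


items = [Record({"test1": [str(i)]}) for i in range(3)]
buf = io.StringIO()
drain_to_file(items, buf)
```

Like the pop(0) version, this empties the caller's list, so it only fits when the parsed objects are not needed after writing.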