Python memory leak in large data structures (lists, dicts): what could be the cause?

Question

sos*_*ial 7 python memory-leaks memory-management python-3.x

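The code is very simple. It shouldn't have any leaks, because everything is done inside the function and nothing is returned. I have a function that loops over all the lines in a file (~20 MiB) and puts them all into a list.
The function in question: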

def read_art_file(filename, path_to_dir):
    import codecs
    corpus = []
    corpus_file = codecs.open(path_to_dir + filename, 'r', 'iso-8859-15')
    newline = corpus_file.readline().strip()
    while newline != '':
        # we put into @article a @newline of file and some other info
        # (i left those lists blank for readability)
        article = [newline, [], [], [], [], [], [], [], [], [], [], [], []]
        corpus.append(article)
        del newline
        del article
        newline = corpus_file.readline().strip()
    memory_usage('inside function')
    for article in corpus:
        for word in article:
            del word
        del article
    del corpus
    corpus_file.close()
    memory_usage('inside: after corp deleted')
    return


Here is the main code:

import time

memory_usage('START')
path_to_dir = '/home/soshial/internship/training_data/parser_output/'
read_art_file('accounting.n.txt.wpr.art', path_to_dir)
memory_usage('outside func')
time.sleep(5)
memory_usage('END')


All memory_usage does is print the amount of memory, in KiB, allocated by the script.

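Executing the script

If I run the script, it gives me:

> START memory: 6088 KiB
> inside function memory: 393752 KiB (the 20 MiB file plus the lists occupy about 400 MiB)
> inside: after corp deleted memory: 43360 KiB
> outside func memory: 34300 KiB (34300 - 6088 = 28212 KiB, about 28 MiB leaked)
> FINISH memory: 34300 KiB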

Executing without the list

If I do exactly the same thing, but with the line that appends article to corpus commented out:

article = [newline, [], [], [], [], [], ...]  # we still assign data to `article`
# corpus.append(article)  # this line is commented out during the second execution


then the output is:

> START memory: 6076 KiB
> inside function memory: 6076 KiB
> inside: after corp deleted memory: 6076 KiB
> outside func memory: 6076 KiB
> FINISH memory: 6076 KiB

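Question:

So in this case all the memory is released. I need all the memory to be released, because I am going to process hundreds of such files.
Am I doing something wrong, or is this a CPython interpreter bug?

UPD. This is how I check memory consumption (taken from another Stack Overflow question):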

def memory_usage(text = ''):
    """Memory usage of the current process in kilobytes."""
    status = None
    result = {'peak': 0, 'rss': 0}
    try:
        # This will only work on systems with a /proc file system
        # (like Linux).
        status = open('/proc/self/status')
        for line in status:
            parts = line.split()
            key = parts[0][2:-1].lower()
            if key in result:
                result[key] = int(parts[1])
    finally:
        if status is not None:
            status.close()
    print('>', text, 'memory:', result['rss'], 'KiB  ')
    return


Answer 1

mgi*_*son 6

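Note that Python never guarantees that any memory your code uses will actually be returned to the OS. All that garbage collection guarantees is that the memory used by a collected object is free to be used by another object at some future time.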
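One way to check this distinction is to compare the RSS figure from the question with `tracemalloc`, which counts only the blocks that live Python objects currently hold. The following is a minimal sketch, not the asker's code: `build_corpus` is a hypothetical stand-in for `read_art_file` that builds a synthetic corpus in memory.

```python
import tracemalloc


def build_corpus(n_articles):
    """Hypothetical stand-in for read_art_file: a list of small nested lists."""
    return [["line %d" % i] + [[] for _ in range(12)] for i in range(n_articles)]


def traced_kib():
    """Memory currently held by live, traced Python objects, in KiB."""
    current, _peak = tracemalloc.get_traced_memory()
    return current // 1024


tracemalloc.start()
before = traced_kib()

corpus = build_corpus(20_000)
during = traced_kib()

del corpus  # no reference cycles here, so reference counting frees everything
after = traced_kib()
tracemalloc.stop()

print("before:", before, "KiB")
print("during:", during, "KiB")
print("after: ", after, "KiB")
```

If the traced figure drops back close to its starting value while RSS stays high, the objects really were freed, and the remaining RSS is memory the allocator is holding on to for reuse.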

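From what I have read¹ about CPython's memory allocator implementation, memory is allocated in "pools" for efficiency. When a pool fills up, Python allocates a new one. If a pool contains only dead objects, CPython does free the memory associated with that pool; otherwise it does not. This can leave several partially full pools hanging around after a function has finished. However, that does not make it a "memory leak": CPython still knows about the memory and could free it at some later time.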
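The effect described above can be illustrated with a toy model. This is only a sketch of the idea (fixed-size pools that can be released only when completely empty), not CPython's actual allocator, which is more elaborate; `ToyPoolAllocator` and its methods are hypothetical names.

```python
class ToyPoolAllocator:
    """Toy model: fixed-size pools; a pool is released only when it is empty."""

    def __init__(self, slots_per_pool):
        self.slots_per_pool = slots_per_pool
        self.pools = []  # each pool is a set of live object ids

    def allocate(self, obj_id):
        # Reuse the first pool with a free slot; otherwise open a new pool.
        for pool in self.pools:
            if len(pool) < self.slots_per_pool:
                pool.add(obj_id)
                return
        self.pools.append({obj_id})

    def free(self, obj_id):
        for pool in self.pools:
            if obj_id in pool:
                pool.remove(obj_id)
                break
        # Only completely empty pools can be handed back.
        self.pools = [pool for pool in self.pools if pool]


alloc = ToyPoolAllocator(slots_per_pool=10)
for i in range(1000):
    alloc.allocate(i)
pools_full = len(alloc.pools)

# Free 90% of the objects, leaving one survivor in every pool.
for i in range(1000):
    if i % 10 != 0:
        alloc.free(i)
pools_fragmented = len(alloc.pools)

# Free the survivors as well.
for i in range(0, 1000, 10):
    alloc.free(i)
pools_empty = len(alloc.pools)

print(pools_full, pools_fragmented, pools_empty)  # prints: 100 100 0
```

In this model, freeing 900 of the 1000 objects releases no pool at all, because every pool still holds one live object. The free slots are reusable by later allocations, which is why this is fragmentation rather than a leak.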

¹ I am not a Python developer, so these details may be incorrect or at least incomplete.
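When hundreds of files have to be processed and the memory must go back to the OS between them, one common workaround is to do the per-file work in a short-lived child process, since a process's memory is returned to the operating system when it exits. A minimal sketch, assuming placeholder file names and a synthetic per-file job (`CHILD_CODE` and `process_in_child` are hypothetical names, not part of the question's code):

```python
import subprocess
import sys

# Code run by each child interpreter. The corpus built here is a synthetic
# stand-in for the real per-file work (an assumption for illustration).
CHILD_CODE = """
import sys
filename = sys.argv[1]
corpus = [[str(i)] + [[] for _ in range(12)] for i in range(10000)]
print(filename, len(corpus))
"""


def process_in_child(filename):
    """Run the per-file job in a fresh interpreter and return its output."""
    done = subprocess.run(
        [sys.executable, "-c", CHILD_CODE, filename],
        capture_output=True, text=True, check=True,
    )
    return done.stdout.strip()


# Placeholder file names; each one is handled by its own child process.
results = [process_in_child(name) for name in ["a.art", "b.art", "c.art"]]
print(results)  # prints: ['a.art 10000', 'b.art 10000', 'c.art 10000']
```

The parent only ever holds the small result strings, so its own memory use stays flat regardless of how large each child's corpus gets. `multiprocessing.Pool` with `maxtasksperchild=1` is another way to get a fresh worker process per task.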
