Reducing memory usage when slicing numpy arrays

jdb*_*ody 1 python linux garbage-collection memory-leaks numpy

I'm having trouble getting Python to release memory. The situation is basically this: I have a large dataset split across 4 files. Each file contains a list of 5000 numpy arrays of shape (3072, 412). I'm trying to extract columns 10 through 20 of each array into a new list.

What I want to do is read each file sequentially, extract the data I need, and free the memory I'm using before moving on to the next file. However, deleting the object, setting it to None, and calling gc.collect() doesn't seem to work. Here is the code snippet I'm using:

import gc
import joblib
import psutil

num_files = 4
start = 10
end = 20
fields = []
for j in range(num_files):
    print("Working on file ", j)
    source_filename = base_filename + str(j) + ".pkl"  # base_filename is defined earlier
    print("Memory before: ", psutil.virtual_memory())
    partial_db = joblib.load(source_filename)
    print("GC tracking for partial_db is ", gc.is_tracked(partial_db))
    print("Memory after loading partial_db: ", psutil.virtual_memory())
    for x in partial_db:
        fields.append(x[:, start:end])
    print("Memory after appending to fields: ", psutil.virtual_memory())
    print("GC Counts before del: ", gc.get_count())
    partial_db = None
    print("GC Counts after del: ", gc.get_count())
    gc.collect()
    print("GC Counts after collection: ", gc.get_count())
    print("Memory after freeing partial_db: ", psutil.virtual_memory())

Here is the output after a couple of files:

Working on file  0
Memory before:  svmem(total=67509161984, available=66177449984,percent=2.0, used=846712832, free=33569669120, active=27423051776, inactive=5678043136, buffers=22843392, cached=33069936640, shared=15945728)
GC tracking for partial_db is  True
Memory after loading partial_db:  svmem(total=67509161984, available=40785944576, percent=39.6, used=26238181376, free=8014237696, active=54070542336, inactive=4540620800, buffers=22892544, cached=33233850368, shared=15945728)
Memory after appending to fields:  svmem(total=67509161984, available=40785944576, percent=39.6, used=26238181376, free=8014237696, active=54070542336, inactive=4540620800, buffers=22892544, cached=33233850368, shared=15945728)
GC Counts before del:  (0, 7, 3)
GC Counts after del:  (0, 7, 3)
GC Counts after collection:  (0, 0, 0)
Memory after freeing partial_db:  svmem(total=67509161984, available=40785944576, percent=39.6, used=26238181376, free=8014237696, active=54070542336, inactive=4540620800, buffers=22892544, cached=33233850368, shared=15945728)
Working on file  1
Memory before:  svmem(total=67509161984, available=40785944576, percent=39.6, used=26238181376, free=8014237696, active=54070542336, inactive=4540620800, buffers=22892544, cached=33233850368, shared=15945728)
GC tracking for partial_db is  True
Memory after loading partial_db:  svmem(total=67509161984, available=15378006016, percent=77.2, used=51626561536, free=265465856, active=62507155456, inactive=3761905664, buffers=10330112, cached=15606804480, shared=15945728)
Memory after appending to fields:  svmem(total=67509161984, available=15378006016, percent=77.2, used=51626561536, free=265465856, active=62507155456, inactive=3761905664, buffers=10330112, cached=15606804480, shared=15945728)
GC Counts before del:  (0, 4, 2)
GC Counts after del:  (0, 4, 2)
GC Counts after collection:  (0, 0, 0)
Memory after freeing partial_db:  svmem(total=67509161984, available=15378006016, percent=77.2, used=51626561536, free=265465856, active=62507155456, inactive=3761905664, buffers=10330112, cached=15606804480, shared=15945728)

If I keep going, it uses up all the memory and triggers a MemoryError exception.

Does anyone know what I can do to make sure the data held by partial_db actually gets freed?

aba*_*ert 8

The problem is this:

for x in partial_db:
    fields.append(x[:,start:end])

The reason slicing a numpy array (unlike a plain Python list) takes almost no time and wastes no space is that it doesn't make a copy; it just creates another view into the array's memory. Normally, that's great. But here it means you keep the memory for x alive even after you release x itself, because you never release those slices.
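
You can see the view relationship directly through numpy's .base attribute, which points at the array that owns the memory. A minimal sketch (the shape comes from the question; the dtype is an assumption):

import numpy as np

a = np.zeros((3072, 412), dtype=np.float32)  # dtype assumed for illustration
s = a[:, 10:20]               # basic slicing: no copy, just a view
print(s.base is a)            # True -- s keeps a's entire buffer alive
c = a[:, 10:20].copy()        # an explicit copy
print(c.base is None)         # True -- c owns its own, much smaller buffer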

There are other ways around this, but the simplest is to append a copy of the slice:

for x in partial_db:
    fields.append(x[:,start:end].copy())
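
As a rough sanity check (a sketch assuming float32, which matches the roughly 25 GB jump per file in the output above): a retained view pins each array's full (3072, 412) buffer, while a copy holds only the 10 columns you actually need.

import numpy as np

x = np.zeros((3072, 412), dtype=np.float32)  # dtype assumed
kept = x[:, 10:20].copy()     # owns just the 10 columns
print(x.nbytes)               # 5062656 bytes pinned per array if you keep a view
print(kept.nbytes)            # 122880 bytes per array if you keep a copy
# over 5000 arrays per file: ~25 GB held by views vs ~0.6 GB by copies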