How to destroy Python objects and free memory

Tha*_*eed 13 python memory-management out-of-memory pandas

I'm trying to iterate over more than 100,000 images, capture some image features for each, and store the resulting DataFrame on disk as a pickle file.

Unfortunately, due to RAM limitations, I'm forced to split the images into chunks of 20,000, process each chunk, and then save the results to disk.

The code below is supposed to save the results DataFrame for one batch of 20,000 images before the loop moves on to the next 20,000 images.

However, this doesn't seem to solve my problem, because the memory isn't released from RAM at the end of the first pass through the for loop.

As a result, the program crashes with an out-of-memory error while processing around the 50,000th record.

I've tried deleting the objects after saving them to disk and then calling the garbage collector, but RAM usage doesn't seem to drop.

What am I missing?

from multiprocessing.pool import ThreadPool
import gc
import pandas as pd

# file_list_1 contains 100,000 images
file_list_chunks = list(divide_chunks(file_list_1, 20000))
for count, f in enumerate(file_list_chunks):
    # make the Pool of workers
    pool = ThreadPool(64)
    results = pool.map(get_image_features, f)
    list_a, list_b = zip(*results)
    df = pd.DataFrame({'filename': list_a, 'image_features': list_b})
    df.to_pickle("PATH_TO_FILE" + str(count) + ".pickle")
    del list_a
    del list_b
    del df
    gc.collect()
    # close the pool and wait for the work to finish
    pool.close()
    pool.join()
    print("pool closed")

And*_*den 6

Now, it could be that something around the 50,000th image is very large, and that's what's causing the OOM, so to test this I'd first try:

file_list_chunks = list(divide_chunks(file_list_1[30000:], 20000))

If it fails at around 10,000, that will confirm 20k is too big a chunk size; if it fails at 50,000 again, there is an issue with the code...


Okay, onto the code...

Firstly, you don't need the explicit list constructor; it's much better in Python to iterate lazily rather than generate the entire list in memory.

file_list_chunks = list(divide_chunks(file_list_1,20000))
# becomes
file_list_chunks = divide_chunks(file_list_1,20000)
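
For that to pay off, divide_chunks itself needs to be a generator rather than building the full list internally. A minimal sketch, assuming the original helper just slices a list:

def divide_chunks(lst, n):
    # Lazily yield successive n-sized chunks instead of materialising them all at once.
    for i in range(0, len(lst), n):
        yield lst[i:i + n]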

I think you might be misusing ThreadPool here:

Prevents any more tasks from being submitted to the pool. Once all the tasks have been completed the worker processes will exit.

This reads as though close might be called while some tasks are still running; although I guess this is safe, it feels a little un-Pythonic. It's better to use ThreadPool as a context manager:

with ThreadPool(64) as pool: 
    results = pool.map(get_image_features,f)
    # etc.

The explicit dels in python aren't actually guaranteed to free memory.

You should only collect after the join / after the with block:

with ThreadPool(64) as pool:
    ...
    pool.close()
    pool.join()
gc.collect()

You could also try chunking this into smaller pieces, e.g. 10,000 images at a time, or even fewer! See the sketch below for how these pieces fit together.
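
Putting those suggestions together, the loop might look something like this (just a sketch, reusing divide_chunks, get_image_features and the output path from the question, with the smaller 10,000-image chunks suggested above):

from multiprocessing.pool import ThreadPool
import gc
import pandas as pd

for count, f in enumerate(divide_chunks(file_list_1, 10000)):
    with ThreadPool(64) as pool:
        results = pool.map(get_image_features, f)
        pool.close()
        pool.join()
    list_a, list_b = zip(*results)
    df = pd.DataFrame({'filename': list_a, 'image_features': list_b})
    df.to_pickle("PATH_TO_FILE" + str(count) + ".pickle")
    del results, list_a, list_b, df
    gc.collect()  # collect only once the pool and the chunk's objects are gone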


Hammer 1

One thing I would consider doing here, instead of using pandas DataFrames and large lists, is to use a SQL database; you can do this locally with sqlite3:

import sqlite3
conn = sqlite3.connect(':memory:', check_same_thread=False)  # or, use a file e.g. 'image-features.db'

and use a context manager:

with conn:
    conn.execute('''CREATE TABLE images
                    (filename text, features text)''')

with conn:
    # Insert a row of data
    conn.execute("INSERT INTO images VALUES ('my-image.png','feature1,feature2')")
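
In practice the values would come from your feature extraction rather than being literals, so a parameterised insert is the form you'd actually use (a sketch; filename and features stand for whatever the worker produced):

with conn:
    conn.execute("INSERT INTO images VALUES (?, ?)", (filename, features))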

That way, we won't have to handle the large list objects or the DataFrame at all.

You can pass the connection to each of the threads... you might have to do something a little weird like:

results = pool.map(get_image_features, zip(itertools.repeat(conn), f))

Then, after the calculation is complete, you can select everything from the database into whichever format you like, e.g. using pd.read_sql.
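
For example (a sketch; pd.read_sql accepts a sqlite3 connection directly, and if you created the database as a file rather than ':memory:', the intermediate results never have to live in RAM):

import pandas as pd

# Pull the accumulated rows back into a DataFrame at the end.
df = pd.read_sql("SELECT filename, features FROM images", conn)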


Hammer 2

Use a subprocess here: rather than running this in the same instance of Python, "shell out" to another one.

Since you can pass start and end to Python via sys.argv, you can slice on those:

# main.py
import subprocess
# a for loop to iterate over this (sketched below)
subprocess.check_call(["python", "chunk.py", "0", "20000"])

# chunk.py a b
import sys
for count, f in enumerate(file_list_chunks):
    if count < int(sys.argv[1]) or count > int(sys.argv[2]):
        continue  # skip chunks outside the requested range
    # do stuff

That way, the subprocess will properly clean up after itself (there's no way for memory to leak, since the process is terminated when it finishes).
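
A sketch of the driver loop, assuming chunk.py treats its two arguments as the start and end image indices it should handle:

# main.py
import subprocess

CHUNK_SIZE = 20000
TOTAL_IMAGES = 100000  # len(file_list_1)

for start in range(0, TOTAL_IMAGES, CHUNK_SIZE):
    # Each slice runs in a fresh interpreter, so its memory goes back to the OS
    # when that process exits.
    subprocess.check_call(["python", "chunk.py", str(start), str(start + CHUNK_SIZE)])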


My bet is that Hammer 1 is the way to go. It feels like you're gluing together a lot of data and reading it into Python lists unnecessarily, and using sqlite3 (or some other database) avoids that completely.

