Multiprocessing pool hangs when mapping across DataFrame chunks?

Pyt*_*oob 5 python multithreading numpy pandas

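I am trying to split a Pandas DataFrame into chunks and then run a function on each chunk in parallel (based on this example). The regular, non-chunked version works fine (if slowly), but for some reason the chunked version fails completely: the pool hangs at 0% CPU usage and the script never finishes. I have put together a quick reproducible example, in case anyone can suggest why this is not working.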

import pandas as pd
from multiprocessing import Pool
import numpy as np
import time

def samplefunction(dfinputlist):
    dfinputlist=dfinputlist*2
    return dfinputlist

def parallelize_dataframe(df, func):
    df_split = np.array_split(df, 2)
    pool = Pool(4)
    df = pd.concat(pool.map(func, df_split))
    pool.close()
    pool.join()
    return df

if __name__ == "__main__":
    dfinputlist = pd.DataFrame(np.random.randint(0,50,size=(100000000, 4)), columns=list('ABCD'))
    start=time.time()
    dfinputlist=samplefunction(dfinputlist)
    print('Finished Non-Parallel Version after '+ str(time.time()-start)+' seconds.')
    start=time.time()
    output=parallelize_dataframe(dfinputlist, samplefunction)
    print('Finished Parallel Version after '+ str(time.time()-start)+' seconds.')
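For comparison, here is a scaled-down sketch of the same split/map/concat pattern that may help check whether the hang depends on the size of the chunks being sent to the workers. That dependence is only an assumption here; nothing above confirms it. The names `double_chunk` and `split_frame`, and the `n_chunks` and `n_workers` parameters, are hypothetical and chosen for illustration. The frame has 100,000 rows rather than 100,000,000, and it is split into more, smaller chunks.

```python
import numpy as np
import pandas as pd
from multiprocessing import Pool


def double_chunk(chunk):
    # Same work as samplefunction above: double every value in the chunk.
    return chunk * 2


def split_frame(df, n_chunks):
    # Split by row position; each piece is an ordinary DataFrame slice.
    bounds = np.linspace(0, len(df), n_chunks + 1, dtype=int)
    return [df.iloc[lo:hi] for lo, hi in zip(bounds[:-1], bounds[1:])]


def parallelize_dataframe(df, func, n_chunks=8, n_workers=4):
    chunks = split_frame(df, n_chunks)
    # The context manager shuts the pool down even if map() raises.
    with Pool(n_workers) as pool:
        results = pool.map(func, chunks)
    return pd.concat(results)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    # Deliberately small (assumption: size matters for the hang).
    df = pd.DataFrame(rng.integers(0, 50, size=(100_000, 4)), columns=list("ABCD"))

    # Report how much data each worker receives per task.
    chunk_bytes = [int(c.memory_usage(index=True).sum()) for c in split_frame(df, 8)]
    print("largest chunk (bytes):", max(chunk_bytes))

    out = parallelize_dataframe(df, double_chunk)
    print("matches serial result:", out.equals(df * 2))
```

If this version finishes, increasing the row count step by step while watching the printed chunk size should show roughly where the behaviour changes.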