Python - How to process and manipulate files in a directory in parallel

Question

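Current scenario: I have 900 files in a directory called directoryA. The files are named file0.txt through file899.txt, and each is 15 MB in size. I loop through the files sequentially in Python. I load each file as a list, perform some operations, and write an output file to directoryB. When the loop ends, I have 900 files in directoryB, named out0.csv through out899.csv.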

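Problem: Each file takes 3 minutes to process, so the script runs for more than 40 hours (900 files × 3 minutes = 45 hours). I would like to run the process in parallel, since all the files are independent of each other (there are no interdependencies). My machine has 12 cores.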

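The script below runs sequentially. Please help me run it in parallel. I have looked at some of Python's parallel processing modules through related Stack Overflow questions, but I find them hard to understand because I don't have much exposure to Python. Thanks a lot.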

Pseudo-script

    from os import listdir 
    import csv

    mypath = "some/path/"

    inputDir = mypath + 'dirA/'
    outputDir = mypath + 'dirB/'

    for filename in listdir(inputDir):
        # load the text file as a list using the csv module
        # run a bunch of operations
        # regex the int from the filename, e.g. file1.txt returns 1 and file42.txt returns 42
        # write out a corresponding csv file in dirB, e.g. input file99.txt is written as out99.csv
        pass  # placeholder so the pseudo-script parses

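The filename step in the pseudo-script (file99.txt becomes out99.csv) can be sketched on its own. This is a minimal illustration, assuming input names always look like `file<N>.txt`; the helper name `output_name` is hypothetical and not part of the original script.

```python
import re


def output_name(input_name):
    """Map an input name such as 'file42.txt' to 'out42.csv'."""
    # Assumption: every input name has the form file<N>.txt.
    match = re.fullmatch(r"file(\d+)\.txt", input_name)
    if match is None:
        raise ValueError("unexpected file name: %r" % input_name)
    return "out%d.csv" % int(match.group(1))


print(output_name("file99.txt"))  # out99.csv
```

Raising on an unexpected name, rather than skipping it silently, makes stray files in the input directory visible early.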

Answer 1

小智 11

To make full use of your hardware cores, it is best to use the multiprocessing library.

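    from multiprocessing import Pool
    from os import listdir
    import csv

    # Defined at module level so the worker processes can see them too
    # (under the spawn start method, workers re-import this module).
    mypath = "some/path/"
    inputDir = mypath + 'dirA/'
    outputDir = mypath + 'dirB/'

    def process_file(filename):
        # filename is a bare name from listdir(); prepend inputDir to open it
        # load the text file as a list using the csv module
        # run a bunch of operations
        # regex the int from the filename, e.g. file1.txt returns 1 and file42.txt returns 42
        # write out a corresponding csv file in dirB, e.g. input file99.txt is written as out99.csv
        pass  # placeholder for the per-file work

    if __name__ == '__main__':
        p = Pool(12)  # one worker process per core
        p.map(process_file, listdir(inputDir))  # blocks until every file is processed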

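For readers who want the commented steps filled in, here is a minimal sketch of one way the worker could look. It assumes Python 3, comma-separated input that `csv.reader` can parse with its defaults, and input names ending in `<N>.txt`. The names `transform`, `build_jobs`, and `run_all` are hypothetical, and `transform` is only a stand-in for the asker's real operations.

```python
import csv
import os
import re
from multiprocessing import Pool


def transform(rows):
    # Hypothetical stand-in for "run a bunch of operations".
    return rows


def process_file(job):
    """Read one input file, transform it, and write the matching out<N>.csv."""
    in_path, out_dir = job
    # Pull the integer out of the name: file42.txt -> 42.
    number = int(re.search(r"(\d+)\.txt$", os.path.basename(in_path)).group(1))
    with open(in_path, newline="") as src:
        rows = list(csv.reader(src))
    out_path = os.path.join(out_dir, "out%d.csv" % number)
    with open(out_path, "w", newline="") as dst:
        csv.writer(dst).writerows(transform(rows))
    return out_path


def build_jobs(input_dir, output_dir):
    """Pair each .txt file's full path with the output directory."""
    return [(os.path.join(input_dir, name), output_dir)
            for name in sorted(os.listdir(input_dir))
            if name.endswith(".txt")]


def run_all(input_dir, output_dir, workers=12):
    """Process every input file using a pool of worker processes."""
    with Pool(workers) as pool:
        return pool.map(process_file, build_jobs(input_dir, output_dir))
```

To use it, call `run_all("some/path/dirA/", "some/path/dirB/")` from inside an `if __name__ == '__main__':` guard. Passing the paths inside each job, instead of reading globals, keeps the worker independent of how the child processes are started.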

multiprocessing documentation: https://docs.python.org/2/library/multiprocessing.html

Do we need to add p.join() or p.close() at the end? (2 upvotes)
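
On the comment's question: `map()` blocks until all results are ready, so the output is complete either way, but calling `close()` and then `join()` is the usual way to shut the pool down cleanly. A minimal sketch, where `square` and `run` are hypothetical stand-ins rather than the answer's code:

```python
from multiprocessing import Pool


def square(x):
    # Trivial stand-in for the per-file worker.
    return x * x


def run(values, workers=2):
    pool = Pool(workers)
    try:
        results = pool.map(square, values)  # blocks until every task is done
    finally:
        pool.close()  # no more tasks will be submitted
        pool.join()   # wait for the worker processes to exit
    return results


if __name__ == "__main__":
    print(run([1, 2, 3]))
```

`join()` must come after `close()` (or `terminate()`). In Python 3.3 and later, `with Pool(...) as pool:` is a shorter alternative; it calls `terminate()` on exit, which is safe once `map()` has returned.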
