I have multiple .gz files totaling about 1 TB. How can I decompress them in parallel using Python 2.7? Looping over the files one by one takes too much time.
I also tried this code:
    import glob, gzip, multiprocessing, shutil

    filenames = [gz for gz in glob.glob(filesFolder + '*.gz')]

    def uncompress(path):
        with gzip.open(path, 'rb') as src, open(path.rstrip('.gz'), 'wb') as dest:
            shutil.copyfileobj(src, dest)

    with multiprocessing.Pool() as pool:
        for _ in pool.imap_unordered(uncompress, filenames, chunksize=1):
            pass
But I get the following error:
    with multiprocessing.Pool() as pool:
    AttributeError: __exit__
Thanks!
To use a with construct, the object it manages must have __enter__ and __exit__ methods. The error tells you that Pool (the class, or rather its instances) does not have them in Python 2.7, so you cannot use it in a with statement; Pool only became a context manager in Python 3.3. Try this version with the with statement removed (it also fixes the glob pattern and the rstrip('.gz') bug, see the comments):
    import glob, gzip, multiprocessing, shutil

    filenames = glob.glob('./*.gz')  # all .gz files in the current directory

    def uncompress(path):
        # path[:-3] drops exactly the '.gz' suffix; rstrip('.gz') would strip
        # any trailing '.', 'g', 'z' characters, not the literal extension
        with gzip.open(path, 'rb') as src, open(path[:-3], 'wb') as dest:
            shutil.copyfileobj(src, dest)

    for _ in multiprocessing.Pool().imap_unordered(uncompress, filenames, chunksize=1):
        pass
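If you would rather keep the with-style cleanup on Python 2.7, a minimal sketch using the standard-library contextlib.closing wrapper (which supplies the missing __exit__ by calling pool.close() on exit) would be:

    import contextlib
    import glob, gzip, multiprocessing, shutil

    def uncompress(path):
        # decompress next to the source file, dropping the '.gz' suffix
        with gzip.open(path, 'rb') as src, open(path[:-3], 'wb') as dest:
            shutil.copyfileobj(src, dest)

    if __name__ == '__main__':
        filenames = glob.glob('./*.gz')
        # closing() calls pool.close() on exit, so no new tasks can be submitted
        with contextlib.closing(multiprocessing.Pool()) as pool:
            for _ in pool.imap_unordered(uncompress, filenames, chunksize=1):
                pass
        pool.join()  # wait for the worker processes to finish and exit

This keeps the resource-management pattern you were after without needing Python 3.3's native Pool context manager.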
EDIT
I agree with @dhke: unless all (or most) of the gz files are physically adjacent on disk, the frequent reads from different disk locations (which happen far more often under multiprocessing) will be slower than processing the files one by one, where each file is read sequentially.
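For example, a quick (hypothetical, unmeasured) way to check this on your own storage is to time a small worker pool against the plain sequential loop; the uncompress() function from above is assumed to be defined at module level so it can be pickled:

    import glob, multiprocessing, time

    if __name__ == '__main__':
        filenames = glob.glob('./*.gz')

        # parallel: on a seek-bound spinning disk, 2 workers may already be
        # too many; on an SSD, more workers usually help
        start = time.time()
        pool = multiprocessing.Pool(processes=2)
        for _ in pool.imap_unordered(uncompress, filenames, chunksize=1):
            pass
        pool.close()
        pool.join()
        print 'parallel (2 workers): %.1fs' % (time.time() - start)

        # sequential baseline: one contiguous read stream per file
        start = time.time()
        for path in filenames:
            uncompress(path)
        print 'sequential: %.1fs' % (time.time() - start)

With a 1 TB corpus, benchmark on a small subset of the files first before committing to either approach.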