Mut*_* Rg 13 python parallel-processing multiprocessing
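I am trying to process a file in which every line is a JSON document. The file size can range from a few MB to hundreds of MB, so I wrote a generator that fetches each document from the file line by line.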
import codecs
import json

def jl_file_iterator(file):
    with codecs.open(file, 'r', 'utf-8') as f:
        for line in f:
            document = json.loads(line)
            yield document
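My system has 4 cores, so I would like to process 4 lines of the file in parallel. Currently I have this code, which takes 4 lines at a time and hands them off for parallel processing: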
threads = 4
files, i = [], 1
for jl in jl_file_iterator(input_path):
    files.append(jl)
    if i % threads == 0:
        # pool.map(processFile, files)
        parallelProcess(files, o)
        files = []
    i += 1
if files:
    parallelProcess(files, o)
    files = []
This is the code where the actual processing happens:
def parallelProcess(files, outfile):
    processes = []
    for i in range(len(files)):
        p = Process(target=processFile, args=(files[i],))
        processes.append(p)
        p.start()
    for i in range(len(files)):
        processes[i].join()

def processFile(doc):
    extractors = {}
    # ... do some processing on doc
    o.write(json.dumps(doc) + '\n')
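As you can see, I wait for all 4 lines to finish processing before I send the next 4 for processing. What I would like instead is that, as soon as one process finishes its line, the next line is assigned to the processor that has just been freed. How do I do that?
PS: The problem is that, since this is a generator, I cannot load all the documents and use something like map to run the processes.
Thanks for your help.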
Tim*_*ers 14
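As @pvg said in a comment, a (bounded) queue is the natural way to mediate between a producer and consumers that run at different speeds, ensuring they all stay as busy as possible without letting the producer get far ahead.
Here is a self-contained, executable example. The queue is bounded to a maximum size equal to the number of worker processes. If the consumers run much faster than the producer, it could make good sense to let the queue grow bigger than that.
In your specific case, it would probably make sense to pass lines to the consumers and let them do the document = json.loads(line) part in parallel.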
import multiprocessing as mp

NCORE = 4

def process(q, iolock):
    from time import sleep
    while True:
        stuff = q.get()
        if stuff is None:
            break
        with iolock:
            print("processing", stuff)
        sleep(stuff)

if __name__ == '__main__':
    q = mp.Queue(maxsize=NCORE)
    iolock = mp.Lock()
    pool = mp.Pool(NCORE, initializer=process, initargs=(q, iolock))
    for stuff in range(20):
        q.put(stuff)  # blocks until q below its max size
        with iolock:
            print("queued", stuff)
    for _ in range(NCORE):  # tell workers we're done
        q.put(None)
    pool.close()
    pool.join()
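As a rough sketch of the suggestion to let the consumers do the JSON parsing, the same bounded-queue pattern can carry raw lines instead of parsed documents. This variant starts plain `mp.Process` workers rather than using a `Pool` initializer. The names `handle_line` and `worker`, the `processed` flag, and the in-memory `sample_lines` are assumptions made for illustration; they stand in for the real per-document work and for iterating over the input file.

```python
import json
import multiprocessing as mp

NCORE = 4


def handle_line(line):
    """Parse one JSON line and do the per-document work (placeholder)."""
    doc = json.loads(line)
    doc["processed"] = True  # hypothetical stand-in for the real processing
    return json.dumps(doc, sort_keys=True)


def worker(q, iolock):
    """Pull raw lines off the queue until the None sentinel arrives."""
    while True:
        line = q.get()
        if line is None:
            break
        result = handle_line(line)
        with iolock:  # serialize output so lines do not interleave
            print(result, flush=True)


if __name__ == "__main__":
    q = mp.Queue(maxsize=NCORE)
    iolock = mp.Lock()
    workers = [mp.Process(target=worker, args=(q, iolock)) for _ in range(NCORE)]
    for w in workers:
        w.start()

    # Assumed sample input; in practice this would be `for line in f`.
    sample_lines = ('{"id": %d}' % i for i in range(8))
    for line in sample_lines:
        q.put(line)  # blocks while the queue is full

    for _ in range(NCORE):  # one sentinel per worker
        q.put(None)
    for w in workers:
        w.join()
```

Because the producer only reads lines and the workers do both parsing and processing, a free worker picks up the next line as soon as it finishes the previous one.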
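So I finally got this running successfully, by splitting my file into chunks of lines and processing those chunks in parallel. I am posting it here so that it can be useful to somebody in the future.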
def run_parallel(self, processes=4):
    processes = int(processes)
    pool = mp.Pool(processes)
    try:
        jobs = []
        # submit one job per chunk of the file
        for chunkStart, chunkSize in self.chunkify(self.input_file):
            jobs.append(pool.apply_async(self.process_wrapper, (chunkStart, chunkSize)))
        # wait for every chunk; get() re-raises exceptions from the workers
        for job in jobs:
            job.get()
    except Exception as e:
        print(e)
    finally:
        pool.close()
        pool.join()

def process_wrapper(self, chunkStart, chunkSize):
    # binary mode, so the offsets are byte offsets that match chunkify
    with open(self.input_file, 'rb') as f:
        f.seek(chunkStart)
        lines = f.read(chunkSize).splitlines()
        for line in lines:
            document = json.loads(line)
            self.process_file(document)

# Splitting data into chunks for parallel processing
def chunkify(self, filename, size=1024*1024):
    fileEnd = os.path.getsize(filename)
    # binary mode: Python 3 does not allow relative seeks on text files
    with open(filename, 'rb') as f:
        chunkEnd = f.tell()
        while True:
            chunkStart = chunkEnd
            f.seek(size, 1)
            f.readline()  # move forward to the end of the current line
            chunkEnd = f.tell()
            yield chunkStart, chunkEnd - chunkStart
            if chunkEnd >= fileEnd:
                break
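The chunking idea can be checked in isolation, without any worker processes. The following is a minimal sketch, assuming a standalone function instead of a method: `chunk_offsets`, `read_chunk`, and `demo` are hypothetical names, and the temporary file with 100 small JSON lines is made-up test data. It splits the file on line boundaries and confirms that reading the chunks back recovers every line exactly once.

```python
import os
import tempfile


def chunk_offsets(path, size=1024 * 1024):
    """Yield (start, length) pairs whose boundaries fall on line ends."""
    file_end = os.path.getsize(path)
    with open(path, "rb") as f:  # binary mode: offsets are byte offsets
        chunk_end = 0
        while chunk_end < file_end:
            chunk_start = chunk_end
            f.seek(chunk_start + size)
            f.readline()  # advance to the end of the current line
            chunk_end = min(f.tell(), file_end)
            yield chunk_start, chunk_end - chunk_start


def read_chunk(path, start, length):
    """Return the complete lines stored in one chunk."""
    with open(path, "rb") as f:
        f.seek(start)
        return f.read(length).splitlines()


def demo():
    """Write sample lines, chunk them, and check nothing is lost or repeated."""
    lines = [('{"n": %d}' % i).encode() for i in range(100)]
    with tempfile.NamedTemporaryFile(delete=False, suffix=".jl") as tmp:
        tmp.write(b"\n".join(lines) + b"\n")
        path = tmp.name
    try:
        recovered = []
        for start, length in chunk_offsets(path, size=64):
            recovered.extend(read_chunk(path, start, length))
        return recovered == lines
    finally:
        os.remove(path)


if __name__ == "__main__":
    print(demo())
```

Each `(start, length)` pair is small and picklable, so it can be handed to a worker that opens the file itself, which avoids sending the line contents between processes.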
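Tim Peters's answer is great.
But my specific case was slightly different, and I had to modify his answer to fit my needs. I am referencing it here.
This answers @CpILL's question in the comments.
In my case, I used a chain of generators (to create a pipeline).
Within this chain of generators, one of them was doing heavy computation, slowing down the whole pipeline.
Something like this: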
def fast_generator1():
    # `file` is assumed to be an already-open file object
    for line in file:
        yield line

def slow_generator(lines):
    for line in lines:
        yield heavy_processing(line)

def fast_generator2(lines):
    for line in lines:
        yield fast_func(line)

if __name__ == "__main__":
    lines = fast_generator1()
    lines = slow_generator(lines)
    lines = fast_generator2(lines)
    for line in lines:
        print(line)
To make it faster, we have to run the slow generator using multiple processes.
The modified code looks like this:
import multiprocessing as mp

NCORE = 4

def fast_generator1():
    # `file` is assumed to be an already-open file object
    for line in file:
        yield line

def slow_generator(lines):
    # Note: the nested functions and the generator passed through initargs
    # cannot be pickled, so this relies on the "fork" start method.
    def gen_to_queue(input_q, lines):
        # This function simply consumes our generator and writes it to the input queue
        for line in lines:
            input_q.put(line)
        for _ in range(NCORE):  # Once the generator is consumed, send the end signal
            input_q.put(None)

    def process(input_q, output_q):
        while True:
            line = input_q.get()
            if line is None:
                output_q.put(None)
                break
            output_q.put(heavy_processing(line))

    input_q = mp.Queue(maxsize=NCORE * 2)
    output_q = mp.Queue(maxsize=NCORE * 2)

    # Here we need 3 groups of workers:
    # * One that consumes the input generator and puts it into a queue. This is `gen_pool`.
    #   It is fine to have only 1 process doing this, since it is a very light task.
    # * One that does the main processing. This is `pool`.
    # * One that reads the results and yields them back, to keep this a generator.
    #   The main thread does it.
    gen_pool = mp.Pool(1, initializer=gen_to_queue, initargs=(input_q, lines))
    pool = mp.Pool(NCORE, initializer=process, initargs=(input_q, output_q))

    finished_workers = 0
    while True:
        line = output_q.get()
        if line is None:
            finished_workers += 1
            if finished_workers == NCORE:
                break
        else:
            yield line

def fast_generator2(lines):
    for line in lines:
        yield fast_func(line)

if __name__ == "__main__":
    lines = fast_generator1()
    lines = slow_generator(lines)
    lines = fast_generator2(lines)
    for line in lines:
        print(line)
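With this implementation we have a multiprocess generator: it is used exactly like the other generators (as in the first example of this answer), but all the heavy computation is done using multiprocessing, which speeds it up!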
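For platforms where the "fork" start method is unavailable, one possible variant keeps the generator in the parent process and feeds the input queue from a background thread, so that only module-level functions and queues cross the process boundary. This is a sketch under assumptions, not the code from the answer above: `parallel_map_unordered`, `_worker`, and `_feed` are hypothetical names, and `heavy_processing` is a trivial placeholder for the expensive step. As with the version above, results come back in completion order rather than input order.

```python
import multiprocessing as mp
import threading

NCORE = 4


def heavy_processing(line):
    # Hypothetical stand-in for the expensive step.
    return line.upper()


def _worker(input_q, output_q):
    """Process items until the None sentinel, then report completion."""
    while True:
        item = input_q.get()
        if item is None:
            output_q.put(None)
            break
        output_q.put(heavy_processing(item))


def _feed(lines, input_q, n_workers):
    """Drain the source iterable into the queue, then send one sentinel per worker."""
    for line in lines:
        input_q.put(line)
    for _ in range(n_workers):
        input_q.put(None)


def parallel_map_unordered(lines, n_workers=NCORE):
    """Yield heavy_processing(line) for each line, in completion order."""
    input_q = mp.Queue(maxsize=n_workers * 2)
    output_q = mp.Queue(maxsize=n_workers * 2)
    workers = [mp.Process(target=_worker, args=(input_q, output_q))
               for _ in range(n_workers)]
    for w in workers:
        w.start()
    # The feeder runs in the parent, so the generator is never pickled.
    feeder = threading.Thread(target=_feed, args=(lines, input_q, n_workers),
                              daemon=True)
    feeder.start()
    finished = 0
    while finished < n_workers:
        result = output_q.get()
        if result is None:
            finished += 1
        else:
            yield result
    feeder.join()
    for w in workers:
        w.join()


if __name__ == "__main__":
    source = ("line %d" % i for i in range(10))
    for out in parallel_map_unordered(source):
        print(out)
```

The generator has to be consumed to the end for the workers to shut down cleanly; abandoning it early would leave the worker processes waiting on the queue.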