Jor*_*lva 6 python parallel-processing multiprocessing python-multiprocessing
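I have been trying to use Python's multiprocessing module to parallelize a computationally expensive task.
I can run my code, but it does not run in parallel. I have been reading the multiprocessing manual page and forums to find out why it isn't working, and I haven't figured it out yet.
I think the problem may be related to some kind of lock on executing the other modules that I created and imported.
Here is my code: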
main.py:
##import my modules
import time

import prepare_data
import filter_part
import wrapper_part
import utils
from myClasses import ML_set
from myClasses import data_instance

n_proc = 5

def main():
    if __name__ == '__main__':
        ##only main process should run this
        data = prepare_data.import_data() ##read data from file
        data = prepare_data.remove_and_correct_outliers(data)
        data = prepare_data.normalize_data_range(data)
        features = filter_part.filter_features(data)

        start_t = time.time()
        ##parallelism will be used on this part
        best_subset = wrapper_part.wrapper(n_proc, data, features)
        print time.time() - start_t

main()
wrapper_part.py:
##my modules
import multiprocessing as mp

from myClasses import ML_set
from myClasses import data_instance
import utils

def wrapper(n_proc, data, features):
    p_work_list = utils.divide_features(n_proc-1, features)
    n_train, n_test = utils.divide_data(data)

    workers = []
    for i in range(0,n_proc-1):
        print "sending process:", i
        p = mp.Process(target=worker_classification, args=(i, p_work_list[i], data, features, n_train, n_test))
        workers.append(p)
        p.start()

    for worker in workers:
        print "waiting for join from worker"
        worker.join()

    return

def worker_classification(id, work_list, data, features, n_train, n_test):
    print "Worker ", id, " starting..."
    best_acc = 0
    best_subset = []
    while (work_list != []):
        test_subset = work_list[0]
        del(work_list[0])
        train_set, test_set = utils.cut_dataset(n_train, n_test, data, test_subset)
        _, acc = classification_decision_tree(train_set, test_set)
        if acc > best_acc:
            best_acc = acc
            best_subset = test_subset
    print id, " found best subset -> ", best_subset, " with accuracy: ", best_acc
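None of the other modules use the multiprocessing module, and they all work fine. At this stage I am only testing the parallelism and not even trying to get results back, so there is no communication between processes and no shared-memory variables. Each process uses some variables, but they are defined before the processes are spawned, so as far as I know each process gets its own copy of them.
With n_proc = 5, I get this output: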
importing data from file...
sending process: 0
sending process: 1
Worker 0 starting...
0 found best subset -> [2313] with accuracy: 60.41
sending process: 2
Worker 1 starting...
1 found best subset -> [3055] with accuracy: 60.75
sending process: 3
Worker 2 starting...
2 found best subset -> [3977] with accuracy: 62.8
waiting for join from worker
waiting for join from worker
waiting for join from worker
waiting for join from worker
Worker 3 starting...
3 found best subset -> [5770] with accuracy: 60.07
55.4430000782
With 4 processes, the parallel part takes about 55 seconds. Testing with only 1 process, the execution time is 16 seconds:
importing data from file...
sending process: 0
waiting for join from worker
Worker 0 starting...
0 found best subset -> [5870] with accuracy: 63.32
16.4409999847
I am running this with Python 2.7 on Windows 8.
EDIT
I tested my code on Ubuntu and it worked, so I guess the problem is related to Windows 8 and Python. Here is the output on Ubuntu:
importing data from file...
size trainset: 792 size testset: 302
sending process: 0
sending process: 1
Worker 0 starting...
sending process: 2
Worker 1 starting...
sending process: 3
Worker 2 starting...
waiting for join from worker
Worker 3 starting...
2 found best subset -> [5199] with accuracy: 60.93
1 found best subset -> [3198] with accuracy: 60.93
0 found best subset -> [1657] with accuracy: 61.26
waiting for join from worker
waiting for join from worker
waiting for join from worker
3 found best subset -> [5985] with accuracy: 62.25
6.1428809166
I will use Ubuntu for testing from now on, but I would still like to know why the code doesn't work on Windows.
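Make sure to read the Windows guidelines in the multiprocessing manual: https://docs.python.org/2/library/multiprocessing.html#windows
In particular, "Safe importing of main module":
Instead one should protect the "entry point" of the program by using if __name__ == '__main__': as follows:
You violated this rule in the first code snippet shown above, so I did not look any further. Hopefully the solution to the problem you observe is as simple as including this protection.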
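As a rough illustration of the guard quoted above, here is a minimal, self-contained sketch. It is not the asker's program: `worker` and `split_work` are hypothetical stand-ins for `worker_classification` and `utils.divide_features`, and the code uses Python 3 print syntax rather than the Python 2.7 of the question. The point is only that the call to `main()` sits under the guard at module level, not inside the function.

```python
import multiprocessing as mp
import time


def worker(worker_id, items):
    # Hypothetical stand-in for worker_classification: a toy CPU-bound job.
    total = sum(x * x for x in items)
    print("worker", worker_id, "finished with", total)


def split_work(items, n_parts):
    # Hypothetical stand-in for utils.divide_features:
    # deal the items round-robin into n_parts lists.
    return [items[i::n_parts] for i in range(n_parts)]


def main(n_proc=4):
    chunks = split_work(list(range(100000)), n_proc)
    start_t = time.time()
    workers = [mp.Process(target=worker, args=(i, chunk))
               for i, chunk in enumerate(chunks)]
    for p in workers:
        p.start()
    for p in workers:
        p.join()
    print("elapsed:", time.time() - start_t)


if __name__ == '__main__':
    # Only the process started from the command line gets here.
    # A child that re-imports this file sees a different __name__
    # and therefore never calls main() again.
    main()
```

With the guard in this position, importing the file has no side effects beyond defining functions, which is what a Windows child process needs.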
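The reason this matters: on Unix-like systems, child processes are created by forking. In this case the operating system creates an exact copy of the process that creates the fork. That is, all state is inherited by the child from the parent. For example, this means that all functions and classes are already defined.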
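The inheritance described above can be observed directly with `os.fork()`. This is a Unix-only sketch (the call does not exist on Windows), and the names `settings` and `fork_and_check` are made up for the illustration. The child prints a value that was set in the parent before the fork, without importing or re-running anything.

```python
import os
import sys

# State built up in the parent before any child exists.
settings = {"n_proc": 5}


def fork_and_check():
    sys.stdout.flush()      # avoid duplicating buffered output in the child
    pid = os.fork()         # Unix only; not available on Windows
    if pid == 0:
        # Child: `settings` is already defined, nothing was re-imported.
        try:
            print("child sees", settings)
            sys.stdout.flush()
        finally:
            os._exit(0)     # leave immediately, skipping the parent's code
    _, status = os.waitpid(pid, 0)
    return status           # 0 if the child exited cleanly


if __name__ == '__main__':
    print("child exit status:", fork_and_check())
```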
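On Windows, there is no such system call. Python needs to perform the rather heavy task of creating a fresh Python interpreter session in the child and re-creating (step by step) the state of the parent. For example, all functions and classes need to be defined again. That is why heavy import machinery runs under the hood of a Python multiprocessing child on Windows. This machinery kicks in when the child imports the main module. In your case, this implies a call to main() in the child! Surely you do not want that.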
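The re-import can be made visible on any platform by asking for the `spawn` start method, which is what Windows uses. This sketch assumes Python 3.4 or later, where `multiprocessing.get_context` is available (the Python 2.7 of the question has no such switch), and `describe` and `child` are hypothetical names. The module-level `print` runs once in the parent and once more in the child, where CPython loads the file under the name `__mp_main__` instead of `__main__`.

```python
import multiprocessing as mp
import os


def describe():
    # Report the module name and process this code is running under.
    return "__name__=%s pid=%d" % (__name__, os.getpid())


# Module-level statement: executed by the parent, and executed again
# by every spawned child when it re-imports this file.
print("module loaded:", describe())


def child():
    print("child target running:", describe())


if __name__ == '__main__':
    # Request Windows-style process creation explicitly.
    ctx = mp.get_context('spawn')
    p = ctx.Process(target=child)
    p.start()
    p.join()
```

Any unguarded call at module level, such as the `main()` at the bottom of the question's main.py, would be repeated in the child in the same way as that `print`.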
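You may find this tedious. I find it impressive that the multiprocessing module manages to provide an interface to the same functionality on two such different platforms. Really, with respect to process handling, POSIX-compliant operating systems and Windows are so different that it is difficult to come up with an abstraction that works on both.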