use*_*827 2 python loops pandas
I have a DataFrame on which I perform some operations and write out the results. To do this, I have to iterate over every row.
for count, row in final_df.iterrows():
    x = row['param_a']
    y = row['param_b']
    # Perform operation
    # Write to output file
I decided to parallelize this using Python's multiprocessing module:
import multiprocessing

def write_site_files(row, pkg_num):
    x = row['param_a']
    y = row['param_b']
    # Perform operation
    # Write to output file

pkg_num = 0
total_runs = final_df.shape[0]  # Total number of rows in final_df
threads = []  # Worker processes that are still running
# num_proc (maximum number of concurrent processes) is defined elsewhere
while pkg_num < total_runs or len(threads):
    if len(threads) < num_proc and pkg_num < total_runs:
        print(pkg_num, total_runs)
        t = multiprocessing.Process(target=write_site_files, args=[final_df.iloc[pkg_num], pkg_num])
        pkg_num = pkg_num + 1
        t.start()
        threads.append(t)
    else:
        # Iterate over a copy so that removing finished processes is safe
        for thread in threads[:]:
            if not thread.is_alive():
                threads.remove(thread)
However, the latter (parallelized) approach is slower than the simple iteration-based approach. Is there something I'm missing?
Thanks!
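This will be far less efficient than doing the work in a single process unless the actual operation takes a lot of time, on the order of seconds per row.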
Parallelization is normally the last tool in the box. After profiling, after local vectorization, after local optimization, then you parallelize.
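As an illustration of the vectorization step, here is a minimal sketch that compares a row-by-row loop with the same arithmetic applied to whole columns. The `combine` function and the sample data are hypothetical stand-ins for the real per-row operation, which the question does not show; only the column names `param_a` and `param_b` come from the question.

```python
import pandas as pd

# Hypothetical sample data; only the column names come from the question.
final_df = pd.DataFrame({"param_a": [1.0, 2.0, 3.0],
                         "param_b": [10.0, 20.0, 30.0]})


def combine(x, y):
    # Placeholder for the real operation (an assumption for illustration).
    # It uses only arithmetic, so it works on scalars and on whole columns.
    return x * y + 1.0


# Row-by-row version: one Python-level call per row.
row_results = [combine(row["param_a"], row["param_b"])
               for _, row in final_df.iterrows()]

# Vectorized version: one call operating on entire columns at once.
vec_results = combine(final_df["param_a"], final_df["param_b"])
```

This only applies when the operation can be written in terms of column arithmetic; if each row genuinely needs its own file write, the loop stays, but the computation feeding it can often still be vectorized.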
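You are spending time just doing the slicing, then spinning up a new process for each row (which is generally a constant overhead), then pickling a single row (it is not clear from your example how big a row is).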
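To get a rough feel for these overheads, one could measure them directly. The sketch below is only an assumption-laden illustration: `noop`, `pickled_size`, and `time_process_per_row` are hypothetical helper names, and the data is made up. Note that whether the row is actually pickled depends on the multiprocessing start method (with `fork` the arguments are inherited by the child, with `spawn` they are pickled).

```python
import multiprocessing
import pickle
import time

import pandas as pd


def noop(_row):
    """Stand-in worker that does nothing, so only the overhead is measured."""
    return None


def pickled_size(obj):
    """Number of bytes needed to serialize `obj` for another process."""
    return len(pickle.dumps(obj, protocol=pickle.HIGHEST_PROTOCOL))


def time_process_per_row(df):
    """Average seconds spent starting and joining one process per row."""
    start = time.perf_counter()
    for i in range(len(df)):
        proc = multiprocessing.Process(target=noop, args=(df.iloc[i],))
        proc.start()
        proc.join()
    return (time.perf_counter() - start) / len(df)


if __name__ == "__main__":
    # Hypothetical data; only the column names come from the question.
    final_df = pd.DataFrame({"param_a": range(20), "param_b": range(20)})
    print("bytes per pickled row:", pickled_size(final_df.iloc[0]))
    print("seconds per process:", time_process_per_row(final_df))
```

If the per-process time is comparable to or larger than the time the real operation takes on one row, the one-process-per-row design cannot win.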
At the very least, you should chunk the rows, e.g. df.iloc[i*chunksize:(i+1)*chunksize].
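A minimal sketch of the chunking idea, assuming a `multiprocessing.Pool` is acceptable in place of hand-managed `Process` objects. The helper names `process_chunk` and `split_into_chunks`, the chunk size, the worker count, and the addition used as the per-row operation are all hypothetical; the real operation and file writing from the question would go inside `process_chunk`.

```python
import multiprocessing

import pandas as pd


def process_chunk(chunk):
    """Handle one block of rows inside a single worker process."""
    results = []
    # itertuples is generally cheaper than iterrows for plain row access.
    for row in chunk.itertuples(index=False):
        # Placeholder operation (an assumption); the real work goes here.
        results.append(row.param_a + row.param_b)
    return results


def split_into_chunks(df, chunksize):
    """Return consecutive row slices of at most `chunksize` rows each."""
    return [df.iloc[start:start + chunksize]
            for start in range(0, len(df), chunksize)]


if __name__ == "__main__":
    # Hypothetical data; only the column names come from the question.
    final_df = pd.DataFrame({"param_a": range(10), "param_b": range(10, 20)})
    chunks = split_into_chunks(final_df, chunksize=4)
    # A fixed pool of workers is reused, so process start-up cost is paid
    # once per worker rather than once per row.
    with multiprocessing.Pool(processes=2) as pool:
        per_chunk = pool.map(process_chunk, chunks)
    flat = [value for chunk_result in per_chunk for value in chunk_result]
    print(flat)
```

Each task now carries many rows, so slicing, pickling, and scheduling costs are spread over the whole chunk. Whether this beats the plain loop still depends on how expensive the per-row operation is.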
Hopefully apply will get some support for parallel execution in 0.14; see here: https://github.com/pydata/pandas/issues/5751