同时运行两个 dask-ml 输入器而不是依次运行

ps0*_*604 6 python pandas dask dask-ml

我可以像这样使用 dask-ml 估算平均值和最频繁的值,这很好用:

mean_imputer = impute.SimpleImputer(strategy='mean')
most_frequent_imputer = impute.SimpleImputer(strategy='most_frequent')
data = [[100, 2, 5], [np.nan, np.nan, np.nan], [70, 7, 5]]
df = pd.DataFrame(data, columns = ['Weight', 'Age', 'Height']) 
df.iloc[:, [0,1]] = mean_imputer.fit_transform(df.iloc[:,[0,1]])
df.iloc[:, [2]] = most_frequent_imputer.fit_transform(df.iloc[:,[2]])
print(df)


    Weight  Age   Height
0   100.0   2.0   5.0
1   85.0    4.5   5.0
2   70.0    7.0   5.0
Run Code Online (Sandbox Code Playgroud)

但是,如果我有 1 亿行数据,那么当 dask 可以只执行一个循环时,它似乎会执行两个循环,是否可以同时和/或并行而不是按顺序运行两个输入器?实现这一目标的示例代码是什么?

Avi*_*ana 2

如果实体彼此独立,您可以按照文档和Dask 教程中的建议使用dask.delayed来并行计算。

你的代码看起来像:

from dask.distributed import Client

client = Client(n_workers=4)

from dask import delayed
import numpy as np
import pandas as pd
from dask_ml import impute

mean_imputer = impute.SimpleImputer(strategy='mean')
most_frequent_imputer = impute.SimpleImputer(strategy='most_frequent')

def fit_transform_mi(d):
    return mean_imputer.fit_transform(d)
def fit_transform_mfi(d):
    return most_frequent_imputer.fit_transform(d)
def setdf(a,b,df):
    df.iloc[:, [0,1]]=a
    df.iloc[:, [2]]=b
    return df

data = [[100, 2, 5], [np.nan, np.nan, np.nan], [70, 7, 5]]
df = pd.DataFrame(data, columns = ['Weight', 'Age', 'Height']) 
a = delayed(fit_transform_mi)(df.iloc[:,[0,1]])
b = delayed(fit_transform_mfi)(df.iloc[:,[2]])
c = delayed(setdf)(a,b,df)
df= c.compute()
print(df)
client.close()
Run Code Online (Sandbox Code Playgroud)

c 对象是一个惰性 Delayed 对象。该对象包含计算最终结果所需的所有内容,包括对所需所有函数及其输入和相互关系的引用。