ps0*_*604 6 python pandas dask dask-ml
我可以像这样使用 dask-ml 估算平均值和最频繁的值,这很好用:
mean_imputer = impute.SimpleImputer(strategy='mean')
most_frequent_imputer = impute.SimpleImputer(strategy='most_frequent')
data = [[100, 2, 5], [np.nan, np.nan, np.nan], [70, 7, 5]]
df = pd.DataFrame(data, columns = ['Weight', 'Age', 'Height'])
df.iloc[:, [0,1]] = mean_imputer.fit_transform(df.iloc[:,[0,1]])
df.iloc[:, [2]] = most_frequent_imputer.fit_transform(df.iloc[:,[2]])
print(df)
Weight Age Height
0 100.0 2.0 5.0
1 85.0 4.5 5.0
2 70.0 7.0 5.0
Run Code Online (Sandbox Code Playgroud)
但是,如果我有 1 亿行数据,那么当 dask 可以只执行一个循环时,它似乎会执行两个循环,是否可以同时和/或并行而不是按顺序运行两个输入器?实现这一目标的示例代码是什么?
如果实体彼此独立,您可以按照文档和Dask 教程中的建议使用dask.delayed来并行计算。
你的代码看起来像:
from dask.distributed import Client
client = Client(n_workers=4)
from dask import delayed
import numpy as np
import pandas as pd
from dask_ml import impute
mean_imputer = impute.SimpleImputer(strategy='mean')
most_frequent_imputer = impute.SimpleImputer(strategy='most_frequent')
def fit_transform_mi(d):
return mean_imputer.fit_transform(d)
def fit_transform_mfi(d):
return most_frequent_imputer.fit_transform(d)
def setdf(a,b,df):
df.iloc[:, [0,1]]=a
df.iloc[:, [2]]=b
return df
data = [[100, 2, 5], [np.nan, np.nan, np.nan], [70, 7, 5]]
df = pd.DataFrame(data, columns = ['Weight', 'Age', 'Height'])
a = delayed(fit_transform_mi)(df.iloc[:,[0,1]])
b = delayed(fit_transform_mfi)(df.iloc[:,[2]])
c = delayed(setdf)(a,b,df)
df= c.compute()
print(df)
client.close()
Run Code Online (Sandbox Code Playgroud)
c 对象是一个惰性 Delayed 对象。该对象包含计算最终结果所需的所有内容,包括对所需所有函数及其输入和相互关系的引用。
| 归档时间: |
|
| 查看次数: |
249 次 |
| 最近记录: |