Jav*_*dra 14 python pipeline scikit-learn
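I'm new to pipelines in sklearn and have run into this problem: I have a dataset with a mix of text and numbers, i.e. some columns contain only text while the rest contain integers (or floats).
I was wondering whether it is possible to build a pipeline in which I can call LabelEncoder() on the text features and MinMaxScaler() on the numeric columns. The examples I've seen online mostly apply LabelEncoder() to the entire dataset rather than to selected columns. Is this possible? If so, any pointers would be greatly appreciated.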
max*_*moo 23
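The way I usually do this is with a FeatureUnion, using a FunctionTransformer to pull out the relevant columns.
Important notes:
You have to define your functions with def, because you can't use lambda or partial in a FunctionTransformer if you want to pickle your model.
You need to initialize FunctionTransformer with validate=False.
Something like this: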
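from sklearn.pipeline import make_union, make_pipeline
from sklearn.preprocessing import FunctionTransformer, MinMaxScaler, OrdinalEncoder

def get_text_cols(df):
    # Select the text (categorical) columns
    return df[['name', 'fruit']]

def get_num_cols(df):
    # Select the numeric columns
    return df[['height', 'age']]

# Note: LabelEncoder expects a 1-D target and raises a TypeError inside a Pipeline,
# so OrdinalEncoder (its counterpart for feature columns) is used here instead.
vec = make_union(
    make_pipeline(FunctionTransformer(get_text_cols, validate=False), OrdinalEncoder()),
    make_pipeline(FunctionTransformer(get_num_cols, validate=False), MinMaxScaler()),
)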
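To make the notes above concrete, here is a minimal sketch that runs the union on a small made-up DataFrame and then checks the pickling behaviour. The toy data is hypothetical, and OrdinalEncoder is swapped in for LabelEncoder on the assumption that the goal is to integer-encode feature columns (LabelEncoder only accepts a 1-D target).

```python
import pickle

import pandas as pd
from sklearn.pipeline import make_pipeline, make_union
from sklearn.preprocessing import FunctionTransformer, MinMaxScaler, OrdinalEncoder

# Hypothetical toy data; the column names follow the answer above.
df = pd.DataFrame({
    "name": ["ann", "bob", "cid"],
    "fruit": ["apple", "pear", "apple"],
    "height": [150.0, 180.0, 165.0],
    "age": [20, 40, 30],
})

def get_text_cols(frame):
    return frame[["name", "fruit"]]

def get_num_cols(frame):
    return frame[["height", "age"]]

vec = make_union(
    make_pipeline(FunctionTransformer(get_text_cols, validate=False), OrdinalEncoder()),
    make_pipeline(FunctionTransformer(get_num_cols, validate=False), MinMaxScaler()),
)
# Two integer-encoded text columns followed by two scaled numeric columns.
features = vec.fit_transform(df)

# Functions defined with def at module level are pickled by reference,
# so the fitted union survives a round trip.
restored = pickle.loads(pickle.dumps(vec))

# A lambda has no importable name, so pickling the transformer fails.
try:
    pickle.dumps(FunctionTransformer(lambda frame: frame[["name"]], validate=False))
    lambda_pickles = True
except (pickle.PicklingError, AttributeError):
    lambda_pickles = False
```

If the model has to be saved with pickle or joblib, this is the reason to keep the column selectors as named module-level functions.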
LC1*_*117 10
An example using ColumnTransformer may help you:
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# COUNTIES_OF_INTEREST, MyW2VTransformer, KerasClassifier and create_model
# are defined or imported in the full script linked below.

# FOREGOING TRANSFORMATIONS ON 'data' ...
# filter data
data = data[data['county'].isin(COUNTIES_OF_INTEREST)]

# define the feature encoding of the data
# (scikit-learn >= 1.2 renames OneHotEncoder's `sparse` to `sparse_output`)
impute_and_one_hot_encode = Pipeline([
    ('impute', SimpleImputer(strategy='most_frequent')),
    ('encode', OneHotEncoder(sparse=False, handle_unknown='ignore'))
])
featurisation = ColumnTransformer(transformers=[
    ("impute_and_one_hot_encode", impute_and_one_hot_encode, ['smoker', 'county', 'race']),
    ('word2vec', MyW2VTransformer(min_count=2), ['last_name']),
    ('numeric', StandardScaler(), ['num_children', 'income'])
])

# define the training pipeline for the model
neural_net = KerasClassifier(build_fn=create_model, epochs=10, batch_size=1, verbose=0, input_dim=109)
pipeline = Pipeline([
    ('features', featurisation),
    ('learner', neural_net)])

# train-test split
train_data, test_data = train_test_split(data, random_state=0)

# model training
model = pipeline.fit(train_data, train_data['label'])
You can find the full code at: https://github.com/stefan-grafberger/mlinspect/blob/19ca0d6ae8672249891835190c9e2d9d3c14f28f/example_pipelines/healthcare/healthcare.py
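For the setup described in the question (text columns plus numeric columns), a much smaller ColumnTransformer may be enough. The following is only a sketch: the DataFrame and its column names are made up for illustration, and OrdinalEncoder stands in for LabelEncoder because the latter is meant for a 1-D target rather than feature columns.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import MinMaxScaler, OrdinalEncoder

# Hypothetical toy data with two text columns and two numeric columns.
df = pd.DataFrame({
    "name": ["ann", "bob", "cid"],
    "fruit": ["apple", "pear", "apple"],
    "height": [150.0, 180.0, 165.0],
    "age": [20, 40, 30],
})

# Each entry is (step name, transformer, columns it applies to).
preprocess = ColumnTransformer(transformers=[
    ("text", OrdinalEncoder(), ["name", "fruit"]),
    ("numeric", MinMaxScaler(), ["height", "age"]),
])

# Output columns follow the order of the transformers list:
# encoded name, encoded fruit, scaled height, scaled age.
encoded = preprocess.fit_transform(df)
```

The resulting `preprocess` object can be used as the first step of a Pipeline in the same way as `featurisation` in the example above.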