tud*_*dou 7 python-3.x scikit-learn
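I have a simple model built as a Pipeline that uses a ColumnTransformer.
I am able to train the model and save it as a pickle.
When I load the pickle and predict on real-time data, I get the following error from ColumnTransformer:
Column ordering must be equal for fit and for transform when using the remainder keyword
The training data and the data used for prediction have exactly the same number of columns, e.g. 50. I am not sure how the "ordering" of the columns could have changed.
Why does column ordering matter for ColumnTransformer? How can I fix this? Is there a way to guarantee the "ordering" after running the ColumnTransformer?
Thanks.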
pipeline = Pipeline([
    ('RepalceInf', ReplaceInf()),
    ('impute_30_100', ColumnTransformer(
        [
            ('oneStdNorm', OneStdImputer(), self.cont_feature_strategy_dict['FEATS_30_100']),
        ],
        remainder='passthrough'
    )),
    ('regress_impute', IterativeImputer(random_state=0, estimator=self.cont_estimator)),
    ('replace_outlier', OutlierReplacer(quantile_range=(1, 99))),
    ('scaler', StandardScaler(with_mean=True))
])
class OneStdImputer(TransformerMixin, BaseEstimator):
    def __init__(self):
        """
        Impute the missing data with random value in the range of mean +/- one standard deviation
        This is a simplified implementation without sparse/dense fit and check.
        """
        self.mean = None
        self.std = None

    def fit(self, X, y=None):
        self.mean = X.mean()
        self.std = X.std()
        return self

    def transform(self, X):
        # X_imp = X.fillna(np.random.randint()*2*self.std+self.mean-self.std)
        for col in X:
            self._fill_randnorm(X[col], col)
        return X

    def _fill_randnorm(self, df, col):
        val = df.values
        mask = np.isnan(df)
        mu, sigma = self.mean[col], self.std[col]
        val[mask] = np.random.normal(mu, sigma, size=mask.sum())
        return df
小智 4
You can use df_new = pd.DataFrame(df_origin, columns=df_train.columns) to make sure the data you predict on has the same columns, in the same order, as the training data.
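As a minimal sketch of that idea, the helper below reorders live data to the training column order and fails loudly when a column is absent. The function name align_columns and the toy frames df_train and df_live are hypothetical, introduced only for illustration. As far as I understand, the plain constructor call would instead fill an absent column with NaN without complaining, which is why this version checks first.

```python
import pandas as pd


def align_columns(df_new, train_columns):
    """Return df_new restricted to the training columns, in training order.

    Hypothetical helper for illustration; not part of pandas or scikit-learn.
    """
    missing = [c for c in train_columns if c not in df_new.columns]
    if missing:
        raise ValueError(f"Missing columns at predict time: {missing}")
    # Selecting with a list of labels both reorders columns and drops extras.
    return df_new.loc[:, list(train_columns)]


# Toy data: same columns, different order.
df_train = pd.DataFrame({"a": [1.0, 2.0], "b": [3.0, 4.0], "c": [5.0, 6.0]})
df_live = pd.DataFrame({"c": [7.0], "a": [8.0], "b": [9.0]})

aligned = align_columns(df_live, df_train.columns)
print(list(aligned.columns))
```

One option is to store the list of training columns next to the pickled pipeline, so that the same alignment can be applied just before calling predict.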
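From the example below, it appears that ColumnTransformer identifies the selected columns by their positional index. (You can also select columns by exact name, but I think those names are converted to positions as well.)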
>>> import numpy as np
>>> from sklearn.compose import ColumnTransformer
>>> from sklearn.preprocessing import Normalizer
>>> ct = ColumnTransformer(
...     [("norm1", Normalizer(norm='l1'), [0, 1]),
...      ("norm2", Normalizer(norm='l1'), slice(2, 4))])
>>> X = np.array([[0., 1., 2., 2.],
...               [1., 1., 0., 1.]])
>>> # Normalizer scales each row of X to unit norm. A separate scaling
>>> # is applied for the two first and two last elements of each
>>> # row independently.
>>> ct.fit_transform(X)
array([[0. , 1. , 0.5, 0.5],
       [0.5, 0.5, 0. , 1. ]])
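To see why position-based selection makes the order matter, here is a small pandas-only sketch. It does not call ColumnTransformer; the column names age, income, and score are made up for illustration. Two frames with the same set of columns give different features at the same position once the order differs.

```python
import pandas as pd

# Hypothetical training frame and a live frame with the same columns reordered.
df_train = pd.DataFrame({"age": [30.0], "income": [5000.0], "score": [0.7]})
df_live = df_train[["score", "age", "income"]]

# The same positional index points at different features in the two frames.
first_train = df_train.iloc[:, 0].name
first_live = df_live.iloc[:, 0].name
print(first_train, first_live)

# Same number of columns and same names, yet a different ordering.
same_set = set(df_train.columns) == set(df_live.columns)
same_order = list(df_train.columns) == list(df_live.columns)
print(same_set, same_order)
```

This is the situation in the question: the column count matches, but a transformer that remembers "column 0" from fit would pick up a different feature at predict time.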