Scikit-learn ColumnTransformer error: "Column ordering must be equal for fit and for transform when using the remainder keyword"

tud*_*dou 7 python-3.x scikit-learn

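I have a simple model built as a Pipeline that uses a ColumnTransformer.

I can train the model and save it as a pickle.

When I load the pickle and predict on live data, I get the following error from the ColumnTransformer:

Column ordering must be equal for fit and for transform when using the remainder keyword

The training data and the data used for prediction have exactly the same number of columns, say 50. I am not sure how the "ordering" of the columns could have changed.

Why does column ordering matter to ColumnTransformer? How can I fix this? Is there a way to guarantee the "ordering" after running the ColumnTransformer?

Thanks.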

    pipeline = Pipeline([
        ('RepalceInf', ReplaceInf()),
        ('impute_30_100', ColumnTransformer(
            [
                ('oneStdNorm', OneStdImputer(), self.cont_feature_strategy_dict['FEATS_30_100']),
            ],
            remainder='passthrough'
        )),
        ('regress_impute', IterativeImputer(random_state=0, estimator=self.cont_estimator)),
        ('replace_outlier', OutlierReplacer(quantile_range=(1, 99))),
        ('scaler', StandardScaler(with_mean=True))
    ])



import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin


class OneStdImputer(TransformerMixin, BaseEstimator):
    def __init__(self):
        """
        Impute missing data with random values drawn from a normal distribution
        with the column's mean and standard deviation.
        This is a simplified implementation without sparse/dense fit and check.
        """
        self.mean = None
        self.std = None

    def fit(self, X, y=None):
        # X is expected to be a pandas DataFrame; store per-column statistics.
        self.mean = X.mean()
        self.std = X.std()
        return self

    def transform(self, X):
        # Work on a copy so the caller's DataFrame is not modified.
        X = X.copy()
        for col in X.columns:
            self._fill_randnorm(X, col)
        return X

    def _fill_randnorm(self, df, col):
        # Replace the NaNs in one column with random normal draws.
        mask = df[col].isna()
        mu, sigma = self.mean[col], self.std[col]
        df.loc[mask, col] = np.random.normal(mu, sigma, size=mask.sum())
        return df

小智 4

You can use df_new = pd.DataFrame(df_origin, columns=df_train.columns) to make sure the data to be predicted has the same columns, in the same order, as the training data.
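As a minimal sketch of that idea: the helper below reorders a live DataFrame to match the training column order before it is passed to the loaded pipeline. The name align_to_training_columns and the toy frames df_train and df_live are hypothetical and only for illustration. It assumes the training column names were kept at training time, for example stored alongside the pickle.

```python
import pandas as pd


def align_to_training_columns(df_live, train_columns):
    """Return df_live restricted to the training columns, in training order.

    Hypothetical helper for illustration; not part of scikit-learn or pandas.
    """
    missing = [c for c in train_columns if c not in df_live.columns]
    if missing:
        # Fail loudly instead of silently creating all-NaN columns.
        raise ValueError(f"Live data is missing columns: {missing}")
    return df_live.loc[:, list(train_columns)]


# Toy data: same three columns, but the live frame arrives in another order.
df_train = pd.DataFrame({"a": [1.0, 2.0], "b": [3.0, 4.0], "c": [5.0, 6.0]})
df_live = pd.DataFrame({"c": [7.0], "a": [8.0], "b": [9.0]})

df_aligned = align_to_training_columns(df_live, df_train.columns)
print(list(df_aligned.columns))  # ['a', 'b', 'c']
```

Checking for missing columns first is a design choice: reindexing on its own would fill any absent column with NaN, which could hide a real data problem.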

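From the given example, it is clear that ColumnTransformer handles the selected columns by their positional index. (Although you can also select columns by their exact names, I think those are converted to positions as well.)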

>>> import numpy as np
>>> from sklearn.compose import ColumnTransformer
>>> from sklearn.preprocessing import Normalizer
>>> ct = ColumnTransformer(
...     [("norm1", Normalizer(norm='l1'), [0, 1]),
...      ("norm2", Normalizer(norm='l1'), slice(2, 4))])
>>> X = np.array([[0., 1., 2., 2.],
...               [1., 1., 0., 1.]])
>>> # Normalizer scales each row of X to unit norm. A separate scaling
>>> # is applied for the two first and two last elements of each
>>> # row independently.
>>> ct.fit_transform(X)
array([[0. , 1. , 0.5, 0.5],
       [0.5, 0.5, 0. , 1. ]])
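To see why the order matters when columns are selected by position, here is a small sketch using NumPy arrays, so no column names are involved. The data and the transformer name "scale" are made up for illustration. The same values with the two columns swapped give a different result, because index 0 now points at a different feature.

```python
import numpy as np
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler

# Scale column 0 only; column 1 is passed through unchanged.
ct = ColumnTransformer(
    [("scale", StandardScaler(), [0])],
    remainder="passthrough",
)

X_train = np.array([[0.0, 10.0],
                    [2.0, 20.0]])
ct.fit(X_train)

# Same values and same number of columns, but the column order is reversed.
X_swapped = X_train[:, ::-1]

out_same = ct.transform(X_train)       # [[-1., 10.], [ 1., 20.]]
out_swapped = ct.transform(X_swapped)  # [[ 9.,  0.], [19.,  2.]]
print(out_same)
print(out_swapped)
```

With plain arrays scikit-learn has no way to detect the swap, so the scaler is silently applied to the wrong feature. With DataFrame input, the error in the question is what guards against this situation.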