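I am trying to write my preprocessing using scikit-learn Pipelines and ColumnTransformer. However, the fact that the transformers return an array (rather than a DataFrame) puts me off a bit. I would like to be able to use the column names on the processed df as well. Consider the following data and pipeline: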
import pandas as pd
import numpy as np
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder
from sklearn.preprocessing import StandardScaler, MinMaxScaler
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
df = pd.DataFrame(np.random.randn(6, 4), columns=list("ABCD"))
df["E"] = pd.Categorical(["test", "train", "test", "train", "test", "train"])
df["F"] = "foo"
num_columns = ['A', 'B', 'C']
num_transformer = Pipeline(
    steps=[
        ('imputer', SimpleImputer(strategy='median')),
        ('scaler', StandardScaler()),
    ]
)
cat_columns = ['E', 'F']
cat_transformer = Pipeline(
    steps=[
        ('imputer', SimpleImputer(strategy='most_frequent')), …