chi*_*a g 16 python scikit-learn
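I am trying to convert categorical values (in my case the Country column) into encoded values using LabelEncoder followed by OneHotEncoder, and the conversion works. However, I get a warning saying that the OneHotEncoder 'categorical_features' keyword is deprecated and that I should "use the ColumnTransformer instead". So how can I use ColumnTransformer to achieve the same result?
Below are my input data set and the code I tried.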
Input Data set
Country Age Salary
France 44 72000
Spain 27 48000
Germany 30 54000
Spain 38 61000
Germany 40 67000
France 35 58000
Spain 26 52000
France 48 79000
Germany 50 83000
France 37 67000
import pandas as pd
import numpy as np
from sklearn.preprocessing import LabelEncoder, OneHotEncoder
# x is my dataset variable name
label_encoder = LabelEncoder()
x.iloc[:,0] = label_encoder.fit_transform(x.iloc[:,0]) #LabelEncoder is used to encode the country value
hot_encoder = OneHotEncoder(categorical_features = [0])
x = hot_encoder.fit_transform(x).toarray()
This is the output I get. How can I get the same output with ColumnTransformer?
0(fran) 1(ger) 2(spain) 3(age) 4(salary)
1 0 0 44 72000
0 0 1 27 48000
0 1 0 30 54000
0 0 1 38 61000
0 1 0 40 67000
1 0 0 35 58000
0 0 1 36 52000
1 0 0 48 79000
0 1 0 50 83000
1 0 0 37 67000
I tried the following code:
from sklearn.compose import ColumnTransformer, make_column_transformer
preprocess = make_column_transformer(
( [0], OneHotEncoder())
)
x = preprocess.fit_transform(x).toarray()
With the code above I can encode the Country column, but the Age and Salary columns are missing from x after the transformation.
Pra*_*iel 10
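It is odd that you want to encode continuous data such as Salary. That makes no sense unless you have binned the salary into specific ranges/categories. If I were you, I would do: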
import pandas as pd
import numpy as np
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler, OneHotEncoder
# Age and Salary are continuous, so impute and scale them
numeric_features = ['Age', 'Salary']
numeric_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='median')),
    ('scaler', StandardScaler())])
# Country is the only categorical column, so one-hot encode it
categorical_features = ['Country']
categorical_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='constant', fill_value='missing')),
    ('onehot', OneHotEncoder(handle_unknown='ignore'))])
# Route each group of columns to its own transformer
preprocessor = ColumnTransformer(
    transformers=[
        ('num', numeric_transformer, numeric_features),
        ('cat', categorical_transformer, categorical_features)])
From here you can chain it with a classifier in a pipeline, for example:
from sklearn.linear_model import LogisticRegression
clf = Pipeline(steps=[('preprocessor', preprocessor),
                      ('classifier', LogisticRegression(solver='lbfgs'))])
Use it like this:
clf.fit(X_train,y_train)
This applies the preprocessor and then passes the transformed data to the predictor.
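To see what a preprocessor of this kind produces on the sample data from the question, here is a minimal sketch. It assumes Age and Salary are treated as numeric and Country as categorical, it skips the imputers because the sample has no missing values, and the names `df` and `transformed` are only illustrative.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Sample data copied from the question
df = pd.DataFrame({
    "Country": ["France", "Spain", "Germany", "Spain", "Germany",
                "France", "Spain", "France", "Germany", "France"],
    "Age": [44, 27, 30, 38, 40, 35, 26, 48, 50, 37],
    "Salary": [72000, 48000, 54000, 61000, 67000,
               58000, 52000, 79000, 83000, 67000],
})

# Scale the numeric columns and one-hot encode the categorical one
preprocessor = ColumnTransformer(transformers=[
    ("num", StandardScaler(), ["Age", "Salary"]),
    ("cat", OneHotEncoder(handle_unknown="ignore"), ["Country"]),
])

transformed = preprocessor.fit_transform(df)
print(transformed.shape)  # 2 scaled columns + 3 one-hot columns
```

The output columns follow the order of the `transformers` list, so the two scaled columns come first and the Country indicator columns come last.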
小智 10
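I don't think the poster is trying to transform Age and Salary. According to the documentation (https://scikit-learn.org/stable/modules/generated/sklearn.compose.make_column_transformer.html), ColumnTransformer (and make_column_transformer) only keeps the columns you specify in the transformers (i.e. [0] in your example). You should set remainder="passthrough" to keep the rest of the columns. In other words: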
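from sklearn.compose import make_column_transformer
from sklearn.preprocessing import OneHotEncoder
preprocessor = make_column_transformer((OneHotEncoder(), [0]), remainder="passthrough")
x = preprocessor.fit_transform(x)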
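As a quick check of the remainder="passthrough" approach, the sketch below runs it on the sample data from the question. The DataFrame is rebuilt by hand here, and `sparse_threshold=0` is my own addition to force a dense NumPy array regardless of how sparse the result is; neither comes from the original answer.

```python
import pandas as pd
from sklearn.compose import make_column_transformer
from sklearn.preprocessing import OneHotEncoder

# Sample data copied from the question
x = pd.DataFrame({
    "Country": ["France", "Spain", "Germany", "Spain", "Germany",
                "France", "Spain", "France", "Germany", "France"],
    "Age": [44, 27, 30, 38, 40, 35, 26, 48, 50, 37],
    "Salary": [72000, 48000, 54000, 61000, 67000,
               58000, 52000, 79000, 83000, 67000],
})

preprocessor = make_column_transformer(
    (OneHotEncoder(), [0]),    # one-hot encode column 0 (Country)
    remainder="passthrough",   # keep Age and Salary unchanged
    sparse_threshold=0,        # assumption: always return a dense array
)
encoded = preprocessor.fit_transform(x)
print(encoded.shape)
print(encoded[0])  # France, Germany, Spain indicators, then Age and Salary
```

The encoded columns come first (categories in sorted order), followed by the passed-through columns in their original order, which matches the layout the question asks for.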
The simplest approach is to use pandas get_dummies on the DataFrame loaded from your CSV:
import pandas as pd
dataset = pd.read_csv("yourfile.csv")
dataset = pd.get_dummies(dataset, columns=['Country'])
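For comparison, here is a small sketch of the get_dummies route on the same sample data, built in memory instead of read from a file (so "yourfile.csv" is not needed). The variable name `dummies` is only illustrative.

```python
import pandas as pd

# Sample data copied from the question
dataset = pd.DataFrame({
    "Country": ["France", "Spain", "Germany", "Spain", "Germany",
                "France", "Spain", "France", "Germany", "France"],
    "Age": [44, 27, 30, 38, 40, 35, 26, 48, 50, 37],
    "Salary": [72000, 48000, 54000, 61000, 67000,
               58000, 52000, 79000, 83000, 67000],
})

# Replace Country with one indicator column per category
dummies = pd.get_dummies(dataset, columns=["Country"])
print(list(dummies.columns))
```

Note that get_dummies appends the indicator columns after the untouched ones, so the column order differs from the ColumnTransformer output. It also does not remember the categories seen during fitting, which matters if you later need to encode new data in the same way.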