Apply StandardScaler to parts of a data set

Question


mit*_*tsi 8 python scale pandas scikit-learn data-science

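I want to use several methods of sklearn's StandardScaler. Is it possible to apply these methods to only some columns/features of my set, instead of applying them to the whole set?

For instance, the set is data: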

data = pd.DataFrame({'Name' : [3, 4,6], 'Age' : [18, 92,98], 'Weight' : [68, 59,49]})

   Age  Name  Weight
0   18     3      68
1   92     4      59
2   98     6      49


col_names = ['Name', 'Age', 'Weight']
features = data[col_names]


I fitted the StandardScaler and transformed the data:

scaler = StandardScaler().fit(features.values)
features = scaler.transform(features.values)
scaled_features = pd.DataFrame(features, columns = col_names)

       Name       Age    Weight
0 -1.069045 -1.411004  1.202703
1 -0.267261  0.623041  0.042954
2  1.336306  0.787964 -1.245657


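But of course the names are not really numbers but strings, and I don't want to standardize them. How can I apply the fit and transform methods only to the columns Age and Weight?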

Answer 1

ayh*_*han 14

First create a copy of the DataFrame:

scaled_features = data.copy()


Don't include the Name column in the transformation:

col_names = ['Age', 'Weight']
features = scaled_features[col_names]
scaler = StandardScaler().fit(features.values)
features = scaler.transform(features.values)


Now, instead of creating a new DataFrame, assign the result to those two columns:

scaled_features[col_names] = features
print(scaled_features)


        Age  Name    Weight
0 -1.411004     3  1.202703
1  0.623041     4  0.042954
2  0.787964     6 -1.245657
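
As a sanity check on the approach above, here is a minimal sketch that compares the scaled columns with a hand-computed z-score. It assumes StandardScaler divides by the population standard deviation (ddof=0), which matches the numbers shown in the output; the variable names manual, matches_manual and name_unchanged are introduced here only for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

data = pd.DataFrame({'Name': [3, 4, 6], 'Age': [18, 92, 98], 'Weight': [68, 59, 49]})
col_names = ['Age', 'Weight']

# Scale only the selected columns on a copy, leaving Name untouched.
scaled_features = data.copy()
scaled_features[col_names] = StandardScaler().fit_transform(data[col_names])

# Hand-computed z-score with the population standard deviation (ddof=0).
manual = (data[col_names] - data[col_names].mean()) / data[col_names].std(ddof=0)

matches_manual = np.allclose(scaled_features[col_names].to_numpy(), manual.to_numpy())
name_unchanged = scaled_features['Name'].equals(data['Name'])
print(matches_manual, name_unchanged)
```

Note that pandas' own std() defaults to ddof=1, so omitting ddof=0 in the manual version would give slightly different values.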


Answer 2

Ale*_*lex 8

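Late to the party, but this is my preferred solution:

import pandas as pd
from sklearn.preprocessing import StandardScaler

#load data
data = pd.DataFrame({'Name' : [3, 4,6], 'Age' : [18, 92,98], 'Weight' : [68, 59,49]})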

#list for cols to scale
cols_to_scale = ['Age','Weight']

#create and fit scaler
scaler = StandardScaler()
scaler.fit(data[cols_to_scale])

#scale selected data
data[cols_to_scale] = scaler.transform(data[cols_to_scale])


Answer 3

Guy*_*y C 7

ColumnTransformer, introduced in v0.20, applies transformers to a specified set of columns of an array or pandas DataFrame.

import pandas as pd
data = pd.DataFrame({'Name' : [3, 4,6], 'Age' : [18, 92,98], 'Weight' : [68, 59,49]})

col_names = ['Name', 'Age', 'Weight']
features = data[col_names]

from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler

ct = ColumnTransformer([
        ('somename', StandardScaler(), ['Age', 'Weight'])
    ], remainder='passthrough')

ct.fit_transform(features)


Note: like Pipeline, it also has a shorthand version, make_column_transformer, which doesn't require naming the transformers.

Output

-1.41100443,  1.20270298,  3.       
 0.62304092,  0.04295368,  4.       
 0.78796352, -1.24565666,  6.
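
The make_column_transformer shorthand mentioned in the note might look like the sketch below. It assumes a reasonably recent scikit-learn release, where each tuple is written as (transformer, columns); the variable name result is only illustrative.

```python
import pandas as pd
from sklearn.compose import make_column_transformer
from sklearn.preprocessing import StandardScaler

data = pd.DataFrame({'Name': [3, 4, 6], 'Age': [18, 92, 98], 'Weight': [68, 59, 49]})

# No explicit step name is given; one is generated from the transformer class.
ct = make_column_transformer(
    (StandardScaler(), ['Age', 'Weight']),
    remainder='passthrough',  # keep the untouched columns (here: Name)
)

# The scaled columns come first, followed by the passed-through columns.
result = ct.fit_transform(data)
print(result)
```

As with the named version, the result is a NumPy array whose column order is Age, Weight, Name rather than the original DataFrame order.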


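Great answer! If I do this with a pandas DataFrame, how can I keep the column names? Is there a way that doesn't require renaming all the columns at the end? (4 upvotes)
This is the best answer now (it doesn't require you to copy the DataFrame). (2 upvotes)
The accepted answer doesn't preserve the column names, so it is poor. Use this one-liner instead: `data[['Age', 'Weight']] = StandardScaler().fit_transform(data[['Age', 'Weight']])` (2 upvotes)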

Asked:	9 years, 3 months ago
Viewed:	14009 times
Last active:	6 years, 3 months ago