Sklearn Transformers: How to apply an encoder to multiple columns and reuse it in production?

Question

atp*_*atp 6 python machine-learning python-3.x scikit-learn

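I use a label encoder during training and want to use the same encoder in production by saving it and loading it later. Every solution I have found online applies the label encoder to only a single column at a time, like this: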

for col in col_list:
    df[col]= df[[col]].apply(LabelEncoder().fit_transform)

In this case, how can I save the encoder and use it later? I tried fitting it on the whole dataframe, but I got the following error.

---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
C:\Users\DA~1\AppData\Local\Temp/ipykernel_3884/730613134.py in <module>
----> 1 l_enc.fit_transform(df_join[le_col].astype(str))

~\anaconda3\envs\ReturnRate\lib\site-packages\sklearn\preprocessing\_label.py in fit_transform(self, y)
    113             Encoded labels.
    114         """
--> 115         y = column_or_1d(y, warn=True)
    116         self.classes_, y = _unique(y, return_inverse=True)
    117         return y

~\anaconda3\envs\ReturnRate\lib\site-packages\sklearn\utils\validation.py in column_or_1d(y, warn)
   1022         return np.ravel(y)
   1023 
-> 1024     raise ValueError(
   1025         "y should be a 1d array, got an array of shape {} instead.".format(shape)
   1026     )

ValueError: y should be a 1d array, got an array of shape (3949037, 14) instead.

I want to fit a label encoder to a dataframe with 10 columns (all categorical), save it, and load it later in production.

Answer 1

Stu*_*olf 3

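First, I would like to point out that LabelEncoder is meant for encoding the target variable. If you apply LabelEncoder to your predictors, you turn them into ordered integers such as 0, 1, 2, 3, and a model may treat that arbitrary order as meaningful, which may not make sense.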

For categorical predictors, you should use one-hot encoding.
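
As a rough illustration of that suggestion, here is a minimal sketch using scikit-learn's OneHotEncoder, which accepts several columns at once. The toy dataframe and the choice of handle_unknown='ignore' are my own assumptions, not something taken from the question's data.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Toy data; column names and values are illustrative only.
train = pd.DataFrame({'f1': ['a', 'b', 'c', 'a'],
                      'f2': ['x', 'y', 'z', 'x']})
col_list = ['f1', 'f2']

# One encoder handles every column. handle_unknown='ignore' encodes
# categories that were not seen during fit as all zeros instead of raising.
ohe = OneHotEncoder(handle_unknown='ignore')
encoded = ohe.fit_transform(train[col_list]).toarray()

# New data containing a category ('d') that was not present at fit time.
new = pd.DataFrame({'f1': ['d'], 'f2': ['y']})
encoded_new = ohe.transform(new[col_list]).toarray()
```

Each column with three categories becomes three indicator columns, so encoded has six columns here. The unseen value 'd' simply produces zeros in the f1 block.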

If you are sure you want label encoding, it goes like this:

from sklearn.preprocessing import LabelEncoder
import pandas as pd
import numpy as np

df = pd.DataFrame({'f1':np.random.choice(['a','b','c'],100),
'f2':np.random.choice(['x','y','z'],100)})

col_list = ['f1','f2']

df[col_list].apply(LabelEncoder().fit_transform)

If you want to keep the encoders, you can store them in a dictionary, one fitted encoder per column:

le = {}
for col in col_list:
    le[col] = LabelEncoder().fit(df[col].values)

le['f1'].transform(df['f1'])

array([1, 0, 2, 0, 2, 0, 2, 1, 1, 2, 0, 1, 2, 1, 1, 1, 0, 2, 1, 2, 1, 2,
       2, 2, 0, 1, 1, 1, 2, 1, 1, 1, 2, 2, 2, 2, 2, 1, 1, 2, 2, 2, 1, 1,
       0, 1, 1, 1, 2, 2, 1, 0, 2, 1, 2, 2, 2, 1, 0, 0, 2, 2, 0, 1, 2, 2,
       0, 2, 1, 2, 1, 1, 1, 1, 1, 2, 2, 0, 2, 0, 1, 1, 1, 0, 2, 0, 0, 2,
       0, 1, 1, 2, 1, 0, 0, 2, 0, 1, 1, 2])

for col in col_list:
    df[col] = le[col].transform(df[col])
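
To actually reuse the dictionary in production, it can be written to disk and loaded back. The following is a minimal sketch with the standard-library pickle module; the file name label_encoders.pkl, the temporary directory, and the toy data are hypothetical choices for illustration.

```python
import os
import pickle
import tempfile

import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Toy training data; column names and values are illustrative only.
train = pd.DataFrame({'f1': ['a', 'b', 'c', 'a'],
                      'f2': ['x', 'y', 'z', 'x']})
col_list = ['f1', 'f2']

# Training: fit one encoder per column and keep them in a dict.
le = {col: LabelEncoder().fit(train[col]) for col in col_list}

# Save the whole dict to disk (hypothetical path).
path = os.path.join(tempfile.gettempdir(), 'label_encoders.pkl')
with open(path, 'wb') as f:
    pickle.dump(le, f)

# Production: load the dict and apply each encoder to its own column.
with open(path, 'rb') as f:
    le_loaded = pickle.load(f)

new = pd.DataFrame({'f1': ['c', 'a'], 'f2': ['y', 'y']})
for col in col_list:
    new[col] = le_loaded[col].transform(new[col])
```

Note that LabelEncoder.transform raises a ValueError for labels it did not see during fit, so production data with new categories needs separate handling. It is also safest to load the pickle with the same scikit-learn version that created it.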

Again, I would think more carefully about whether label encoding is the right choice here.
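
If integer codes on the predictors are really what you need, one alternative to a dict of LabelEncoders (a different technique from the code above) is scikit-learn's OrdinalEncoder, which is designed for feature columns and accepts a 2D input, so a single object can be fitted and saved. A minimal sketch, assuming scikit-learn 0.24 or later for the handle_unknown option and using made-up data:

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

# Toy data; column names and values are illustrative only.
train = pd.DataFrame({'f1': ['a', 'b', 'c', 'a'],
                      'f2': ['x', 'y', 'z', 'x']})
col_list = ['f1', 'f2']

# A single encoder covers all columns. Categories unseen at fit time
# are mapped to -1 instead of raising an error.
enc = OrdinalEncoder(handle_unknown='use_encoded_value', unknown_value=-1)
codes = enc.fit_transform(train[col_list])

# 'd' was not seen during fit, so it is encoded as -1.
new = pd.DataFrame({'f1': ['d', 'b'], 'f2': ['z', 'y']})
codes_new = enc.transform(new[col_list])
```

The fitted enc object can be pickled and reloaded like any other estimator. The codes still impose an arbitrary order on the categories, so the caveat about label encoding predictors applies here too.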
