How to reverse the sklearn.OneHotEncoder transform to recover the original data?

Question

How to reverse the sklearn.OneHotEncoder transform to recover the original data?

Phy*_*ece 14 python machine-learning scipy scikit-learn

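I encoded my categorical data using sklearn.OneHotEncoder and fed it to a random forest classifier. Everything seems to work, and I get my predicted output back.

Is there a way to reverse the encoding and convert my output back to its original state?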

Answer 1

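A good systematic way to figure this out is to start with some test data and work through the sklearn.OneHotEncoder source with it. If you don't much care how it works and simply want a quick answer, skip to the bottom.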

import numpy as np

X = np.array([
    [3, 10, 15, 33, 54, 55, 78, 79, 80, 99],
    [5, 1, 3, 7, 8, 12, 15, 19, 20, 8]
]).T

n_values_

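Lines 1763-1786 of the source determine the n_values_ parameter. It is determined automatically if you set n_values='auto' (the default). Alternatively, you can specify a single maximum value for all features (int) or a maximum value per feature (array). Let's assume that we're using the default, so the following lines execute: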

n_samples, n_features = X.shape    # 10, 2
n_values = np.max(X, axis=0) + 1   # [100, 21]
self.n_values_ = n_values

feature_indices_

Next, the feature_indices_ parameter is calculated.

n_values = np.hstack([[0], n_values])  # [0, 100, 21]
indices = np.cumsum(n_values)          # [0, 100, 121]
self.feature_indices_ = indices

So feature_indices_ is merely the cumulative sum of n_values_ with a 0 prepended.
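To see what these offsets buy us, here is a minimal NumPy-only sketch (it does not call scikit-learn). The helper split_global_column is a hypothetical name introduced for illustration; it assumes the block layout described above, where feature i owns the columns from feature_indices[i] up to, but not including, feature_indices[i + 1].

```python
import numpy as np

X = np.array([
    [3, 10, 15, 33, 54, 55, 78, 79, 80, 99],
    [5, 1, 3, 7, 8, 12, 15, 19, 20, 8],
]).T

# Number of columns reserved for each feature: max value + 1.
n_values = X.max(axis=0) + 1
# Block boundaries: cumulative sum with a leading 0.
feature_indices = np.concatenate(([0], np.cumsum(n_values)))


def split_global_column(col, feature_indices):
    """Map a global column number to (feature number, original value).

    Hypothetical helper, not part of scikit-learn.
    """
    # Find the block whose start is the last boundary <= col.
    feature = int(np.searchsorted(feature_indices, col, side="right")) - 1
    return feature, int(col - feature_indices[feature])


print(feature_indices)
print(split_global_column(105, feature_indices))
```

Under these assumptions, global column 105 belongs to the second feature (index 1) and stands for the original value 5.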

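Sparse Matrix Construction

Next, a scipy.sparse.coo_matrix is constructed from the data. It is initialized from three arrays: the sparse data (all ones), the row indices, and the column indices.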

column_indices = (X + indices[:-1]).ravel()
# array([  3, 105,  10, 101,  15, 103,  33, 107,  54, 108,  55, 112,  78, 115,  79, 119,  80, 120,  99, 108])

row_indices = np.repeat(np.arange(n_samples, dtype=np.int32), n_features)
# array([0, 0, 1, 1, 2, 2, 3, 3, 4, 4, 5, 5, 6, 6, 7, 7, 8, 8, 9, 9], dtype=int32)

data = np.ones(n_samples * n_features)
# array([ 1.,  1.,  1.,  1.,  1.,  1.,  1.,  1.,  1.,  1.,  1.,  1.,  1., 1.,  1.,  1.,  1.,  1.,  1.,  1.])

out = sparse.coo_matrix((data, (row_indices, column_indices)),
                        shape=(n_samples, indices[-1]),
                        dtype=self.dtype).tocsr()
# <10x121 sparse matrix of type '<type 'numpy.float64'>' with 20 stored elements in Compressed Sparse Row format>

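Note that the coo_matrix is immediately converted to a scipy.sparse.csr_matrix. The coo_matrix is used as an intermediate format because it "facilitates fast conversion among sparse formats."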
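The construction step can be reproduced outside the encoder. The following is a standalone sketch using only NumPy and SciPy; the offsets [0, 100] and the width 121 are hard-coded from the feature_indices_ values computed earlier, and the variable names are chosen for illustration.

```python
import numpy as np
from scipy import sparse

X = np.array([
    [3, 10, 15, 33, 54, 55, 78, 79, 80, 99],
    [5, 1, 3, 7, 8, 12, 15, 19, 20, 8],
]).T
n_samples, n_features = X.shape

offsets = np.array([0, 100])  # feature_indices_[:-1] from the text
n_columns = 121               # feature_indices_[-1] from the text

# One stored 1 per (sample, feature) pair.
column_indices = (X + offsets).ravel()
row_indices = np.repeat(np.arange(n_samples), n_features)
data = np.ones(n_samples * n_features)

# Build in COO format, then convert to CSR for row-oriented access.
out = sparse.coo_matrix(
    (data, (row_indices, column_indices)),
    shape=(n_samples, n_columns),
).tocsr()

print(out.shape, out.nnz)
```

Each row of the result holds exactly two ones, one per original feature.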

active_features_

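Now, if n_values='auto', the sparse csr matrix is compressed down to only the columns with active features. The sparse csr_matrix is returned if sparse=True; otherwise it is densified before being returned.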

if self.n_values == 'auto':
    mask = np.array(out.sum(axis=0)).ravel() != 0
    active_features = np.where(mask)[0]  # array([  3,  10,  15,  33,  54,  55,  78,  79,  80,  99, 101, 103, 105, 107, 108, 112, 115, 119, 120])
    out = out[:, active_features]  # <10x19 sparse matrix of type '<type 'numpy.float64'>' with 20 stored elements in Compressed Sparse Row format>
    self.active_features_ = active_features

return out if self.sparse else out.toarray()
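
The same compression can be expressed directly on the column numbers, without building the matrix. This is an equivalent reformulation for illustration, not the code the encoder runs: np.unique stands in for the column-sum mask, and np.searchsorted gives each global column its position among the active columns.

```python
import numpy as np

X = np.array([
    [3, 10, 15, 33, 54, 55, 78, 79, 80, 99],
    [5, 1, 3, 7, 8, 12, 15, 19, 20, 8],
]).T
offsets = np.array([0, 100])  # feature_indices_[:-1] from the text

column_indices = (X + offsets).ravel()

# Columns that hold at least one 1, in ascending order.
active_features = np.unique(column_indices)

# Position of every global column within the compressed matrix.
compact_columns = np.searchsorted(active_features, column_indices).reshape(X.shape)

print(len(active_features))
print(compact_columns[0])
```

The first sample (3, 5) ends up in compressed columns 0 and 12, which matches the out.indices values discussed in the decoding section.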

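Decoding

Now let's work in reverse. We'd like to know how to recover X given the sparse matrix that is returned, along with the OneHotEncoder attributes detailed above. Let's assume we actually ran the code above by instantiating a new OneHotEncoder and running fit_transform on our data X.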

from sklearn import preprocessing
ohc = preprocessing.OneHotEncoder()  # all default params
out = ohc.fit_transform(X)

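The key insight to solving this problem is understanding the relationship between active_features_ and out.indices. For a csr_matrix, the indices array contains the column numbers for each data point. However, these column numbers are not guaranteed to be sorted. To sort them, we can use the sorted_indices method.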

out.indices  # array([12,  0, 10,  1, 11,  2, 13,  3, 14,  4, 15,  5, 16,  6, 17,  7, 18, 8, 14,  9], dtype=int32)
out = out.sorted_indices()
out.indices  # array([ 0, 12,  1, 10,  2, 11,  3, 13,  4, 14,  5, 15,  6, 16,  7, 17,  8, 18,  9, 14], dtype=int32)

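We can see that before sorting, the indices are actually reversed within each row. In other words, they are ordered with the last column first and the first column last. This is evident from the first two elements: [12, 0]. 0 corresponds to the 3 in the first column of X, since 3 is the minimum element of that column and was therefore assigned to the first active column. 12 corresponds to the 5 in the second column of X. Since the first column of X occupies 10 distinct columns, the minimum element of the second column (1) gets index 10. The next smallest (3) gets index 11, and the third smallest (5) gets index 12. After sorting, the indices are ordered as we would expect.

Next we look at active_features_: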
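The effect of sorted_indices can be seen in isolation on a tiny hand-built matrix. This sketch uses made-up data (two rows, four columns) and constructs the CSR arrays directly so that the column numbers within each row start out in descending order.

```python
import numpy as np
from scipy import sparse

# Row 0 stores columns [2, 0]; row 1 stores columns [3, 1].
data = np.ones(4)
indices = np.array([2, 0, 3, 1])
indptr = np.array([0, 2, 4])
m = sparse.csr_matrix((data, indices, indptr), shape=(2, 4))

# sorted_indices returns a copy whose column numbers ascend within each row.
s = m.sorted_indices()

print(s.indices)
print(s.toarray())
```

Sorting changes only the internal storage order; the matrix contents are the same before and after.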

ohc.active_features_  # array([  3,  10,  15,  33,  54,  55,  78,  79,  80,  99, 101, 103, 105, 107, 108, 112, 115, 119, 120])

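Notice that there are 19 elements, which corresponds to the number of distinct elements in our data (one element, 8, was repeated once). Notice also that these are arranged in order. The features from the first column of X are unchanged, and the features from the second column have simply had 100 added to them, which corresponds to ohc.feature_indices_[1].

Looking back at out.indices, we can see that the maximum column number is 18, which is one less than the 19 active features in our encoding. A little thought about the relationship here shows that the indices of ohc.active_features_ correspond to the column numbers in out.indices. With this, we can decode: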

import numpy as np
decode_columns = np.vectorize(lambda col: ohc.active_features_[col])
decoded = decode_columns(out.indices).reshape(X.shape)

This gives us:

array([[  3, 105],
       [ 10, 101],
       [ 15, 103],
       [ 33, 107],
       [ 54, 108],
       [ 55, 112],
       [ 78, 115],
       [ 79, 119],
       [ 80, 120],
       [ 99, 108]])

And we can get back the original feature values by subtracting off the offsets from ohc.feature_indices_:

recovered_X = decoded - ohc.feature_indices_[:-1]
array([[ 3,  5],
       [10,  1],
       [15,  3],
       [33,  7],
       [54,  8],
       [55, 12],
       [78, 15],
       [79, 19],
       [80, 20],
       [99,  8]])

Note that you will need to have the original shape of X, which is simply (n_samples, n_features).
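The whole round trip can be checked without the encoder. The sketch below emulates the legacy 'auto' encoding with NumPy and SciPy only, because active_features_ and feature_indices_ belong to the older API and may not exist in the scikit-learn version you have installed. Both function names are hypothetical, and the decoder assumes every row has exactly one stored 1 per feature.

```python
import numpy as np
from scipy import sparse


def encode_like_legacy(X):
    """Emulate the legacy 'auto' encoding described above (illustrative only)."""
    n_samples, n_features = X.shape
    feature_indices = np.concatenate(([0], np.cumsum(X.max(axis=0) + 1)))
    cols = (X + feature_indices[:-1]).ravel()
    rows = np.repeat(np.arange(n_samples), n_features)
    full = sparse.coo_matrix(
        (np.ones(cols.size), (rows, cols)),
        shape=(n_samples, int(feature_indices[-1])),
    ).tocsr()
    # Keep only the columns that contain at least one 1.
    active_features = np.flatnonzero(np.asarray(full.sum(axis=0)).ravel())
    return full[:, active_features], active_features, feature_indices


def decode_like_legacy(out, active_features, feature_indices, shape):
    """Recover the integer matrix from the compressed one-hot matrix."""
    compact = out.tocsr().sorted_indices().indices
    # Fancy indexing does the same job as the np.vectorize lambda above.
    return active_features[compact].reshape(shape) - feature_indices[:-1]


X = np.array([
    [3, 10, 15, 33, 54, 55, 78, 79, 80, 99],
    [5, 1, 3, 7, 8, 12, 15, 19, 20, 8],
]).T

out, active_features, feature_indices = encode_like_legacy(X)
recovered_X = decode_like_legacy(out, active_features, feature_indices, X.shape)
print(np.array_equal(recovered_X, X))
```

Sorting the indices matters here: within each row the ascending column order coincides with the feature order, which is what makes the reshape valid.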

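TL;DR

Given the sklearn.OneHotEncoder instance called ohc, the encoded data (a scipy.sparse.csr_matrix) output from ohc.fit_transform or ohc.transform called out, and the shape of the original data (n_samples, n_features), recover the original data X with: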

# Parentheses are needed so the expression can span several lines.
recovered_X = (
    np.array([ohc.active_features_[col] for col in out.sorted_indices().indices])
    .reshape(n_samples, n_features)
    - ohc.feature_indices_[:-1]
)

Answer 2

Nic*_*ico 9

Use numpy.argmax() with axis=1.

Example:

ohe_encoded = np.array([[0, 0, 1], [0, 1, 0], [0, 1, 0], [1, 0, 0]])
ohe_encoded
> array([[0, 0, 1],
         [0, 1, 0],
         [0, 1, 0],
         [1, 0, 0]])

np.argmax(ohe_encoded, axis = 1)
> array([2, 1, 1, 0], dtype=int64)
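
Note that argmax returns the column position, not the original label, and it only makes sense within the block of columns belonging to one feature. A possible extension is sketched below; decode_blocks is a hypothetical helper, and categories is assumed to be a list with one array of original values per feature, in column order (the layout that categories_ uses in newer scikit-learn versions).

```python
import numpy as np


def decode_blocks(encoded, categories):
    """Decode a dense one-hot matrix feature by feature (hypothetical helper)."""
    encoded = np.asarray(encoded)
    # Column positions where one feature's block ends and the next begins.
    boundaries = np.cumsum([len(cats) for cats in categories])[:-1]
    blocks = np.split(encoded, boundaries, axis=1)
    # argmax gives the position inside the block; look up the original value.
    columns = [
        np.asarray(cats)[block.argmax(axis=1)]
        for block, cats in zip(blocks, categories)
    ]
    return np.column_stack(columns)


categories = [np.array([3, 10, 15]), np.array([1, 5])]
encoded = np.array([
    [1, 0, 0, 0, 1],  # first feature = 3, second feature = 5
    [0, 0, 1, 1, 0],  # first feature = 15, second feature = 1
])
decoded = decode_blocks(encoded, categories)
print(decoded)
```

The same idea applies to classifier outputs that are probabilities rather than exact zeros and ones, since argmax picks the most likely column in each block.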

Answer 3

Boh*_*nik 7

Just compute the dot product of the encoded values with ohe.active_features_. It works for both sparse and dense representations. Example:

from sklearn.preprocessing import OneHotEncoder
import numpy as np

orig = np.array([6, 9, 8, 2, 5, 4, 5, 3, 3, 6])

ohe = OneHotEncoder()
encoded = ohe.fit_transform(orig.reshape(-1, 1)) # input needs to be column-wise

decoded = encoded.dot(ohe.active_features_).astype(int)
assert np.allclose(orig, decoded)

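The key insight is that the active_features_ attribute of the OHE model holds the original value for each binary column. Thus we can decode the binary-encoded numbers by simply computing a dot product with active_features_. For each data point there is just a single 1, at the position of the original value.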
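The arithmetic behind the dot product can be checked with plain NumPy. In this sketch, values plays the role of active_features_ for a single numeric feature, and the one-hot matrix is built by comparison instead of by the encoder; both are stand-ins chosen for illustration.

```python
import numpy as np

orig = np.array([6, 9, 8, 2, 5, 4, 5, 3, 3, 6])

# Distinct original values in ascending order, one per one-hot column.
values = np.unique(orig)

# Row i has a single 1.0 in the column whose value equals orig[i].
encoded = (orig[:, None] == values).astype(float)

# Each row sums value * indicator, which leaves exactly the original value.
decoded = encoded.dot(values).astype(int)

print(decoded)
```

This relies on the data having a single feature with numeric values; with several features, each row would contain several ones and the dot product would add their values together.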

Answer 4

mel*_*des 5

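Since version 0.20 of scikit-learn, the active_features_ attribute of the OneHotEncoder class has been deprecated, so I suggest relying on the categories_ attribute instead.

The function below can help you recover the original data from a matrix that has been one-hot encoded: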

import itertools


def reverse_one_hot(X, y, encoder):
    reversed_data = [{} for _ in range(len(y))]
    all_categories = list(itertools.chain(*encoder.categories_))
    category_names = ['category_{}'.format(i+1) for i in range(len(encoder.categories_))]
    category_lengths = [len(encoder.categories_[i]) for i in range(len(encoder.categories_))]

    for row_index, feature_index in zip(*X.nonzero()):
        category_value = all_categories[feature_index]
        category_name = get_category_name(feature_index, category_names, category_lengths)
        reversed_data[row_index][category_name] = category_value
        reversed_data[row_index]['target'] = y[row_index]

    return reversed_data


def get_category_name(index, names, lengths):

    counter = 0
    for i in range(len(lengths)):
        counter += lengths[i]
        if index < counter:
            return names[i]
    raise ValueError('The index is higher than the number of categorical values')

To test it, I created a small data set that includes the ratings users have given to items:

data = [
    {'user_id': 'John', 'item_id': 'The Matrix', 'rating': 5},
    {'user_id': 'John', 'item_id': 'Titanic', 'rating': 1},
    {'user_id': 'John', 'item_id': 'Forrest Gump', 'rating': 2},
    {'user_id': 'John', 'item_id': 'Wall-E', 'rating': 2},
    {'user_id': 'Lucy', 'item_id': 'The Matrix', 'rating': 5},
    {'user_id': 'Lucy', 'item_id': 'Titanic', 'rating': 1},
    {'user_id': 'Lucy', 'item_id': 'Die Hard', 'rating': 5},
    {'user_id': 'Lucy', 'item_id': 'Forrest Gump', 'rating': 2},
    {'user_id': 'Lucy', 'item_id': 'Wall-E', 'rating': 2},
    {'user_id': 'Eric', 'item_id': 'The Matrix', 'rating': 2},
    {'user_id': 'Eric', 'item_id': 'Die Hard', 'rating': 3},
    {'user_id': 'Eric', 'item_id': 'Forrest Gump', 'rating': 5},
    {'user_id': 'Eric', 'item_id': 'Wall-E', 'rating': 4},
    {'user_id': 'Diane', 'item_id': 'The Matrix', 'rating': 4},
    {'user_id': 'Diane', 'item_id': 'Titanic', 'rating': 3},
    {'user_id': 'Diane', 'item_id': 'Die Hard', 'rating': 5},
    {'user_id': 'Diane', 'item_id': 'Forrest Gump', 'rating': 3},
]

import pandas

data_frame = pandas.DataFrame(data)
data_frame = data_frame[['user_id', 'item_id', 'rating']]

If we are building a prediction model, we have to remember to remove the dependent variable (in this case the rating) from the DataFrame before we encode it.

ratings = data_frame['rating']
data_frame.drop(columns=['rating'], inplace=True)

Then we proceed to do the encoding:

from sklearn.preprocessing import OneHotEncoder

ohc = OneHotEncoder()
encoded_data = ohc.fit_transform(data_frame)
print(encoded_data)

The result is:

  (0, 2)    1.0
  (0, 6)    1.0
  (1, 2)    1.0
  (1, 7)    1.0
  (2, 2)    1.0
  (2, 5)    1.0
  (3, 2)    1.0
  (3, 8)    1.0
  (4, 3)    1.0
  (4, 6)    1.0
  (5, 3)    1.0
  (5, 7)    1.0
  (6, 3)    1.0
  (6, 4)    1.0
  (7, 3)    1.0
  (7, 5)    1.0
  (8, 3)    1.0
  (8, 8)    1.0
  (9, 1)    1.0
  (9, 6)    1.0
  (10, 1)   1.0
  (10, 4)   1.0
  (11, 1)   1.0
  (11, 5)   1.0
  (12, 1)   1.0
  (12, 8)   1.0
  (13, 0)   1.0
  (13, 6)   1.0
  (14, 0)   1.0
  (14, 7)   1.0
  (15, 0)   1.0
  (15, 4)   1.0
  (16, 0)   1.0
  (16, 5)   1.0

After encoding, we can reverse it using the reverse_one_hot function defined above, like this:

reverse_data = reverse_one_hot(encoded_data, ratings, ohc)
print(pandas.DataFrame(reverse_data))

This gives us:

   category_1    category_2  target
0        John    The Matrix       5
1        John       Titanic       1
2        John  Forrest Gump       2
3        John        Wall-E       2
4        Lucy    The Matrix       5
5        Lucy       Titanic       1
6        Lucy      Die Hard       5
7        Lucy  Forrest Gump       2
8        Lucy        Wall-E       2
9        Eric    The Matrix       2
10       Eric      Die Hard       3
11       Eric  Forrest Gump       5
12       Eric        Wall-E       4
13      Diane    The Matrix       4
14      Diane       Titanic       3
15      Diane      Die Hard       5
16      Diane  Forrest Gump       3
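
Recent scikit-learn releases (the ones that expose categories_) also provide an inverse_transform method on the encoder, which may make the manual bookkeeping unnecessary. A minimal sketch, using a made-up four-row subset of the ratings data and assuming a scikit-learn version that accepts string categories directly:

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

# Made-up subset of the user/item columns from the example above.
X = np.array([
    ["John", "The Matrix"],
    ["Lucy", "Die Hard"],
    ["Eric", "Wall-E"],
    ["Diane", "Titanic"],
], dtype=object)

encoder = OneHotEncoder()
encoded = encoder.fit_transform(X)  # sparse matrix by default

# Map each row of ones and zeros back to the original category values.
recovered = encoder.inverse_transform(encoded)

print(recovered)
```

The same call should also work on the output of transform for new data, provided the encoder was fitted on data containing those categories.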

Answer 5

blu*_*lds 0

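The short answer is "no". The encoder takes your categorical data and automatically transforms it into a reasonable set of numbers.

The longer answer is "not automatically". If you provide an explicit mapping using the n_values parameter, though, you can probably implement your own decoding at the other end. See the documentation for some hints on how that might be done.

That said, this is a fairly strange question. You may want to use a DictVectorizer instead.

I feel like I have the same lack of understanding. Why is this a strange question? Without decoding, I wouldn't be able to tell which factor encoded as 0,1 is paired with which coefficient (6 upvotes)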
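
One way to read the "explicit mapping" suggestion is to keep your own table from category to column and invert it yourself. The sketch below is plain Python with no scikit-learn involved; build_mapping, encode and decode are hypothetical names, and the colour values are made up.

```python
def build_mapping(values):
    """Return the sorted categories and a category -> column lookup."""
    categories = sorted(set(values))
    return categories, {cat: i for i, cat in enumerate(categories)}


def encode(values, index_of):
    """One-hot encode each value as a list with a single 1."""
    width = len(index_of)
    return [[1 if j == index_of[v] else 0 for j in range(width)] for v in values]


def decode(rows, categories):
    """Invert encode by looking up the position of the 1 in each row."""
    return [categories[row.index(1)] for row in rows]


values = ["red", "green", "red", "blue"]
categories, index_of = build_mapping(values)
rows = encode(values, index_of)
print(rows)
print(decode(rows, categories))
```

Because the mapping is stored alongside the encoded data, decoding never depends on what the encoder chose internally.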

Archived:	11 years, 9 months ago
Views:	16140
Last activity:	6 years, 2 months ago