Tags: binary-data, categorical-data, dummy-variable, one-hot-encoding
I'm trying to understand the logic behind this binary encoder.
It automatically takes categorical variables and dummy-codes them (similar to one-hot encoding in sklearn), but the number of output columns is reduced to the log2 of the number of unique values.
Basically, when I used the library I noticed that my dummy variables were limited to only a few columns per feature. Digging further, I noticed this @staticmethod, which takes the log2 of the number of unique values in the categorical column.
My question is: why? I realize this reduces the dimensionality of the output data, but what is the logic behind doing it this way? How does taking log2 determine how many digits are needed to represent the data?
def calc_required_digits(X, col):
    """
    figure out how many digits we need to represent the classes present
    """
    return int(np.ceil(np.log2(len(X[col].unique()))))
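To make the behavior concrete, here is a minimal sketch (my own, assuming the version of category_encoders whose source is pasted below; the toy 'city' column is made up for illustration). A column with 10 distinct values comes out as ceil(log2(10)) = 4 binary columns:

import numpy as np
import pandas as pd
from category_encoders import BinaryEncoder

X = pd.DataFrame({'city': ['city_%d' % i for i in range(10)]})  # 10 unique labels

enc = BinaryEncoder(cols=['city']).fit(X)
out = enc.transform(X)

print(X['city'].nunique())                         # 10 unique values
print(int(np.ceil(np.log2(X['city'].nunique()))))  # ceil(log2(10)) = 4
print(out.shape[1])                                # 4 binary columns with this version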
The full source code:
"""Binary encoding"""
import copy
import pandas as pd
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin
from category_encoders.ordinal import OrdinalEncoder
from category_encoders.utils import get_obj_cols, convert_input
__author__ = 'willmcginnis'
class BinaryEncoder(BaseEstimator, TransformerMixin):
    """Binary encoding for categorical variables, similar to onehot, but stores categories as binary bitstrings.

    Parameters
    ----------
    verbose: int
        integer indicating verbosity of output. 0 for none.
    cols: list
        a list of columns to encode, if None, all string columns will be encoded
    drop_invariant: bool
        boolean for whether or not to drop columns with 0 variance
    return_df: bool
        boolean for whether to return a pandas DataFrame from transform (otherwise it will be a numpy array)
    impute_missing: bool
        boolean for whether or not to apply the logic for handle_unknown, will be deprecated in the future.
    handle_unknown: str
        options are 'error', 'ignore' and 'impute', defaults to 'impute', which will impute the category -1. Warning: if
        impute is used, an extra column will be added in if the transform matrix has unknown categories. This can cause
        unexpected changes in dimension in some cases.

    Example
    -------
    >>> from category_encoders import *
    >>> import pandas as pd
    >>> from sklearn.datasets import load_boston
    >>> bunch = load_boston()
    >>> y = bunch.target
    >>> X = pd.DataFrame(bunch.data, columns=bunch.feature_names)
    >>> enc = BinaryEncoder(cols=['CHAS', 'RAD']).fit(X, y)
    >>> numeric_dataset = enc.transform(X)
    >>> print(numeric_dataset.info())
    <class 'pandas.core.frame.DataFrame'>
    RangeIndex: 506 entries, 0 to 505
    Data columns (total 16 columns):
    CHAS_0     506 non-null int64
    RAD_0      506 non-null int64
    RAD_1      506 non-null int64
    RAD_2      506 non-null int64
    RAD_3      506 non-null int64
    CRIM       506 non-null float64
    ZN         506 non-null float64
    INDUS      506 non-null float64
    NOX        506 non-null float64
    RM         506 non-null float64
    AGE        506 non-null float64
    DIS        506 non-null float64
    TAX        506 non-null float64
    PTRATIO    506 non-null float64
    B          506 non-null float64
    LSTAT      506 non-null float64
    dtypes: float64(11), int64(5)
    memory usage: 63.3 KB
    None

    """

    def __init__(self, verbose=0, cols=None, drop_invariant=False, return_df=True, impute_missing=True, handle_unknown='impute'):
        self.return_df = return_df
        self.drop_invariant = drop_invariant
        self.drop_cols = []
        self.verbose = verbose
        self.impute_missing = impute_missing
        self.handle_unknown = handle_unknown
        self.cols = cols
        self.ordinal_encoder = None
        self._dim = None
        self.digits_per_col = {}

    def fit(self, X, y=None, **kwargs):
        """Fit encoder according to X and y.

        Parameters
        ----------
        X : array-like, shape = [n_samples, n_features]
            Training vectors, where n_samples is the number of samples
            and n_features is the number of features.
        y : array-like, shape = [n_samples]
            Target values.

        Returns
        -------
        self : encoder
            Returns self.

        """
        # if the input dataset isn't already a dataframe, convert it to one (using default column names)
        # first check the type
        X = convert_input(X)

        self._dim = X.shape[1]

        # if columns aren't passed, just use every string column
        if self.cols is None:
            self.cols = get_obj_cols(X)

        # train an ordinal pre-encoder
        self.ordinal_encoder = OrdinalEncoder(
            verbose=self.verbose,
            cols=self.cols,
            impute_missing=self.impute_missing,
            handle_unknown=self.handle_unknown
        )
        self.ordinal_encoder = self.ordinal_encoder.fit(X)

        for col in self.cols:
            self.digits_per_col[col] = self.calc_required_digits(X, col)

        # drop all output columns with 0 variance.
        if self.drop_invariant:
            self.drop_cols = []
            X_temp = self.transform(X)
            self.drop_cols = [x for x in X_temp.columns.values if X_temp[x].var() <= 10e-5]

        return self

    def transform(self, X):
        """Perform the transformation to new categorical data.

        Parameters
        ----------
        X : array-like, shape = [n_samples, n_features]

        Returns
        -------
        p : array, shape = [n_samples, n_numeric + N]
            Transformed values with encoding applied.

        """
        if self._dim is None:
            raise ValueError('Must train encoder before it can be used to transform data.')

        # first check the type
        X = convert_input(X)

        # then make sure that it is the right size
        if X.shape[1] != self._dim:
            raise ValueError('Unexpected input dimension %d, expected %d' % (X.shape[1], self._dim, ))

        if not self.cols:
            return X

        X = self.ordinal_encoder.transform(X)

        X = self.binary(X, cols=self.cols)

        if self.drop_invariant:
            for col in self.drop_cols:
                X.drop(col, 1, inplace=True)

        if self.return_df:
            return X
        else:
            return X.values

    def binary(self, X_in, cols=None):
        """
        Binary encoding encodes the integers as binary code with one column per digit.
        """
        X = X_in.copy(deep=True)

        if cols is None:
            cols = X.columns.values
            pass_thru = []
        else:
            pass_thru = [col for col in X.columns.values if col not in cols]

        bin_cols = []
        for col in cols:
            # get how many digits we need to represent the classes present
            digits = self.digits_per_col[col]

            # map the ordinal column into a list of these digits, of length digits
            X[col] = X[col].map(lambda x: self.col_transform(x, digits))

            for dig in range(digits):
                X[str(col) + '_%d' % (dig, )] = X[col].map(lambda r: int(r[dig]) if r is not None else None)
                bin_cols.append(str(col) + '_%d' % (dig, ))

        X = X.reindex(columns=bin_cols + pass_thru)

        return X

    @staticmethod
    def calc_required_digits(X, col):
        """
        figure out how many digits we need to represent the classes present
        """
        return int(np.ceil(np.log2(len(X[col].unique()))))

    @staticmethod
    def col_transform(col, digits):
        """
        The lambda body to transform the column values
        """
        if col is None or float(col) < 0.0:
            return None
        else:
            col = list("{0:b}".format(int(col)))
            if len(col) == digits:
                return col
            else:
                return [0 for _ in range(digits - len(col))] + col
My question is: why? I realize this reduces the dimensionality of the output data, but what is the logic behind doing it this way?
Basically, the point of categorical encoding is to let your algorithm handle categorical features. There are several methods for doing so, and binary encoding is one of them. Its logic is actually similar to that of one-hot encoding (OHE), if you understood that one.
With binary encoding, each unique label in your categorical vector is associated (arbitrarily) with an integer between 0 and (number of unique labels - 1). That integer is then written in base 2, and its base-2 digits are "transcribed", as 0s and 1s, into the newly created columns (a short code sketch reproducing this appears after the table below). For example, suppose your dataset has three distinct labels: 'A', 'B' and 'C'.
The following correspondence is generated arbitrarily:
'A' -> 1 -> 01
'B' -> 2 -> 10
'C' -> 0 -> 00
So an encoding of the given dataset looks like this:
index  my_category  enc_category_0  enc_category_1
0      A            1               0
1      B            0               1
2      C            0               0
3      A            1               0
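To make the mechanics concrete, here is a small sketch (mine, not from the library; pandas and numpy are assumed) that reproduces the table above by hand: map each label to its integer, then spell that integer out bit by bit, least significant bit first, as in the table.

import numpy as np
import pandas as pd

df = pd.DataFrame({'my_category': ['A', 'B', 'C', 'A']})

# the arbitrary label -> integer correspondence from the example above
mapping = {'A': 1, 'B': 2, 'C': 0}
codes = df['my_category'].map(mapping)

# ceil(log2(3)) = 2 digits are enough for 3 distinct labels
n_digits = int(np.ceil(np.log2(len(mapping))))

for d in range(n_digits):
    # bit d of each integer code (least significant bit first, matching the table)
    df['enc_category_%d' % d] = (codes.values >> d) & 1

print(df)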
As for its usefulness, as you said, it reduces the dimensionality. Beyond that, I believe it does not fill the encoded columns with as many zeros as OHE does. Here is an interesting post on the subject: https://medium.com/data-design/visiting-categorical-features-and-encoding-in-decision-trees-53400fa65931
How does taking log2 determine how many digits are needed to represent the data? If you understand how the encoding works, you understand why log2 is used. Taking the log2 of a number tells you how many binary digits are needed to represent it. Example: ceil(log2(10)) = ceil(3.32) = 4, so 4 binary digits are needed to encode 10.
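As a quick sanity check (my own snippet, not from the linked source), the same rule in code; ceil(log2(n)) binary digits are enough to give each of n classes its own bit pattern:

import numpy as np

for n in [3, 5, 10, 100]:
    print(n, int(np.ceil(np.log2(n))))
# 3 -> 2 digits, 5 -> 3, 10 -> 4, 100 -> 7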
For more on the implementation and code examples: http://contrib.scikit-learn.org/categorical-encoding/_modules/category_encoders/binary.html#BinaryEncoder
Hope this was clear.
Joe