Starting from a dataframe with numeric and nominal data:
>>> import pandas as pd
>>> d = {'m': {0: 'M1', 1: 'M2', 2: 'M7', 3: 'M1', 4: 'M2', 5: 'M1'},
...      'qj': {0: 'q23', 1: 'q4', 2: 'q9', 3: 'q23', 4: 'q23', 5: 'q9'},
...      'Budget': {0: 39, 1: 15, 2: 13, 3: 53, 4: 82, 5: 70}}
>>> df = pd.DataFrame.from_dict(d)
>>> df
Budget m qj
0 39 M1 q23
1 15 M2 q4
2 13 M7 q9
3 53 M1 q23
4 82 M2 q23
5 70 M1 q9
get_dummies converts categorical variables into dummy/indicator variables:
>>> df_dummies = pd.get_dummies(df)
>>> df_dummies
Budget m_M1 m_M2 m_M7 qj_q23 qj_q4 qj_q9
0 39 1 0 0 1 0 0
1 15 0 1 0 0 1 0
2 13 0 0 1 0 0 1
3 53 1 0 0 1 0 0
4 82 0 1 0 1 0 0
5 70 1 0 0 0 0 1
What is the most elegant way to write a back_from_dummies function that goes from df_dummies back to df?
>>> (back_from_dummies(df_dummies) == df).all()
Budget True
m True
qj True
dtype: bool
idxmax makes this fairly easy.
import pandas as pd
from itertools import groupby

def back_from_dummies(df):
    result_series = {}
    # Find dummy columns and build (category, column_name) pairs
    dummy_tuples = [(col.split("_", 1)[0], col) for col in df.columns if "_" in col]
    # Non-dummy columns are those without an underscore
    non_dummy_cols = [col for col in df.columns if "_" not in col]
    # For each group of dummy columns, use idxmax to find the active value.
    # groupby only merges adjacent items, which holds for get_dummies output.
    for dummy, cols in groupby(dummy_tuples, lambda item: item[0]):
        # Select the columns for this category
        dummy_df = df[[col[1] for col in cols]]
        # Name of the column holding the maximum (the 1) in each row
        max_columns = dummy_df.idxmax(axis=1)
        # Remove the "category_" prefix
        result_series[dummy] = max_columns.apply(lambda item: item.split("_", 1)[1])
    # Copy non-dummy columns over.
    for col in non_dummy_cols:
        result_series[col] = df[col]
    # Return a dataframe built from the resulting series
    return pd.DataFrame(result_series)
(back_from_dummies(df_dummies) == df).all()
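
As an alternative to the hand-written helper, newer pandas releases ship a built-in inverse, pd.from_dummies. The sketch below assumes pandas 1.5 or later and, like the answer above, assumes that every column containing an underscore is a dummy column. The helper name back_from_dummies_builtin is made up for illustration. Note that pd.from_dummies expects only dummy columns, so the numeric Budget column is set aside first and joined back afterwards.

```python
import pandas as pd

# Same data as in the question, written as plain lists.
df = pd.DataFrame({
    "m": ["M1", "M2", "M7", "M1", "M2", "M1"],
    "qj": ["q23", "q4", "q9", "q23", "q23", "q9"],
    "Budget": [39, 15, 13, 53, 82, 70],
})
df_dummies = pd.get_dummies(df)


def back_from_dummies_builtin(dummies, sep="_"):
    # Hypothetical helper: assumes a column is a dummy column
    # exactly when its name contains `sep`.
    dummy_cols = [c for c in dummies.columns if sep in c]
    other_cols = [c for c in dummies.columns if sep not in c]
    # from_dummies splits each name at `sep` into prefix and category
    # and returns one column per prefix.
    decoded = pd.from_dummies(dummies[dummy_cols], sep=sep)
    return pd.concat([dummies[other_cols], decoded], axis=1)


restored = back_from_dummies_builtin(df_dummies)
# Column order may differ from the original, so align before comparing.
print((restored[df.columns] == df).all())
```

Reordering with restored[df.columns] matters because comparing two dataframes with == requires identical labels in the same order.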