gab*_*how 10 python pandas categorical-data
Hi, I have a pandas DataFrame df that contains categorical variables.
import pandas
df = pandas.DataFrame(data=[['male', 'blue'], ['female', 'brown'],
                            ['male', 'black']], columns=['gender', 'eyes'])
df
Out[16]:
   gender   eyes
0    male   blue
1  female  brown
2    male  black
Using the get_dummies function, I get the following DataFrame:
df_dummies = pandas.get_dummies(df)
df_dummies
Out[18]:
   gender_female  gender_male  eyes_black  eyes_blue  eyes_brown
0              0            1           0          1           0
1              1            0           0          0           1
2              0            1           1          0           0
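However, the columns gender_female and gender_male carry the same information, because the original column can take only two values. Is there a (smart) way to keep only one of the two columns?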
Update
Using
df_dummies = pandas.get_dummies(df,drop_first=True)
would give me
df_dummies
Out[21]:
   gender_male  eyes_blue  eyes_brown
0            1          1           0
1            0          0           1
2            1          0           0
But I only want to drop a dummy column when the original variable had just two possible values.
The desired result should be
df_dummies
Out[18]:
   gender_male  eyes_black  eyes_blue  eyes_brown
0            1           0          1           0
1            0           0          0           1
2            1           1          0           0
Joe*_*Joe 14
Yes, you can use the drop_first parameter:
drop_first=True
From the documentation:
import pandas as pd
pd.get_dummies(pd.Series(list('abcaa')), drop_first=True)
   b  c
0  0  0
1  1  0
2  0  1
3  0  0
4  0  0
To keep every dummy column for eyes but only one for gender, use:
df = pd.get_dummies(df, prefix=['eyes'], columns=['eyes'])
df = pd.get_dummies(df,drop_first=True)
Output:
   eyes_black  eyes_blue  eyes_brown  gender_male
0           0          1           0            1
1           0          0           1            0
2           1          0           0            1
More generally:
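   gender   eyes    heigh
0    male   blue     tall
1  female  brown    short
2    male  black  average
# Fully encode every column that has more than two distinct values
for i in df.columns:
    if len(df.groupby([i]).size()) > 2:
        df = pd.get_dummies(df, prefix=[i], columns=[i])
# The remaining (two-valued) columns keep only one dummy each
df = pd.get_dummies(df, drop_first=True)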
Output:
   eyes_black  eyes_blue  eyes_brown  heigh_average  heigh_short  heigh_tall  \
0           0          1           0              0            0           1
1           0          0           1              0            1           0
2           1          0           0              1            0           0

   gender_male
0            1
1            0
2            1
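If the original column order matters (the question's desired result lists gender_male first), one possible variant is to encode each column separately and set drop_first only for two-valued columns. This is a rough sketch, not part of the answer above: dummies_drop_binary is a hypothetical helper name, it assumes every column in df is categorical, and dtype=int is passed only so the result shows 0/1, since newer pandas versions return booleans by default.

```python
import pandas as pd


def dummies_drop_binary(df, dtype=int):
    """Dummy-encode each column; drop the first level only for two-valued columns.

    Hypothetical helper for illustration; assumes all columns are categorical.
    """
    parts = []
    for col in df.columns:
        # nunique() counts the distinct non-null values in the column
        is_binary = df[col].nunique() == 2
        parts.append(
            pd.get_dummies(df[col], prefix=col, drop_first=is_binary, dtype=dtype)
        )
    # Concatenate side by side, keeping the original column order
    return pd.concat(parts, axis=1)


df = pd.DataFrame(
    data=[['male', 'blue'], ['female', 'brown'], ['male', 'black']],
    columns=['gender', 'eyes'],
)
result = dummies_drop_binary(df)
print(result)
```

On the question's example frame this should yield the columns gender_male, eyes_black, eyes_blue, eyes_brown, in that order. A column with a single distinct value keeps its one dummy column here, because drop_first is only applied when there are exactly two values.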