Emi*_*Chu 5 python duplicates pandas
Suppose I have the following DataFrame (the one I actually use has more than 100 rows):
>> df
a b c d e
title0 1 0 0 string
title1 0 1 1 string
For each row, I want to:
The output should be:
>> df
a b c d e
title0 1 0 0 string
title1 0 1 0 string
title1 0 0 1 string
The idea is to use get_dummies:
print (df)
a b c d e
0 title0 1 0 0 string1
1 title1 0 1 1 string2
2 title2 1 1 1 string3
3 title3 1 1 0 string4
# select every column except a and e
cols = df.columns.difference(['a','e'])
# or list the column names explicitly
#cols = ['b', 'c', 'd']
print (cols)
Index(['b', 'c', 'd'], dtype='object')
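# keep only those columns, reshape them to a Series with stack, and keep the entries equal to 1
s = df[cols].stack()
# dtype=int keeps the 0/1 output shown here (pandas >= 2.0 returns booleans by default)
df1 = pd.get_dummies(s[s == 1].reset_index(level=1).drop(0, axis=1), prefix='', prefix_sep='', dtype=int)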
print (df1)
b c d
0 1 0 0
1 0 1 0
1 0 0 1
2 1 0 0
2 0 1 0
2 0 0 1
3 1 0 0
3 0 1 0
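To see why the stack step yields one row per set flag, it may help to look at the intermediate objects on a small frame. This is only an illustrative sketch: the two-row frame and the variable names `ones`, `pairs`, and `labels` are made up for the example and are not part of the answer.

```python
import pandas as pd

# Small illustrative frame (hypothetical data, same layout as in the question)
df = pd.DataFrame({
    "a": ["title0", "title1"],
    "b": [1, 0],
    "c": [0, 1],
    "d": [0, 1],
    "e": ["string1", "string2"],
})

# stack gives one entry per (row label, column name) pair
s = df[["b", "c", "d"]].stack()

# keep only the flags that are set
ones = s[s == 1]
pairs = list(ones.index)

# moving the column-name level out of the index leaves the original
# row label as index, repeated once per set flag
labels = ones.reset_index(level=1)["level_1"]

print(pairs)
print(labels)
```

Each surviving pair becomes one output row, and get_dummies then turns the column name in that pair into a one-hot row.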
# finally, drop the original columns, join the new frame, restore the original column order with reindex, and reset the index
df = df.drop(cols, axis=1).join(df1).reindex(columns=df.columns).reset_index(drop=True)
print (df)
a b c d e
0 title0 1 0 0 string1
1 title1 0 1 0 string2
2 title1 0 0 1 string2
3 title2 1 0 0 string3
4 title2 0 1 0 string3
5 title2 0 0 1 string3
6 title3 1 0 0 string4
7 title3 0 1 0 string4
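The steps above could be wrapped in a small helper, roughly as follows. This is a sketch under a few assumptions rather than the answer's exact code. The function name `split_flag_rows` is hypothetical. Rows in which no flag is set are assumed to be kept unchanged. Flag columns are assumed to hold only 0 and 1.

```python
import pandas as pd


def split_flag_rows(df, flag_cols):
    """Return a copy of df with one row per set flag in flag_cols (hypothetical helper)."""
    stacked = df[flag_cols].stack()
    set_flags = stacked[stacked == 1]

    # original row label and column name of every set flag
    rows = set_flags.index.get_level_values(0)
    labels = set_flags.index.get_level_values(1)

    # one-hot encode the column names; dtype=int gives 0/1 instead of booleans
    dummies = pd.get_dummies(pd.Series(labels, index=rows), dtype=int)
    # make sure every flag column exists, even one that is never set
    dummies = dummies.reindex(columns=flag_cols, fill_value=0)

    # the join repeats each original row once per matching one-hot row
    out = df.drop(columns=flag_cols).join(dummies)
    # assumption: rows without any set flag are kept, with all flags 0
    out[flag_cols] = out[flag_cols].fillna(0).astype(int)
    return out.reindex(columns=df.columns).reset_index(drop=True)


df = pd.DataFrame({
    "a": ["title0", "title1", "title2", "title3"],
    "b": [1, 0, 1, 1],
    "c": [0, 1, 1, 1],
    "d": [0, 1, 1, 0],
    "e": ["string1", "string2", "string3", "string4"],
})

result = split_flag_rows(df, ["b", "c", "d"])
print(result)
```

Passing the flag columns explicitly, instead of deriving them with `columns.difference`, keeps their order under the caller's control, since `difference` returns the names sorted.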