J.S*_*.P. 3 python dataframe pandas
I have a pandas DataFrame with three columns:
a b c
Donaldson Minnesota 2020
Ozuna Atlanta 2020
Betts Boston 2019
Donaldson Atlanta 2019
Ozuna St. Louis 2019
Torres New York 2019
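I would like to identify every name in column a that has more than one column c value, and then replace all of that name's column b entries with the first value that appears in the DataFrame, like so: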
a b c
Donaldson Minnesota 2020
Ozuna Atlanta 2020
Betts Boston 2019
Donaldson Minnesota 2019
Ozuna Atlanta 2019
Torres New York 2019
This is definitely inefficient, but here is what I have tried so far:
# get a df of just names and cities and deduplicate
df_names = df[['a','b']].drop_duplicates()
# find any multiple column b values and put them in a list
a_matches = pd.DataFrame(df_names.groupby('a')['b'].nunique())
multi_b = a_matches.index[a_matches['b'] > 1].tolist()
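This gives me ['Donaldson','Ozuna'], but now I'm stuck. I can't think of a good way to generate a replacement dictionary for the corresponding values in c. I think there must be a more elegant way to approach this.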
IIUC, you can try groupby + transform together with np.where:
import numpy as np
g = df.groupby('a')
c = g['c'].transform('nunique').gt(1) # column a names that have >1 column c value
df['b'] = np.where(c,g['b'].transform('first'),df['b'])
# for a new df: new = df.assign(b=np.where(c,g['b'].transform('first'),df['b']))
print(df)
a b c
0 Donaldson Minnesota 2020
1 Ozuna Atlanta 2020
2 Betts Boston 2019
3 Donaldson Minnesota 2019
4 Ozuna Atlanta 2019
5 Torres New York 2019
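For anyone who wants to run this end to end, here is a minimal self-contained sketch of the same groupby + transform + np.where idea. The DataFrame is rebuilt from the sample in the question; the names has_multiple_c, first_b and result are only illustrative and do not appear in the original answer.

```python
import numpy as np
import pandas as pd

# Sample data copied from the question
df = pd.DataFrame({
    "a": ["Donaldson", "Ozuna", "Betts", "Donaldson", "Ozuna", "Torres"],
    "b": ["Minnesota", "Atlanta", "Boston", "Atlanta", "St. Louis", "New York"],
    "c": [2020, 2020, 2019, 2019, 2019, 2019],
})

g = df.groupby("a")
# True for every row whose name appears with more than one distinct c value
has_multiple_c = g["c"].transform("nunique").gt(1)
# First b value seen for each name, broadcast back onto the original rows
first_b = g["b"].transform("first")

# Replace b only where the name has several c values; keep b otherwise
result = df.assign(b=np.where(has_multiple_c, first_b, df["b"]))
print(result)
```

Using assign leaves the original df untouched, which makes it easier to compare the before and after frames.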
For the given example, as @ALloz correctly pointed out, you can simply use:
df['b'] = df.groupby('a')['b'].transform('first')
print(df)
a b c
0 Donaldson Minnesota 2020
1 Ozuna Atlanta 2020
2 Betts Boston 2019
3 Donaldson Minnesota 2019
4 Ozuna Atlanta 2019
5 Torres New York 2019
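The two versions agree on the sample data, but they can differ when a name has several b values within a single c value. The sketch below illustrates this with made-up rows: the "Smith" entries are hypothetical and are not part of the question's data.

```python
import numpy as np
import pandas as pd

# Hypothetical data: "Smith" has two b values but only one c value
df = pd.DataFrame({
    "a": ["Donaldson", "Donaldson", "Smith", "Smith"],
    "b": ["Minnesota", "Atlanta", "Chicago", "Denver"],
    "c": [2020, 2019, 2019, 2019],
})

g = df.groupby("a")
first_b = g["b"].transform("first")

# Conditional version: only names with more than one distinct c value change
conditional = np.where(g["c"].transform("nunique").gt(1), first_b, df["b"])
# Unconditional version: every name is set to its first b value
unconditional = first_b

print(list(conditional))       # Smith keeps both Chicago and Denver
print(unconditional.tolist())  # Smith's Denver row becomes Chicago
```

So the shorter form is fine when every repeated name should collapse to its first b value, while the np.where form is the one that follows the "more than one column c value" condition literally.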