Lau*_*ber 2 python aggregate mean pandas
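I want to group by three columns and then compute the mean of a fourth, numeric column over all rows that share the same values in the first three. I can do that with the following: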
df2 = df.groupby(['col1', 'col2', 'col3'], as_index=False)['col4'].mean()
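The problem is that I also want a fifth column that aggregates the values from all the rows grouped together by groupby, and I don't know how to do this on top of the call above. For example: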
df
index  col1    col2    col3   col4  col5
0      Week_1  James   John   1     when and why?
1      Week_1  James   John   3     How?
2      Week_2  James   John   2     Do you know when?
3      Week_2  Mark    Jim    3     What time?
4      Week_2  Andrew  Simon  1     How far is it?
5      Week_2  Andrew  Simon  2     Are you going?

CURRENT (with above function):
index  col1    col2    col3   col4
0      Week_1  James   John   2
1      Week_2  James   John   2
2      Week_2  Mark    Jim    3
3      Week_2  Andrew  Simon  1.5

DESIRED:
index  col1    col2    col3   col4  col5
0      Week_1  James   John   2     when and why?, How?
2      Week_2  James   John   2     Do you know when?
3      Week_2  Mark    Jim    3     What time?
4      Week_2  Andrew  Simon  1.5   How far is it?, Are you going?
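I have tried the approaches here and here, but the .mean() call I am using complicates things. Any help would be appreciated. (If possible, I would also like to specify a custom separator for the col5 strings when aggregating.)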
You can define an aggregation function for each column:
df2=df.groupby(['col1','col2','col3'], as_index=False).agg({'col4':'mean', 'col5':','.join})
print(df2)
     col1    col2   col3  col4                           col5
0  Week_1   James   John   2.0             when and why?,How?
1  Week_2  Andrew  Simon   1.5  How far is it?,Are you going?
2  Week_2   James   John   2.0              Do you know when?
3  Week_2    Mark    Jim   3.0                     What time?
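The question also asks for a custom separator and shows the groups in their original order rather than sorted. Here is a minimal sketch of one way to get both, assuming the sample data from the question; `SEP` is a hypothetical name introduced only for this illustration, and `sort=False` is an addition that is not part of the answer above:

```python
import pandas as pd

# Sample data copied from the question.
df = pd.DataFrame({
    "col1": ["Week_1", "Week_1", "Week_2", "Week_2", "Week_2", "Week_2"],
    "col2": ["James", "James", "James", "Mark", "Andrew", "Andrew"],
    "col3": ["John", "John", "John", "Jim", "Simon", "Simon"],
    "col4": [1, 3, 2, 3, 1, 2],
    "col5": ["when and why?", "How?", "Do you know when?",
             "What time?", "How far is it?", "Are you going?"],
})

SEP = " | "  # hypothetical separator; any string can be used here

# sort=False keeps the groups in order of first appearance
# instead of sorting them by the group keys.
df2 = (
    df.groupby(["col1", "col2", "col3"], as_index=False, sort=False)
      .agg({"col4": "mean", "col5": SEP.join})
)
print(df2)
```

Because `SEP.join` is an ordinary bound method, changing the separator only requires changing the string. Note that the result gets a fresh 0..n-1 index, so the original row labels shown in the DESIRED table are not preserved.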
A more general solution is to aggregate numeric columns by mean and all other columns by join:
import numpy as np
# Numeric columns are averaged; all other columns are joined with ', '
f = lambda x: x.mean() if np.issubdtype(x.dtype, np.number) else ', '.join(x)
df2 = df.groupby(['col1', 'col2', 'col3'], as_index=False).agg(f)
print(df2)
     col1    col2   col3  col4                            col5
0  Week_1   James   John   2.0             when and why?, How?
1  Week_2  Andrew  Simon   1.5  How far is it?, Are you going?
2  Week_2   James   John   2.0               Do you know when?
3  Week_2    Mark    Jim   3.0                      What time?
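The same "mean for numbers, join for everything else" idea can also be written as a named function. The sketch below is a variation, not the answer's own code: it swaps `np.issubdtype` for `pandas.api.types.is_numeric_dtype`, on the assumption that some columns may use pandas extension dtypes that the NumPy check may not handle. The name `mean_or_join` and its `sep` parameter are hypothetical.

```python
import pandas as pd
from pandas.api.types import is_numeric_dtype


def mean_or_join(s, sep=", "):
    """Average a numeric column; join any other column as strings."""
    if is_numeric_dtype(s):
        return s.mean()
    # astype(str) guards against non-string values in the column
    return sep.join(s.astype(str))


# Sample data copied from the question.
df = pd.DataFrame({
    "col1": ["Week_1", "Week_1", "Week_2", "Week_2", "Week_2", "Week_2"],
    "col2": ["James", "James", "James", "Mark", "Andrew", "Andrew"],
    "col3": ["John", "John", "John", "Jim", "Simon", "Simon"],
    "col4": [1, 3, 2, 3, 1, 2],
    "col5": ["when and why?", "How?", "Do you know when?",
             "What time?", "How far is it?", "Are you going?"],
})

# The function is applied to every non-key column within each group.
df2 = df.groupby(["col1", "col2", "col3"], as_index=False).agg(mean_or_join)
print(df2)
```

One caveat with this variant: `is_numeric_dtype` also treats boolean columns as numeric, so a boolean column would be averaged rather than joined.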