Aggregating rows in pandas

Ste*_*zzi 5 python dataframe pandas pandas-groupby

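I'm new to pandas. I need to aggregate the rows that share the same 'Name', then take the mean of 'Rating' and 'NumsHelpful' (ignoring NaN values). 'Review' should be concatenated, while 'Weight(Pounds)' should stay unchanged: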

col names: ['Brand', 'Name', 'NumsHelpful', 'Rating', 'Weight(Pounds)', 'Review']

Name             'Brand'                             'Name'
1534             Zing Zang                Zing Zang Bloody Mary Mix, 32 fl oz   
1535             Zing Zang                Zing Zang Bloody Mary Mix, 32 fl oz   
1536             Zing Zang                Zing Zang Bloody Mary Mix, 32 fl oz   
1537             Zing Zang                Zing Zang Bloody Mary Mix, 32 fl oz   
1538             Zing Zang                Zing Zang Bloody Mary Mix, 32 fl oz   
1539             Zing Zang                Zing Zang Bloody Mary Mix, 32 fl oz   
1540             Zing Zang                Zing Zang Bloody Mary Mix, 32 fl oz   

        'NumsHelpful'     'Rating'       'Weight'
1534          NaN            2              4.5   
1535          NaN            2              4.5   
1536          NaN            NaN            4.5   
1537          NaN            NaN            4.5   
1538          2              NaN            4.5   
1539          3              5              4.5   
1540          5              NaN            4.5   

                        'Review'
1534                                     Yummy - Delish  
1535  The best Bloody Mary mix! - The best Bloody Ma...  
1536  Best Taste by far - I've tried several if not ...  
1537  Best bloody mary mix ever - This is also good ...  
1538  Outstanding - Has a small kick to it but very ...  
1539   OMG! So Good! - Spicy, terrific Bloody Mary mix!  
1540                      Good stuff - This is the best  

So the output should look something like this:

 'Brand'                'Name'                   'NumsHelpful'    'Rating' 
Zing Zang    Zing Zang Bloody Mary Mix, 32 fl oz     3.33             3

 'Weight'               'Review'
   4.5      Review1 / Review2 / ... / ReviewN

How should I proceed? Thanks.

jez*_*ael 9

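Use DataFrameGroupBy.agg with a dictionary that maps each column to its aggregation function. The Weight and Brand columns are aggregated with first, which takes the first value of each group: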

d = {'NumsHelpful':'mean', 
     'Review':'/'.join, 
     'Weight':'first',
     'Brand':'first', 
     'Rating':'mean'}
df = df.groupby('Name').agg(d).reset_index()
print(df)
                                  Name  NumsHelpful  \
0  Zing Zang Bloody Mary Mix, 32 fl oz     3.333333   

                                              Review  Weight      Brand  \
0  Yummy - Delish/The best Bloody Mary mix! - The...     4.5  Zing Zang   

   Rating  
0     3.0  
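For anyone who wants to try this without the original data, here is a minimal self-contained sketch. The sample rows are hypothetical (modeled on the question, with shortened reviews), and the " / " separator is just one choice. Note that str.join raises a TypeError if a Review value is NaN, so missing reviews would need to be filled or dropped first.

```python
import numpy as np
import pandas as pd

# Hypothetical sample data modeled on the question (reviews shortened).
df = pd.DataFrame({
    "Brand": ["Zing Zang"] * 4,
    "Name": ["Zing Zang Bloody Mary Mix, 32 fl oz"] * 4,
    "NumsHelpful": [np.nan, 2, 3, 5],
    "Rating": [2, 2, np.nan, 5],
    "Weight": [4.5] * 4,
    "Review": ["Yummy", "The best", "Outstanding", "Good stuff"],
})

agg_map = {
    "NumsHelpful": "mean",   # mean skips NaN values
    "Rating": "mean",
    "Weight": "first",       # first value of each group
    "Brand": "first",
    "Review": " / ".join,    # concatenate all reviews of the group
}

out = df.groupby("Name").agg(agg_map).reset_index()
print(out)
```

With these rows the result is a single line per product: NumsHelpful averages the three non-missing values, and Review holds all four texts joined by " / ".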

Also, in pandas 0.23.1 you get:

FutureWarning: 'Name' is both an index level and a column label. Defaulting to column, but this will raise an ambiguity error in a future version

The solution is to remove the index name Name:

df.index.name = None

Or:

df = df.rename_axis(None)
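A small sketch of the index-name fix, using a hypothetical two-column frame whose index is deliberately named Name to mimic the question's data. It only shows that rename_axis(None) clears the name so that 'Name' refers unambiguously to the column:

```python
import pandas as pd

# Hypothetical frame whose index is also called "Name", as in the question.
df = pd.DataFrame({"Name": ["a", "a", "b"], "Rating": [1.0, 3.0, 5.0]})
df.index.name = "Name"

# Drop the index name so "Name" can only mean the column.
cleaned = df.rename_axis(None)
print(cleaned.index.name)

# Grouping by the column now works without ambiguity.
means = cleaned.groupby("Name")["Rating"].mean()
print(means)
```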

Another possible solution is not to aggregate with first, but to add those columns to the groupby keys instead:

d = {'NumsHelpful':'mean',  'Review':'/'.join, 'Rating':'mean'}
df = df.groupby(['Name', 'Weight','Brand']).agg(d).reset_index()

If Weight and Brand have the same value within each group, both solutions return the same output.
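To illustrate when the two approaches agree and when they diverge, here is a rough comparison on hypothetical data. The helper names by_first and by_keys are made up for this sketch. Also keep in mind that groupby drops rows whose key is NaN by default, so a missing Weight or Brand would remove rows in the second approach.

```python
import pandas as pd

# Hypothetical data: two products, each with a constant Weight and Brand.
df = pd.DataFrame({
    "Name": ["Mix A", "Mix A", "Mix B"],
    "Brand": ["Zing Zang", "Zing Zang", "Other"],
    "Weight": [4.5, 4.5, 2.0],
    "Rating": [2.0, 4.0, 5.0],
    "Review": ["Yummy", "Good stuff", "Fine"],
})
cols = ["Name", "Weight", "Brand", "Rating", "Review"]

def by_first(frame):
    # Group by Name only; keep the first Weight and Brand of each group.
    agg_map = {"Rating": "mean", "Review": " / ".join,
               "Weight": "first", "Brand": "first"}
    return frame.groupby("Name").agg(agg_map).reset_index()[cols]

def by_keys(frame):
    # Treat Weight and Brand as additional grouping keys.
    agg_map = {"Rating": "mean", "Review": " / ".join}
    keys = ["Name", "Weight", "Brand"]
    return frame.groupby(keys).agg(agg_map).reset_index()[cols]

# Constant Weight/Brand per Name: both give one row per product.
print(by_first(df))
print(by_keys(df))

# If one "Mix A" row gets a different weight, the key-based version
# splits that product into two groups, while "first" keeps one row.
mixed = df.copy()
mixed.loc[0, "Weight"] = 5.0
print(len(by_first(mixed)), len(by_keys(mixed)))
```

So the first variant silently keeps one value when a group is inconsistent, while the second one exposes the inconsistency as extra rows.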

EDIT:

If you need to convert string (object) columns to numeric, first try converting with astype:

df['Weight(Pounds)'] = df['Weight(Pounds)'].astype(float)

If that fails, use to_numeric with the parameter errors='coerce' to convert unparseable strings to NaNs:

df['Weight(Pounds)'] = pd.to_numeric(df['Weight(Pounds)'], errors='coerce')
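A short sketch of the difference between the two conversions, on a hypothetical Series in which the value "n/a" stands in for whatever unparseable strings the real column might contain:

```python
import pandas as pd

# Hypothetical weights stored as strings; "n/a" cannot be parsed as a number.
weights = pd.Series(["4.5", "3", "n/a"])

try:
    weights.astype(float)
except (ValueError, TypeError) as exc:
    # astype fails as soon as one value is not a valid number.
    print("astype failed:", exc)

# to_numeric with errors='coerce' turns unparseable values into NaN instead.
converted = pd.to_numeric(weights, errors="coerce")
print(converted)
```

Since coercion hides bad values as NaN, it may be worth checking converted.isna().sum() against the original column before aggregating.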