如何在pandas中按对象分组滚动函数

use*_*396 6 python group-by apply dataframe pandas

我很难在数据帧中或者在groupby中解决回顾或翻转问题.

以下是我拥有的数据帧的简单示例:

              fruit    amount    
   20140101   apple     3
   20140102   apple     5
   20140102   orange    10
   20140104   banana    2
   20140104   apple     10
   20140104   orange    4
   20140105   orange    6
   20140105   grape     1
   …
   20141231   apple     3
   20141231   grape     2
Run Code Online (Sandbox Code Playgroud)

我需要计算每天前3天每种水果'量'的平均值,并创建以下数据框:

              fruit     average_in_last 3 days
   20140104   apple      4
   20140104   orange     10
   ...
Run Code Online (Sandbox Code Playgroud)

例如在20140104,前3天是20140101,20140102和20140103(注意数据框中的日期不连续且20140103不存在),苹果的平均数量是(3 + 5)/ 2 = 4和橙色是10/1 = 10,其余为0.

样本数据框非常简单,但实际数据框更复杂,更大.希望有人能对此有所了解,谢谢你提前!

dbo*_*ers 6

假设我们在开始时有一个这样的数据框,

>>> df
             fruit  amount
2017-06-01   apple       1
2017-06-03   apple      16
2017-06-04   apple      12
2017-06-05   apple       8
2017-06-06   apple      14
2017-06-08   apple       1
2017-06-09   apple       4
2017-06-02  orange      13
2017-06-03  orange       9
2017-06-04  orange       9
2017-06-05  orange       2
2017-06-06  orange      11
2017-06-07  orange       6
2017-06-08  orange       3
2017-06-09  orange       3
2017-06-10  orange      13
2017-06-02   grape      14
2017-06-03   grape      16
2017-06-07   grape       4
2017-06-09   grape      15
2017-06-10   grape       5

>>> dates = [i.date() for i in pd.date_range('2017-06-01', '2017-06-10')]

>>> temp = (df.groupby('fruit')['amount']
    .apply(lambda x: x.reindex(dates)  # fill in the missing dates for each group)
                      .fillna(0)   # fill each missing group with 0
                      .rolling(3)
                      .sum()) # do a rolling sum
    .reset_index()
    .rename(columns={'amount': 'sum_of_3_days', 
                     'level_1': 'date'}))  # rename date index to date col


>>> temp.head()
   fruit        date  amount
0  apple  2017-06-01     NaN
1  apple  2017-06-02     NaN
2  apple  2017-06-03    17.0
3  apple  2017-06-04    28.0
4  apple  2017-06-05    36.0

# converts the date index into date column 
>>> df = df.reset_index().rename(columns={'index': 'date'})  
>>> df.merge(temp, on=['fruit', 'date'])
>>> df
          date   fruit  amount  sum_of_3_days
0   2017-06-01   apple       1                NaN
1   2017-06-03   apple      16               17.0
2   2017-06-04   apple      12               28.0
3   2017-06-05   apple       8               36.0
4   2017-06-06   apple      14               34.0
5   2017-06-08   apple       1               15.0
6   2017-06-09   apple       4                5.0
7   2017-06-02  orange      13                NaN
8   2017-06-03  orange       9               22.0
9   2017-06-04  orange       9               31.0
10  2017-06-05  orange       2               20.0
11  2017-06-06  orange      11               22.0
12  2017-06-07  orange       6               19.0
13  2017-06-08  orange       3               20.0
14  2017-06-09  orange       3               12.0
15  2017-06-10  orange      13               19.0
16  2017-06-02   grape      14                NaN
17  2017-06-03   grape      16               30.0
18  2017-06-07   grape       4                4.0
19  2017-06-09   grape      15               19.0
20  2017-06-10   grape       5               20.0
Run Code Online (Sandbox Code Playgroud)


小智 5

我也想对groupby使用滚动,这就是为什么我登陆此页面,但是我认为我有一个比以前的建议更好的解决方法。

您可以执行以下操作:

pivoted_df = pd.pivot_table(df, index='date', columns='fruits', values='amount')
average_fruits = pivoted_df.rolling(window=3).mean().stack() 
Run Code Online (Sandbox Code Playgroud)

.stack()不是必需的,但会将您的数据透视表转换回常规df


Rom*_*kar 3

你可以这样做:

>>> df
>>>
           fruit  amount
20140101   apple       3
20140102   apple       5
20140102  orange      10
20140104  banana       2
20140104   apple      10
20140104  orange       4
20140105  orange       6
20140105   grape       1

>>> g= df.set_index('fruit', append=True).groupby(level=1)
>>> res = g['amount'].apply(pd.rolling_mean, 3, 1).reset_index('fruit')
>>> res

           fruit          0
20140101   apple   3.000000
20140102   apple   4.000000
20140102  orange  10.000000
20140104  banana   2.000000
20140104   apple   6.000000
20140104  orange   7.000000
20140105  orange   6.666667
20140105   grape   1.000000
Run Code Online (Sandbox Code Playgroud)

更新

好吧,正如 @cphlewis 在评论中提到的,我的代码不会给出你想要的结果。我检查了不同的方法,到目前为止我发现的方法是这样的(但不确定性能):

>>> df.index = [pd.to_datetime(str(x), format='%Y%m%d') for x in df.index]
>>> df.reset_index(inplace=True)
>>> def avg_3_days(x):
        return df[(df['index'] >= x['index'] - pd.DateOffset(3)) & (df['index'] < x['index']) & (df['fruit'] == x['fruit'])].amount.mean()

>>> df['res'] = df.apply(avg_3_days, axis=1)
>>> df

       index   fruit  amount  res
0 2014-01-01   apple       3  NaN
1 2014-01-02   apple       5    3
2 2014-01-02  orange      10  NaN
3 2014-01-04  banana       2  NaN
4 2014-01-04   apple      10    4
5 2014-01-04  orange       4   10
6 2014-01-05  orange       6    7
7 2014-01-05   grape       1  NaN
Run Code Online (Sandbox Code Playgroud)