Pandas groupby: rolling sum of multiple columns over a datetime column

Question

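I am trying to compute a rolling sum of multiple columns by group, rolling over a datetime column (that is, over a specified time interval). Rolling a single column seems to work fine, but when I roll several columns at once, in a vectorized way, I get unexpected results.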

My first attempt:

import pandas as pd

df = pd.DataFrame({"column1": range(6),
                   "column2": range(6),
                   'group': 3*['A','B'],
                   'date': pd.date_range("20190101", periods=6)})

(df.groupby('group').rolling("1d", on='date')['column1'].sum()).groupby('group').shift(fill_value=0)

# output:
group  date      
A      2019-01-01    0.0
       2019-01-03    0.0
       2019-01-05    2.0
B      2019-01-02    0.0
       2019-01-04    1.0
       2019-01-06    3.0
Name: column1, dtype: float64

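The above produces the expected result, but I lose the original index in the process. Because some dates in my data are duplicated, I would have to join back to the original DataFrame on group + date, which is inefficient. So I used the following approach to avoid that and keep the original index: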

df.groupby('group').apply(lambda x: x.rolling("1d", on='date')['column1'].sum().shift(fill_value=0))

# output:
group   
A      0    0.0
       2    0.0
       4    2.0
B      1    0.0
       3    1.0
       5    3.0
Name: column1, dtype: float64

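Because this result keeps the original row labels in the second index level, it can be attached to df by dropping the group level. The following is only a minimal sketch of that step: the column name column1_rollsum is made up for illustration, and group_keys=True is written out explicitly on the assumption that the default may differ between pandas versions.

```python
import pandas as pd

df = pd.DataFrame({
    "column1": range(6),
    "column2": range(6),
    "group": 3 * ["A", "B"],
    "date": pd.date_range("20190101", periods=6),
})

# Same computation as above; group_keys=True is spelled out so the group
# label is prepended to the index whatever the pandas default is.
rolled = df.groupby("group", group_keys=True).apply(
    lambda x: x.rolling("1d", on="date")["column1"].sum().shift(fill_value=0)
)

# Drop the outer group level so that only the original row labels remain,
# then let pandas align on the index during assignment.
# "column1_rollsum" is a hypothetical column name.
df["column1_rollsum"] = rolled.droplevel(0)
print(df)
```

Assignment aligns on the index, so an explicit sort is not strictly needed for this step.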

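This way, I can easily assign it to a new column of the original df by sorting on the index. Now I want to repeat the same operation for "column2", and do so in a vectorized way. However, the result is not what I expected: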

df.groupby('group').apply(lambda x: x.rolling("1d", on='date')[['column1','column2']].sum().shift(fill_value=0))

# output:

   column1  column2       date
0      0.0      0.0 1970-01-01
1      0.0      0.0 1970-01-01
2      0.0      0.0 2019-01-01
3      1.0      1.0 2019-01-02
4      2.0      2.0 2019-01-03
5      3.0      3.0 2019-01-04

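The result is correct, but unexpected for two reasons: (1) the group_keys setting of groupby is ignored, and (2) the result is automatically sorted and its index reset, as the transform method would do.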

I would like to understand why this happens, and whether there is another way to achieve the result above.

Answer 1

dav*_*lla 0

I took your original approach and made a few changes. Can you check whether this is what you want?

Reset the index of the original DataFrame and give the column holding the original index a name:

df = df.reset_index().rename(columns={df.index.name: 'index'})

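As a small illustration of what this line does, the sketch below runs it on the sample data and on a copy whose index has a name. The name row_id is made up for the example. With an unnamed index, reset_index() already calls the new column 'index', so the rename only matters when the index is named.

```python
import pandas as pd

df = pd.DataFrame({
    "column1": range(6),
    "column2": range(6),
    "group": 3 * ["A", "B"],
    "date": pd.date_range("20190101", periods=6),
})

# Unnamed index: reset_index() names the new column 'index' by itself.
unnamed = df.reset_index()
print(unnamed.columns.tolist())

# Named index ('row_id' is a hypothetical name): reset_index() creates a
# 'row_id' column, and the rename turns it into 'index'.
named_df = df.rename_axis("row_id")
named = named_df.reset_index().rename(columns={named_df.index.name: "index"})
print(named.columns.tolist())
```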

Now you have the same original DataFrame, but with an additional column called 'index' that holds the original index.

Apply rolling to the DataFrame grouped with groupby on the group and index columns, summing the two columns column1 and column2:

(df.groupby(['group', 'index']).rolling("1d", on='date')[['column1', 'column2']].sum()).groupby('group').shift(fill_value=0)

Result:

                        column1  column2
group index date                        
A     0     2019-01-01      0.0      0.0
      2     2019-01-03      0.0      0.0
      4     2019-01-05      2.0      2.0
B     1     2019-01-02      0.0      0.0
      3     2019-01-04      1.0      1.0
      5     2019-01-06      3.0      3.0

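One caveat that may be worth checking before relying on this: every (group, index) pair contains exactly one row, so each rolling window sees a single value and the sum is just that value. With the sample data and a "1d" window this gives the same numbers as grouping by group alone, because no two rows of a group fall within one day of each other. The sketch below uses a "3d" window, an assumption chosen only to make the difference visible.

```python
import pandas as pd

df = pd.DataFrame({
    "column1": range(6),
    "column2": range(6),
    "group": 3 * ["A", "B"],
    "date": pd.date_range("20190101", periods=6),
})

# Grouping by 'group' only: a 3-day window can span two rows of a group.
per_group = df.groupby("group").rolling("3d", on="date")[["column1"]].sum()

# Grouping by 'group' and the original index: every group is a single row,
# so the "rolling sum" is the row's own value.
per_row = (
    df.reset_index()
      .groupby(["group", "index"])
      .rolling("3d", on="date")[["column1"]]
      .sum()
)

print(per_group["column1"].tolist())
print(per_row["column1"].tolist())
```

If the two lists differ on your own data and window, the extra grouping key is changing the sums rather than only preserving the index.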

If you want to restore the original index, reset the MultiIndex and set 'index' as the index:

(df.groupby(['group', 'index']).rolling("1d", on='date')[['column1', 'column2']].sum()).groupby('group').shift(fill_value=0).reset_index().set_index('index')

Result:

      group       date  column1  column2
index                                   
0         A 2019-01-01      0.0      0.0
2         A 2019-01-03      0.0      0.0
4         A 2019-01-05      2.0      2.0
1         B 2019-01-02      0.0      0.0
3         B 2019-01-04      1.0      1.0
5         B 2019-01-06      3.0      3.0

Add a .sort_index() if you want the result sorted:

      group       date  column1  column2
index                                   
0         A 2019-01-01      0.0      0.0
1         B 2019-01-02      0.0      0.0
2         A 2019-01-03      0.0      0.0
3         B 2019-01-04      1.0      1.0
4         A 2019-01-05      2.0      2.0
5         B 2019-01-06      3.0      3.0

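On the "why" part of the question: as far as I can tell, older pandas versions treated an apply result whose index matched the input as transform-like and did not prepend the group keys. The pandas 1.5 release notes describe this behaviour and deprecate it. The sketch below is one possible way to sidestep the issue, not necessarily the best one. It loops over the groups instead of calling groupby.apply, rolls over the date within each group, and puts the original row labels back. The helper name shifted_rolling_sum and the _rollsum suffix are made up for illustration, and the rows of each group are assumed to be sorted by date already.

```python
import pandas as pd

df = pd.DataFrame({
    "column1": range(6),
    "column2": range(6),
    "group": 3 * ["A", "B"],
    "date": pd.date_range("20190101", periods=6),
})
cols = ["column1", "column2"]


def shifted_rolling_sum(g, window="1d"):
    """Rolling sum over 'date' within one group, shifted down by one row."""
    # Assumes the rows of g are already sorted by date, as in the sample data.
    rolled = g.set_index("date")[cols].rolling(window).sum()
    rolled.index = g.index  # put the original row labels back
    return rolled.shift(fill_value=0)


# A plain loop over the groups, so groupby.apply's index handling never applies.
parts = [shifted_rolling_sum(g) for _, g in df.groupby("group")]
result = pd.concat(parts).sort_index()

# Index alignment attaches the sums to the matching rows of df.
out = df.join(result.add_suffix("_rollsum"))
print(out)
```

Because only the numeric columns are shifted, the date column of df is left untouched instead of being filled with 1970-01-01.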

Hope this helps! Let me know if I missed anything.
