pandas在resample/groupby中聚合列表

Lud*_*ica 3 python group-by dataframe pandas pandas-groupby

我有一个数据框,其中每个实例都有一个时间戳,一个id和一个数字列表,如下所示:

timestamp           | id | lists
----------------------------------
2016-01-01 00:00:00 | 1  | [2, 10]
2016-01-01 05:00:00 | 1  | [9, 10, 3, 5]
2016-01-01 10:00:00 | 1  | [1, 10, 5]
2016-01-02 01:00:00 | 1  | [2, 6, 7]
2016-01-02 04:00:00 | 1  | [2, 6]
2016-01-01 02:00:00 | 2  | [0]
2016-01-01 08:00:00 | 2  | [10, 3, 2]
2016-01-01 14:00:00 | 2  | [0, 9, 3]
2016-01-02 03:00:00 | 2  | [0, 9, 2]
Run Code Online (Sandbox Code Playgroud)

对于每个我想要在白天重新采样的ID(这很容易)并连接在同一天发生的所有 实例列表.Resample + concat/sum不起作用,因为resample会删除所有非数字列(请参阅此处)

我想写类似的东西:

daily_data = data.groupby('id').resample('1D').concatenate() # .concatenate() does not exist
Run Code Online (Sandbox Code Playgroud)

所需结果:

timestamp  | id | lists
----------------------------------
2016-01-01 | 1  | [2, 10, 9, 10, 3, 5, 1, 10, 5]
2016-01-02 | 1  | [2, 6, 7, 2, 6]
2016-01-01 | 2  | [0, 10, 3, 2]
2016-01-02 | 2  | [0, 9, 3, 0, 9, 2]
Run Code Online (Sandbox Code Playgroud)

在这里,您可以复制生成我用于描述的输入的脚本:

import pandas as pd 
from random import randint

time = pd.to_datetime( ['2016-01-01 00:00:00', '2016-01-01 05:00:00', 
                        '2016-01-01 10:00:00', '2016-01-02 01:00:00', 
                        '2016-01-02 04:00:00', '2016-01-01 02:00:00', 
                        '2016-01-01 08:00:00', '2016-01-01 14:00:00',
                        '2016-01-02 03:00:00' ]
                      )

id_1 = [1] * 5
id_2 = [2] * 4

lists = [0] * 9
for i in range(9):
    l = [randint(0,10)  for _ in range(randint(1,5) ) ]
    l = list(set(l))
    lists[i] = l

data = {'timestamp': time, 'id': id_1 + id_2, 'lists': lists}

example = pd.DataFrame(data=data)
Run Code Online (Sandbox Code Playgroud)

如果有可能选择删除连接列表中的重复项,则可以获得奖励积分.

piR*_*red 5

正如@jezrael所指出的,这只适用于pandas版本0.18.1+

  • set_index'timestamp',预备以后resample
  • groupby 'id'列和选择lists
  • 之后resample,sum列表将连接它们
  • reset_index 以正确的顺序获取列

df.set_index('timestamp').groupby('id').lists.resample('D').sum() \
  .reset_index('id').reset_index()
Run Code Online (Sandbox Code Playgroud)

在此输入图像描述