pandas: how to do multiple groupby-apply operations

Ana*_*sid 6 python group-by dataframe pandas pandas-groupby

I have more experience with R's data.table, but I am trying to learn pandas. In data.table, I can do something like this:

> head(dt_m)
   event_id           device_id longitude latitude               time_ category
1:  1004583 -100015673884079572        NA       NA 1970-01-01 06:34:52   1 free
2:  1004583 -100015673884079572        NA       NA 1970-01-01 06:34:52   1 free
3:  1004583 -100015673884079572        NA       NA 1970-01-01 06:34:52   1 free
4:  1004583 -100015673884079572        NA       NA 1970-01-01 06:34:52   1 free
5:  1004583 -100015673884079572        NA       NA 1970-01-01 06:34:52   1 free
6:  1004583 -100015673884079572        NA       NA 1970-01-01 06:34:52   1 free
                 app_id is_active
1: -5305696816021977482         0
2: -7164737313972860089         0
3: -8504475857937456387         0
4: -8807740666788515175         0
5:  5302560163370202064         0
6:  5521284031585796822         0


dt_m_summary <- dt_m[,
                     .(
                       mean_active = mean(is_active, na.rm = TRUE)
                       , median_lat = median(latitude, na.rm = TRUE)
                       , median_lon = median(longitude, na.rm = TRUE)
                       , mean_time = mean(time_)
                       , new_col = your_function(latitude, longitude, time_)
                     )
                     , by = list(device_id, category)
                     ]

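The new columns (mean_active through new_col), along with device_id and category, will appear in dt_m_summary. If I want a new column holding the groupby-apply results, I can also do a similar by transformation in the original table: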

dt_m[, mean_active := mean(is_active, na.rm = TRUE), by = list(device_id, category)]

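(in case I want to, for example, select the rows where mean_active is greater than some threshold, or do something else).

I know pandas has groupby, but I have not found a way to do the kind of easy transformations shown above. The best I could come up with is a series of groupby-apply operations followed by merging the results into one dataframe, but that seems very clunky. Is there a better way to do this?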

piR*_*red 6

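IIUC, use groupby and agg. See the documentation for more information.

import numpy as np
import pandas as pd

# 10 rows of random values, indexed by (X or Y, 0..4), columns A and B
df = pd.DataFrame(np.random.rand(10, 2),
                  pd.MultiIndex.from_product([list('XY'), range(5)]),
                  list('AB'))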

df

df.groupby(level=0).agg(['sum', 'count', 'std'])


A more tailored example would be

# level=0 means group by the first level in the index
# if there is a specific column you want to group by
# use groupby('specific column name')
# a (name, function) tuple labels the output column
df.groupby(level=0).agg({'A': ['sum', 'std'],
                         'B': [('my_function', lambda x: x.sum() ** 2)]})

Note that the dict passed to the agg method has the keys 'A' and 'B'. This means: run the functions ['sum', 'std'] for 'A', and run lambda x: x.sum() ** 2 for 'B' (and label it 'my_function').
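
For a result shaped like dt_m_summary, with flat column names chosen up front, named aggregation is one option. This is a minimal sketch that assumes pandas 0.25 or later; the toy values are made up for illustration, and only the column names come from the question.

```python
import numpy as np
import pandas as pd

# Toy data: column names mirror the question, values are invented.
df = pd.DataFrame({
    "device_id": [1, 1, 2, 2, 2],
    "category": ["free", "free", "paid", "paid", "paid"],
    "is_active": [0, 1, 1, 1, 0],
    "latitude": [10.0, np.nan, 30.0, 50.0, 40.0],
    "longitude": [1.0, 3.0, np.nan, 5.0, 7.0],
})

# Each keyword is an output column; the value is (input column, reducer).
# The built-in reducers skip NaN, similar to na.rm = TRUE in R.
summary = (
    df.groupby(["device_id", "category"])
      .agg(
          mean_active=("is_active", "mean"),
          median_lat=("latitude", "median"),
          median_lon=("longitude", "median"),
      )
      .reset_index()  # turn the group keys back into ordinary columns
)
print(summary)
```

The reset_index call is optional; without it, device_id and category stay in the index of the result.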

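Note 2 concerns your new_col: agg requires that the functions passed to it reduce a column to a scalar. You are better off adding the new column before the groupby/agg.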
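
A rough sketch of that second note, again on made-up data: the column lat_plus_lon is a hypothetical stand-in for your_function, computed row by row before grouping so that agg only has to reduce it. The last two statements show one possible counterpart of the data.table := line from the question, using transform to write the group mean back onto every original row; the 0.6 threshold is arbitrary.

```python
import pandas as pd

# Toy data: column names mirror the question, values are invented.
df = pd.DataFrame({
    "device_id": [1, 1, 2, 2, 2],
    "category": ["free", "free", "paid", "paid", "paid"],
    "is_active": [0, 1, 1, 1, 0],
    "latitude": [10.0, 20.0, 30.0, 50.0, 40.0],
    "longitude": [1.0, 3.0, 2.0, 5.0, 7.0],
})
keys = ["device_id", "category"]

# Hypothetical stand-in for your_function: combine columns row-wise first.
df["lat_plus_lon"] = df["latitude"] + df["longitude"]

# Now every aggregation reduces a single column to a scalar per group.
summary = df.groupby(keys).agg(
    mean_active=("is_active", "mean"),
    new_col=("lat_plus_lon", "max"),
).reset_index()

# transform returns a result aligned with the original rows,
# so it can be assigned straight back as a new column.
df["mean_active"] = df.groupby(keys)["is_active"].transform("mean")
active_rows = df[df["mean_active"] > 0.6]
print(summary)
print(active_rows)
```

If the function genuinely needs several columns of a whole group at once, groupby(...).apply with a function that takes the group DataFrame is the usual fallback, at some cost in speed.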