pandas groupby: top 3 values for each group

Ale*_*oca 2 python dataframe pandas pandas-groupby

A new, more general question has been posted as "pandas groupby: top 3 values in each group and store in DataFrame", and a working solution has been given there.

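In this example I create a DataFrame df containing some random data at 5-minute intervals.
I want to create a DataFrame gdf (grouped df) that lists the 3 highest values for each hour.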

That is, from this series of values:

                     VAL
TIME
2017-12-08 00:00:00   29
2017-12-08 00:05:00   56
2017-12-08 00:10:00   82
2017-12-08 00:15:00   13
2017-12-08 00:20:00   35
2017-12-08 00:25:00   53
2017-12-08 00:30:00   25
2017-12-08 00:35:00   23
2017-12-08 00:40:00   21
2017-12-08 00:45:00   12
2017-12-08 00:50:00   15
2017-12-08 00:55:00    9
2017-12-08 01:00:00   13
2017-12-08 01:05:00   87
2017-12-08 01:10:00    9
2017-12-08 01:15:00   63
2017-12-08 01:20:00   62
2017-12-08 01:25:00   52
2017-12-08 01:30:00   43
2017-12-08 01:35:00   77
2017-12-08 01:40:00   95
2017-12-08 01:45:00   79
2017-12-08 01:50:00   77
2017-12-08 01:55:00    5
2017-12-08 02:00:00   78
2017-12-08 02:05:00   41
2017-12-08 02:10:00   10
2017-12-08 02:15:00   10
2017-12-08 02:20:00   88

I am very close to a solution, but I cannot find the right syntax for the last step. What I get so far (largest3) is:

                                           VAL
TIME                  TIME
2017-12-08 00:00:00   2017-12-08 00:10:00   82
                      2017-12-08 00:05:00   56
                      2017-12-08 00:25:00   53
2017-12-08 01:00:00   2017-12-08 01:40:00   95
                      2017-12-08 01:05:00   87
                      2017-12-08 01:45:00   79
2017-12-08 02:00:00   2017-12-08 02:20:00   88
                      2017-12-08 02:00:00   78
                      2017-12-08 02:05:00   41

From that I would like to obtain this gdf (the time at which each maximum was reached does not matter):

                     VAL1  VAL2  VAL3
TIME
2017-12-08 00:00:00    82    56    53
2017-12-08 01:00:00    95    87    79
2017-12-08 02:00:00    88    78    41

Here is the code:

import numpy as np
import pandas as pd
from datetime import datetime, timedelta

# test data: random integers at 5-minute intervals over 0.1 day (2 h 24 min)
date_ref = datetime(2017, 12, 8, 0, 0, 0)
days = pd.date_range(date_ref, date_ref + timedelta(0.1), freq='5min')
np.random.seed(seed=1111)
data1 = np.random.randint(1, high=100, size=len(days))
df = pd.DataFrame({'TIME': days, 'VAL': data1})
df = df.set_index('TIME')
print(df)
print("----")

# groupby: 3 largest values per hour
# (newer pandas releases prefer the lowercase alias '1h' over '1H')
group1 = df.groupby(pd.Grouper(freq='1H'))
largest3 = pd.DataFrame(group1['VAL'].nlargest(3))
print(largest3)

gdf = pd.DataFrame()
# ???? <-------------------

Thanks in advance.

Max*_*axU 5

Note: this solution only works if every group has at least 3 rows.

Try the following:

In [59]: x = (df.groupby(pd.Grouper(freq='H'))['VAL']
                .apply(lambda x: x.nlargest(3))
                .reset_index(level=1, drop=True)
                .to_frame('VAL'))

In [60]: x
Out[60]:
                     VAL
TIME
2017-12-08 00:00:00   82
2017-12-08 00:00:00   56
2017-12-08 00:00:00   53
2017-12-08 01:00:00   95
2017-12-08 01:00:00   87
2017-12-08 01:00:00   79
2017-12-08 02:00:00   88
2017-12-08 02:00:00   78
2017-12-08 02:00:00   41

In [61]: x.set_index(np.arange(len(x)) % 3, append=True)['VAL'].unstack().add_prefix('VAL')
Out[61]:
                     VAL0  VAL1  VAL2
TIME
2017-12-08 00:00:00    82    56    53
2017-12-08 01:00:00    95    87    79
2017-12-08 02:00:00    88    78    41
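The result above is labelled VAL0..VAL2, while the question asked for VAL1..VAL3. A minimal sketch of one way to get 1-based labels, assuming the same approach and simply shifting the position counter by one. The values are hard-coded from the question's printout so the snippet runs on its own, and the lowercase `"h"` alias is an assumption about the pandas version in use (the answer's `'H'` is the older spelling):

```python
import numpy as np
import pandas as pd

# Values copied from the question's printout (5-minute spacing).
values = [29, 56, 82, 13, 35, 53, 25, 23, 21, 12, 15, 9,
          13, 87, 9, 63, 62, 52, 43, 77, 95, 79, 77, 5,
          78, 41, 10, 10, 88]
idx = pd.date_range("2017-12-08", periods=len(values), freq="5min", name="TIME")
df = pd.DataFrame({"VAL": values}, index=idx)

# Same first step as in the answer: the 3 largest values per hour,
# with the inner (original timestamp) index level dropped.
x = (df.groupby(pd.Grouper(freq="h"))["VAL"]
       .apply(lambda s: s.nlargest(3))
       .reset_index(level=1, drop=True)
       .to_frame("VAL"))

# Position counter 1, 2, 3, 1, 2, 3, ... instead of 0, 1, 2, ...
# Like the original, this still assumes exactly 3 rows per group.
pos = np.arange(len(x)) % 3 + 1

gdf = x.set_index(pos, append=True)["VAL"].unstack().add_prefix("VAL")
print(gdf)
```

Only the `+ 1` differs from the answer's code; everything else is unchanged.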

Some explanation:

In [94]: x.set_index(np.arange(len(x)) % 3, append=True)
Out[94]:
                       VAL
TIME
2017-12-08 00:00:00 0   82
                    1   56
                    2   53
2017-12-08 01:00:00 0   95
                    1   87
                    2   79
2017-12-08 02:00:00 0   88
                    1   78
                    2   41

In [95]: x.set_index(np.arange(len(x)) % 3, append=True)['VAL'].unstack()
Out[95]:
                      0   1   2
TIME
2017-12-08 00:00:00  82  56  53
2017-12-08 01:00:00  95  87  79
2017-12-08 02:00:00  88  78  41
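As noted at the top of this answer, `np.arange(len(x)) % 3` only lines up when every group contributes exactly 3 rows. If some hour might have fewer, one possible variant is to number the rows inside each group with `cumcount()` instead. The following is a sketch rather than part of the original answer: the helper name `top_n_per_hour` and the intermediate column names `hour` and `rank` are made up for illustration.

```python
import pandas as pd

# Values copied from the question's printout (5-minute spacing).
values = [29, 56, 82, 13, 35, 53, 25, 23, 21, 12, 15, 9,
          13, 87, 9, 63, 62, 52, 43, 77, 95, 79, 77, 5,
          78, 41, 10, 10, 88]
idx = pd.date_range("2017-12-08", periods=len(values), freq="5min", name="TIME")
df = pd.DataFrame({"VAL": values}, index=idx)


def top_n_per_hour(frame, n=3, col="VAL"):
    """Hypothetical helper: the n largest values of `col` per hour, one row per hour."""
    # Sort all values in descending order once.
    s = frame[col].sort_values(ascending=False)
    # Start of the hour each (sorted) row belongs to.
    hour = s.index.floor("h").to_numpy()
    # 1-based rank of each row within its hour, following the sorted order.
    rank = (s.groupby(hour).cumcount() + 1).to_numpy()
    keep = rank <= n
    long = pd.DataFrame({"hour": hour[keep], "rank": rank[keep], col: s.to_numpy()[keep]})
    # One row per hour, one column per rank.
    wide = long.pivot(index="hour", columns="rank", values=col)
    # Make sure all n columns exist even if no hour has n rows.
    wide = wide.reindex(columns=range(1, n + 1))
    wide.columns = [f"{col}{i}" for i in wide.columns]
    wide.index.name = frame.index.name
    return wide


gdf = top_n_per_hour(df)
print(gdf)

# Drop the last three samples so the 02:00 hour has only two rows.
short = top_n_per_hour(df.iloc[:26])
print(short)
```

With the truncated data, the 02:00 row has a missing value in VAL3, so the result is typically upcast to float. Hours that contain no rows at all do not appear in the output.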