Ale*_*oca 2 python dataframe pandas pandas-groupby
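A newer, more general version of this question has been posted as "pandas groupby: top 3 values in each group and store in DataFrame", and a working solution has been given there.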
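
In this example, I create a dataframe df containing some random data at 5-minute intervals.
I want to create a dataframe gdf (grouped df) that lists the 3 highest values for each hour.

That is, from this series of values: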

                     VAL
TIME
2017-12-08 00:00:00   29
2017-12-08 00:05:00   56
2017-12-08 00:10:00   82
2017-12-08 00:15:00   13
2017-12-08 00:20:00   35
2017-12-08 00:25:00   53
2017-12-08 00:30:00   25
2017-12-08 00:35:00   23
2017-12-08 00:40:00   21
2017-12-08 00:45:00   12
2017-12-08 00:50:00   15
2017-12-08 00:55:00    9
2017-12-08 01:00:00   13
2017-12-08 01:05:00   87
2017-12-08 01:10:00    9
2017-12-08 01:15:00   63
2017-12-08 01:20:00   62
2017-12-08 01:25:00   52
2017-12-08 01:30:00   43
2017-12-08 01:35:00   77
2017-12-08 01:40:00   95
2017-12-08 01:45:00   79
2017-12-08 01:50:00   77
2017-12-08 01:55:00    5
2017-12-08 02:00:00   78
2017-12-08 02:05:00   41
2017-12-08 02:10:00   10
2017-12-08 02:15:00   10
2017-12-08 02:20:00   88

I am very close to a solution, but I cannot find the right syntax for the last step. What I have so far (largest3) is:

                                         VAL
TIME                TIME
2017-12-08 00:00:00 2017-12-08 00:10:00   82
                    2017-12-08 00:05:00   56
                    2017-12-08 00:25:00   53
2017-12-08 01:00:00 2017-12-08 01:40:00   95
                    2017-12-08 01:05:00   87
                    2017-12-08 01:45:00   79
2017-12-08 02:00:00 2017-12-08 02:20:00   88
                    2017-12-08 02:00:00   78
                    2017-12-08 02:05:00   41

From this I would like to obtain the following gdf (the time at which each maximum occurs does not matter):

                     VAL1  VAL2  VAL3
TIME
2017-12-08 00:00:00    82    56    53
2017-12-08 01:00:00    95    87    79
2017-12-08 02:00:00    88    78    41

Here is the code:

import pandas as pd
from datetime import datetime, timedelta
import numpy as np

# test data: random integers every 5 minutes, starting 2017-12-08 00:00
date_ref = datetime(2017, 12, 8, 0, 0, 0)
days = pd.date_range(date_ref, date_ref + timedelta(0.1), freq='5min')
np.random.seed(seed=1111)
data1 = np.random.randint(1, high=100, size=len(days))
df = pd.DataFrame({'TIME': days, 'VAL': data1})
df = df.set_index('TIME')
print(df)
print("----")

# groupby: 3 largest values per hour
# (recent pandas releases deprecate the 'H' alias in favour of 'h')
group1 = df.groupby(pd.Grouper(freq='1H'))
largest3 = pd.DataFrame(group1['VAL'].nlargest(3))
print(largest3)

gdf = pd.DataFrame()
# ???? <-------------------

Thank you in advance.

Note: this solution only works if every group has at least 3 rows.

Try the following:
In [59]: x = (df.groupby(pd.Grouper(freq='H'))['VAL']
.apply(lambda x: x.nlargest(3))
.reset_index(level=1, drop=True)
.to_frame('VAL'))
In [60]: x
Out[60]:
VAL
TIME
2017-12-08 00:00:00 82
2017-12-08 00:00:00 56
2017-12-08 00:00:00 53
2017-12-08 01:00:00 95
2017-12-08 01:00:00 87
2017-12-08 01:00:00 79
2017-12-08 02:00:00 88
2017-12-08 02:00:00 78
2017-12-08 02:00:00 41
In [61]: x.set_index(np.arange(len(x)) % 3, append=True)['VAL'].unstack().add_prefix('VAL')
Out[61]:
VAL0 VAL1 VAL2
TIME
2017-12-08 00:00:00 82 56 53
2017-12-08 01:00:00 95 87 79
2017-12-08 02:00:00 88 78 41
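The session above relies on `df` from the question. As a rough, self-contained sketch of the same steps, the script below hard-codes the values from the question's listing instead of regenerating them randomly (an assumption made here so the result is reproducible) and numbers the columns from 1 so that they match the requested VAL1..VAL3. The lowercase `'h'` alias is also a choice made here, since recent pandas releases deprecate `'H'`.

```python
import numpy as np
import pandas as pd

# Values copied from the question's listing (5-minute samples from midnight).
vals = [29, 56, 82, 13, 35, 53, 25, 23, 21, 12, 15, 9,
        13, 87, 9, 63, 62, 52, 43, 77, 95, 79, 77, 5,
        78, 41, 10, 10, 88]
idx = pd.date_range("2017-12-08", periods=len(vals), freq="5min", name="TIME")
df = pd.DataFrame({"VAL": vals}, index=idx)

# Top 3 values per hour; drop the inner index level (the original timestamps).
x = (df.groupby(pd.Grouper(freq="h"))["VAL"]
       .apply(lambda s: s.nlargest(3))
       .reset_index(level=1, drop=True)
       .to_frame("VAL"))

# Position 1..3 within each hour (assumes exactly 3 rows per hour), then pivot
# that position into the columns.
position = np.arange(len(x)) % 3 + 1
gdf = x.set_index(position, append=True)["VAL"].unstack().add_prefix("VAL")
print(gdf)
```

With this input, `gdf` has one row per hour holding 82/56/53, 95/87/79 and 88/78/41. Like the original, it assumes every hour contributes exactly 3 rows.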
Some explanation:
In [94]: x.set_index(np.arange(len(x)) % 3, append=True)
Out[94]:
VAL
TIME
2017-12-08 00:00:00 0 82
1 56
2 53
2017-12-08 01:00:00 0 95
1 87
2 79
2017-12-08 02:00:00 0 88
1 78
2 41
In [95]: x.set_index(np.arange(len(x)) % 3, append=True)['VAL'].unstack()
Out[95]:
0 1 2
TIME
2017-12-08 00:00:00 82 56 53
2017-12-08 01:00:00 95 87 79
2017-12-08 02:00:00 88 78 41
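The note at the top of this answer says that at least 3 rows per group are needed. The reason is that `np.arange(len(x)) % 3` assumes every group contributes exactly 3 rows. One possible way around that, sketched here, is to replace it with a descending sort followed by a per-hour `cumcount()`. This is not the answerer's method: the helper name `top_n_per_hour`, its parameters and the toy data are hypothetical and exist only for illustration.

```python
import pandas as pd

def top_n_per_hour(df, col="VAL", n=3):
    """Hypothetical helper: the n largest values of `col` per hour, one row per hour.

    Hours with fewer than n rows get NaN in the missing columns.
    """
    s = df[col].sort_values(ascending=False)   # largest values first
    hour = s.index.floor("h")                  # hour bucket of each sample
    rank = s.groupby(hour).cumcount() + 1      # 1 = largest within its hour
    keep = (rank <= n).to_numpy()
    idx = pd.MultiIndex.from_arrays([hour[keep], rank[keep]],
                                    names=["TIME", "rank"])
    wide = pd.Series(s.to_numpy()[keep], index=idx).unstack("rank")
    # Make sure all n columns exist even if no hour has n rows.
    return wide.reindex(columns=range(1, n + 1)).add_prefix(col)

# Usage: the second hour has only two samples.
idx = pd.to_datetime(["2017-12-08 00:00", "2017-12-08 00:05",
                      "2017-12-08 00:10", "2017-12-08 00:15",
                      "2017-12-08 01:00", "2017-12-08 01:05"])
df = pd.DataFrame({"VAL": [5, 9, 1, 7, 4, 8]}, index=idx)
out = top_n_per_hour(df)
print(out)
```

Because missing positions are filled with NaN, the result columns become floats whenever some hour has fewer than `n` rows.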