mar*_*rin 2 python dataframe python-3.x pandas pandas-groupby
I am trying to find the month (column 'Month') with the largest total delay (in the DepDelay column).
Data:
flightID Month ArrTime ActualElapsedTime DepDelay ArrDelay
BBYYEUVY67527 1 1514.0 58.0 NA 64.0
MUPXAQFN40227 1 37.0 120.0 13 52.0
LQLYUIMN79169 1 916.0 166.0 NA -25.0
KTAMHIFO10843 1 NaN NaN 5 NaN
BOOXJTEY23623 1 NaN NaN 4 NaN
BBYYEUVY67527 2 1514.0 58.0 NA 64.0
MUPXAQFN40227 2 37.0 120.0 NA 52.0
LQLYUIMN79169 2 916.0 166.0 NA -25.0
KTAMHIFO10843 2 NaN NaN 15 NaN
BOOXJTEY23623 2 NaN NaN 4 NaN
I tried:
data = pd.read_csv('data.csv', sep='\t')
dep_delay = all_data.groupby(["Month"].DepDelay.count().max())
print(dep_delay)
Error:
AttributeError Traceback (most recent call last)
<ipython-input-14-2ea6213009d6> in <module>()
----> 1 dep_delay = all_data.groupby(["Month"].DepDelay.count().max())
2
3 print(dep_delay)
AttributeError: 'list' object has no attribute 'DepDelay'
Desired output:
Month DepDelay
1 22
You need sum, not count, to add up the values within each group. Here is one way, using GroupBy + sum followed by idxmax:
res = df.groupby('Month')['DepDelay'].sum().reset_index()
res = res.loc[[res['DepDelay'].idxmax()]]
print(res)
Month DepDelay
0 1 22.0
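
For anyone who wants to reproduce this without the original file, here is a minimal sketch. It assumes the data is tab-separated (as the sep='\t' in the question suggests) and keeps only the flightID, Month and DepDelay columns from the sample; the names rows, counts and totals are illustrative and not from the original post. It also shows why count gives the wrong answer: count only reports how many non-missing delays each month has, while sum adds them up. The AttributeError in the question is a separate problem: the closing parenthesis of groupby(...) is in the wrong place, so .DepDelay is looked up on the list ["Month"] rather than on the grouped object.

```python
import io
import pandas as pd

# Sample rows from the question, trimmed to three columns (an assumption
# made to keep the example short). "NA" is read as NaN by default.
rows = [
    "flightID\tMonth\tDepDelay",
    "BBYYEUVY67527\t1\tNA",
    "MUPXAQFN40227\t1\t13",
    "LQLYUIMN79169\t1\tNA",
    "KTAMHIFO10843\t1\t5",
    "BOOXJTEY23623\t1\t4",
    "BBYYEUVY67527\t2\tNA",
    "MUPXAQFN40227\t2\tNA",
    "LQLYUIMN79169\t2\tNA",
    "KTAMHIFO10843\t2\t15",
    "BOOXJTEY23623\t2\t4",
]
df = pd.read_csv(io.StringIO("\n".join(rows)), sep="\t")

# count() returns the number of non-missing delays per month.
counts = df.groupby("Month")["DepDelay"].count()

# sum() adds the delays per month, skipping NaN.
totals = df.groupby("Month")["DepDelay"].sum()

# Keep the single row whose total is largest (month 1, total 22.0).
res = totals.reset_index()
res = res.loc[[res["DepDelay"].idxmax()]]
print(res)
```

With this sample, counts is 3 for month 1 and 2 for month 2, while totals is 22.0 and 19.0, so only the sum matches the desired output.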
Alternatively, you can group and sum, sort the totals in descending order, then extract the first row:
res = df.groupby('Month')['DepDelay'].sum()\
.sort_values(ascending=False).head(1)\
.reset_index()
print(res)
Month DepDelay
0 1 22.0
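
If only the month label and its total are needed, the reset_index step can be skipped. The following is a sketch rather than part of the original answer: it builds a small frame by hand (an assumption, standing in for the file in the question) and uses Series.idxmax, Series.max and Series.nlargest on the grouped totals. The names best_month, best_total and top are illustrative.

```python
import pandas as pd

nan = float("nan")

# Hand-built stand-in for the question's data (Month and DepDelay only).
df = pd.DataFrame({
    "Month":    [1, 1, 1, 1, 1, 2, 2, 2, 2, 2],
    "DepDelay": [nan, 13, nan, 5, 4, nan, nan, nan, 15, 4],
})

# Per-month totals; the index of this Series is the Month value.
totals = df.groupby("Month")["DepDelay"].sum()

# idxmax() on the Series returns the index label, i.e. the month itself.
best_month = totals.idxmax()
best_total = totals.max()

# nlargest(1) is a shorthand for sorting descending and taking the first row.
top = totals.nlargest(1).reset_index()
print(best_month, best_total)
print(top)
```

Note that idxmax and nlargest(1) both return a single month; if two months tie for the largest total, only the first one is reported.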