Rub*_*ans 22 python sorting group-by count pandas
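I have a DataFrame built from the values in a file. I group it by two columns and get back the aggregated counts. Now I want to sort by the count, largest first, but I get the following error:
KeyError: 'count'
It looks like the count column produced by agg is some kind of index, so I don't know how to do this. I'm a beginner with Python and pandas. Here is the actual code; let me know if you need more details: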
def answer_five():
    df = census_df#.set_index(['STNAME'])
    df = df[df['SUMLEV'] == 50]
    df = df[['STNAME','CTYNAME']].groupby(['STNAME']).agg(['count']).sort(['count'])
    #df.set_index(['count'])
    print(df.index)
    # get sorted count max item
    return df.head(5)
jez*_*ael 47
I think you need to add reset_index, and then use sort_values with the parameter ascending=False, because sort returns:
FutureWarning: sort(columns=....) is deprecated, use sort_values(by=.....)
.sort_values(['count'], ascending=False)
df = df[['STNAME','CTYNAME']].groupby(['STNAME'])['CTYNAME'] \
.count() \
.reset_index(name='count') \
.sort_values(['count'], ascending=False) \
.head(5)
Sample:
df = pd.DataFrame({'STNAME':list('abscscbcdbcsscae'),
'CTYNAME':[4,5,6,5,6,2,3,4,5,6,4,5,4,3,6,5]})
print (df)
CTYNAME STNAME
0 4 a
1 5 b
2 6 s
3 5 c
4 6 s
5 2 c
6 3 b
7 4 c
8 5 d
9 6 b
10 4 c
11 5 s
12 4 s
13 3 c
14 6 a
15 5 e
df = df[['STNAME','CTYNAME']].groupby(['STNAME'])['CTYNAME'] \
.count() \
.reset_index(name='count') \
.sort_values(['count'], ascending=False) \
.head(5)
print (df)
STNAME count
2 c 5
5 s 4
1 b 3
0 a 2
3 d 1
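As a rough illustration of why the original code raises KeyError: 'count', the sketch below reuses the sample data above (the name census is a hypothetical stand-in for census_df). It assumes the cause is that agg(['count']) nests the result under the original column name, so no plain 'count' label exists.

```python
import pandas as pd

# Hypothetical stand-in for census_df, using the sample values from above.
census = pd.DataFrame({
    'STNAME': list('abscscbcdbcsscae'),
    'CTYNAME': [4, 5, 6, 5, 6, 2, 3, 4, 5, 6, 4, 5, 4, 3, 6, 5],
})

# agg(['count']) keeps the original column name as an outer level,
# so the columns are a MultiIndex and 'count' alone is not a valid key.
nested = census.groupby('STNAME').agg(['count'])
print(nested.columns.tolist())  # [('CTYNAME', 'count')]

# Option 1: sort by the full tuple key of the nested column.
top_nested = nested.sort_values(('CTYNAME', 'count'), ascending=False).head(5)

# Option 2: select the column first, so count() returns a Series
# that reset_index(name=...) can turn into a flat 'count' column.
flat = (census.groupby('STNAME')['CTYNAME']
              .count()
              .reset_index(name='count')
              .sort_values('count', ascending=False)
              .head(5))
print(flat)
```

Option 2 is the approach used in this answer; option 1 is only there to show that the nested column can still be addressed if you prefer to keep agg.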
However, it seems you need Series.nlargest:
df = df[['STNAME','CTYNAME']].groupby(['STNAME'])['CTYNAME'].count().nlargest(5)
Or:
df = df[['STNAME','CTYNAME']].groupby(['STNAME'])['CTYNAME'].size().nlargest(5)
The difference between size and count is that size counts NaN values, while count does not.
Sample:
df = pd.DataFrame({'STNAME':list('abscscbcdbcsscae'),
'CTYNAME':[4,5,6,5,6,2,3,4,5,6,4,5,4,3,6,5]})
print (df)
CTYNAME STNAME
0 4 a
1 5 b
2 6 s
3 5 c
4 6 s
5 2 c
6 3 b
7 4 c
8 5 d
9 6 b
10 4 c
11 5 s
12 4 s
13 3 c
14 6 a
15 5 e
df = df[['STNAME','CTYNAME']].groupby(['STNAME'])['CTYNAME'] \
                             .size() \
                             .nlargest(5) \
                             .reset_index(name='top5')
print (df)
STNAME top5
0 c 5
1 s 4
2 b 3
3 a 2
4 d 1
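The sample above contains no missing values, so size and count give the same numbers there. Here is a minimal sketch of where they diverge, using a small made-up frame (df_nan is hypothetical) with one NaN in CTYNAME:

```python
import numpy as np
import pandas as pd

# Hypothetical data: one CTYNAME value in group 'a' is missing.
df_nan = pd.DataFrame({
    'STNAME': ['a', 'a', 'b', 'b', 'b'],
    'CTYNAME': [1.0, np.nan, 3.0, 4.0, 5.0],
})

grouped = df_nan.groupby('STNAME')['CTYNAME']

sizes = grouped.size()    # number of rows per group, NaN included
counts = grouped.count()  # number of non-null values per group

print(sizes.to_dict())   # {'a': 2, 'b': 3}
print(counts.to_dict())  # {'a': 1, 'b': 3}
```

If the goal is "how many rows per state", size is probably the safer choice; count answers "how many non-missing values" instead.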
Chr*_*anz 10
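I don't know exactly what your df looks like. But if you need to sort the frequencies of several categories by their counts, it is easier to slice a Series out of the df and sort that Series: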
series = df.count().sort_values(ascending=False)
series.head()
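Note that this Series will use the category names as its index!
Some of the existing answers are outdated. The following solution works for listing a column's distinct values together with their frequencies: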
df = df[col].value_counts(ascending=False).reset_index()
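One caveat with the value_counts line above: as far as I know, the column names produced by a bare reset_index() changed in pandas 2.0. A sketch that should give the same column names on either side of that change, assuming a frame with an STNAME column (df_demo is a hypothetical name), names both columns explicitly:

```python
import pandas as pd

# Hypothetical frame reusing the STNAME values from the sample above.
df_demo = pd.DataFrame({'STNAME': list('abscscbcdbcsscae')})

freq = (df_demo['STNAME']
        .value_counts()               # sorted by frequency, descending by default
        .rename_axis('STNAME')        # name the index explicitly
        .reset_index(name='count'))   # name the value column explicitly

print(freq.head(4))
```

Because value_counts already sorts in descending order, no separate sort_values call is needed; freq.head(5) would give the top five directly.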