Sort an entire csv by frequency of occurrence in one column

jen*_*ryb 5 python csv sorting frequency pandas

I have a large csv file which is a log of caller data.

A short snippet of my file:

CompanyName    High Priority     QualityIssue
Customer1         Yes             User
Customer1         Yes             User
Customer2         No              User
Customer3         No              Equipment
Customer1         No              Neither
Customer3         No              User
Customer3         Yes             User
Customer3         Yes             Equipment
Customer4         No              User

I want to sort the entire list by how frequently each customer occurs, so that it looks like this:

CompanyName    High Priority     QualityIssue
Customer3         No               Equipment
Customer3         No               User
Customer3         Yes              User
Customer3         Yes              Equipment
Customer1         Yes              User
Customer1         Yes              User
Customer1         No               Neither
Customer2         No               User
Customer4         No               User

I have tried groupby, but that only prints out the company name and the frequency, not the other columns. I have also tried

df['Totals']= [sum(df['CompanyName'] == df['CompanyName'][i]) for i in xrange(len(df))]

df = [sum(df['CompanyName'] == df['CompanyName'][i]) for i in xrange(len(df))]

But these give me the error: ValueError: Wrong number of items passed 1, indices imply 24

I have looked at something like this:

for key, value in sorted(mydict.iteritems(), key=lambda (k,v): (v,k)):
    print "%s: %s" % (key, value)

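But this only prints out two columns, and I want to sort my entire csv. My output should be my whole csv, sorted by how often each value occurs in the first column.

Thank you in advance for your help!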

EdC*_*ica 8

This seems to do what you want: add a count column by performing a groupby and a transform with value_counts, and then sort on that column:

In [22]:

df['count'] = df.groupby('CompanyName')['CompanyName'].transform(pd.Series.value_counts)
df.sort('count', ascending=False)
Out[22]:
  CompanyName HighPriority QualityIssue count
5   Customer3           No         User     4
3   Customer3           No    Equipment     4
7   Customer3          Yes    Equipment     4
6   Customer3          Yes         User     4
0   Customer1          Yes         User     3
4   Customer1           No      Neither     3
1   Customer1          Yes         User     3
8   Customer4           No         User     1
2   Customer2           No         User     1

You can remove the extraneous column with df.drop:

In [24]:
df.drop('count', axis=1)

Out[24]:
  CompanyName HighPriority QualityIssue
5   Customer3           No         User
3   Customer3           No    Equipment
7   Customer3          Yes    Equipment
6   Customer3          Yes         User
0   Customer1          Yes         User
4   Customer1           No      Neither
1   Customer1          Yes         User
8   Customer4           No         User
2   Customer2           No         User
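
As a side note, here is a minimal sketch of a variant that never adds the helper column, so nothing has to be dropped afterwards. It is not the method used in this answer: it swaps in Series.map over the result of value_counts. The variable names freq, order and result are illustrative only, and the DataFrame is rebuilt from the question's sample data.

```python
import pandas as pd

# Sample data from the question, rebuilt by hand.
df = pd.DataFrame({
    'CompanyName': ['Customer1', 'Customer1', 'Customer2', 'Customer3', 'Customer1',
                    'Customer3', 'Customer3', 'Customer3', 'Customer4'],
    'HighPriority': ['Yes', 'Yes', 'No', 'No', 'No', 'No', 'Yes', 'Yes', 'No'],
    'QualityIssue': ['User', 'User', 'User', 'Equipment', 'Neither',
                     'User', 'User', 'Equipment', 'User'],
})

# Count how often each company occurs, then look that count up for every row.
freq = df['CompanyName'].map(df['CompanyName'].value_counts())

# Sort the per-row frequencies and reuse the resulting row labels to reorder df.
order = freq.sort_values(ascending=False).index
result = df.loc[order]

print(result)
```

The order of rows that share the same frequency is not guaranteed by this sketch.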


小智 6

2021 update

The answers proposed by EdChum and Ilya K. no longer work.


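The function pd.Series.value_counts returns a Series containing the counts of unique values. But the Series we apply pd.Series.value_counts to contains only one unique value, because we previously applied groupby to the DataFrame, which split the CompanyName Series into groups that each hold a single unique value. So the output of the function for one group looks like this: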

Customer3        4
dtype: int64

This makes no sense: we cannot transform each value in a Series into an entire Series. What we need is just the integer 4, not the whole Series.
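
The difference can be seen on a single group. The sketch below picks out the Customer3 rows by hand with a boolean mask, as a stand-in for the sub-Series that groupby produces, and the variable names are illustrative only.

```python
import pandas as pd

names = pd.Series(['Customer1', 'Customer1', 'Customer2', 'Customer3', 'Customer1',
                   'Customer3', 'Customer3', 'Customer3', 'Customer4'],
                  name='CompanyName')

# Stand-in for one group: only the rows belonging to Customer3.
group = names[names == 'Customer3']

per_group_counts = group.value_counts()  # a Series with a single entry (Customer3 -> 4)
per_group_size = group.count()           # a single number (4)

print(per_group_counts)
print(per_group_size)
```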


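However, we can still take advantage of groupby: count the number of values in each group, transform every value in the group into that count, and put the groups back together to form the final frequency Series.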

We can replace pd.Series.value_counts with pd.Series.count, or simply pass the function name 'count':

import pandas as pd

df = pd.DataFrame({'CompanyName': {0: 'Customer1', 1: 'Customer1', 2: 'Customer2', 3: 'Customer3', 4: 'Customer1', 5: 'Customer3', 6: 'Customer3', 7: 'Customer3', 8: 'Customer4'}, 'HighPriority': {0: 'Yes', 1: 'Yes', 2: 'No', 3: 'No', 4: 'No', 5: 'No', 6: 'Yes', 7: 'Yes', 8: 'No'}, 'QualityIssue': {0: 'User', 1: 'User', 2: 'User', 3: 'Equipment', 4: 'Neither', 5: 'User', 6: 'User', 7: 'Equipment', 8: 'User'}})

df['Frequency'] = df.groupby('CompanyName')['CompanyName'].transform('count')
df.sort_values('Frequency', inplace=True, ascending=False)

Output

>>> df

  CompanyName HighPriority QualityIssue  Frequency
3   Customer3           No    Equipment          4
5   Customer3           No         User          4
6   Customer3          Yes         User          4
7   Customer3          Yes    Equipment          4
0   Customer1          Yes         User          3
1   Customer1          Yes         User          3
4   Customer1           No      Neither          3
2   Customer2           No         User          1
8   Customer4           No         User          1
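
Since the question is about a csv file, the same steps can be wrapped in a read and a write. This is only a sketch under assumptions: the real file is taken to be comma-separated with the column names used above, and an in-memory io.StringIO stands in for it (a file path could be passed to read_csv and to_csv in the same way).

```python
import io
import pandas as pd

# Hypothetical stand-in for the real csv file (comma-separated is an assumption).
raw = io.StringIO(
    "CompanyName,HighPriority,QualityIssue\n"
    "Customer1,Yes,User\n"
    "Customer1,Yes,User\n"
    "Customer2,No,User\n"
    "Customer3,No,Equipment\n"
    "Customer1,No,Neither\n"
    "Customer3,No,User\n"
    "Customer3,Yes,User\n"
    "Customer3,Yes,Equipment\n"
    "Customer4,No,User\n"
)
df = pd.read_csv(raw)

# Add the per-row frequency, sort on it, then drop the helper column again.
df['Frequency'] = df.groupby('CompanyName')['CompanyName'].transform('count')
sorted_df = df.sort_values('Frequency', ascending=False).drop(columns='Frequency')

# With no path given, to_csv returns the csv text; index=False omits the row labels.
csv_text = sorted_df.to_csv(index=False)
print(csv_text)
```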


Ily*_* K. 5

The top-voted answer needs a minor addition: sort was deprecated in favor of sort_values and sort_index.

sort_values will work like this:

    import pandas as pd
    df = pd.DataFrame({'a': [1, 2, 1], 'b': [1, 2, 3]})
    df['count'] = \
    df.groupby('a')['a']\
    .transform(pd.Series.value_counts)
    df.sort_values('count', inplace=True, ascending=False)
    print('df sorted: \n{}'.format(df))
df sorted:
   a  b  count
0  1  1      2
2  1  3      2
1  2  2      1
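
To illustrate the split between the two replacement methods, here is a small sketch on the same toy frame: sort_values orders rows by the values in a column, while sort_index orders them by their index labels. The names by_value and by_label are illustrative only.

```python
import pandas as pd

df = pd.DataFrame({'a': [1, 2, 1], 'b': [1, 2, 3]})

# Order rows by the values in column 'b', largest first.
by_value = df.sort_values('b', ascending=False)

# Order rows by their index labels, which restores the original row order here.
by_label = by_value.sort_index()

print(by_value.index.tolist())  # [2, 1, 0]
print(by_label.index.tolist())  # [0, 1, 2]
```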