jen*_*ryb 5 python csv sorting frequency pandas
I have a large CSV file that is a log of caller data.
A short snippet of my file:
CompanyName High Priority QualityIssue
Customer1 Yes User
Customer1 Yes User
Customer2 No User
Customer3 No Equipment
Customer1 No Neither
Customer3 No User
Customer3 Yes User
Customer3 Yes Equipment
Customer4 No User
I want to sort the whole list by how frequently each customer appears, so that it looks like this:
CompanyName High Priority QualityIssue
Customer3 No Equipment
Customer3 No User
Customer3 Yes User
Customer3 Yes Equipment
Customer1 Yes User
Customer1 Yes User
Customer1 No Neither
Customer2 No User
Customer4 No User
I tried groupby, but that only prints the company name and its frequency, not the other columns. I also tried
df['Totals']= [sum(df['CompanyName'] == df['CompanyName'][i]) for i in xrange(len(df))]
and
df = [sum(df['CompanyName'] == df['CompanyName'][i]) for i in xrange(len(df))]
but these gave me the error: ValueError: Wrong number of items passed 1, indices imply 24
I have seen things like this:
for key, value in sorted(mydict.iteritems(), key=lambda (k,v): (v,k)):
    print "%s: %s" % (key, value)
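But this only prints two columns, and I want to sort my entire CSV. My output should be my whole CSV sorted by the first column.
Thanks in advance for your help!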
This seems to do what you want. It adds a count column by performing a groupby and a transform with value_counts, and you can then sort on that column:
In [22]:
df['count'] = df.groupby('CompanyName')['CompanyName'].transform(pd.Series.value_counts)
df.sort('count', ascending=False)
Out[22]:
CompanyName HighPriority QualityIssue count
5 Customer3 No User 4
3 Customer3 No Equipment 4
7 Customer3 Yes Equipment 4
6 Customer3 Yes User 4
0 Customer1 Yes User 3
4 Customer1 No Neither 3
1 Customer1 Yes User 3
8 Customer4 No User 1
2 Customer2 No User 1
You can remove the extraneous column with df.drop:
In [24]:
df.drop('count', axis=1)
Out[24]:
CompanyName HighPriority QualityIssue
5 Customer3 No User
3 Customer3 No Equipment
7 Customer3 Yes Equipment
6 Customer3 Yes User
0 Customer1 Yes User
4 Customer1 No Neither
1 Customer1 Yes User
8 Customer4 No User
2 Customer2 No User
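If the helper column is not wanted at all, the same ordering can be reached without adding and dropping it. The following is only a sketch under a few assumptions: it rebuilds the sample data from the question by hand, and the names freq, order and result are made up for illustration. It maps each row to its company's frequency, sorts those frequencies with a stable algorithm, and reorders the frame by the resulting index.

```python
import pandas as pd

# Sample data copied from the question (column names as used in the answers).
df = pd.DataFrame({
    'CompanyName': ['Customer1', 'Customer1', 'Customer2', 'Customer3', 'Customer1',
                    'Customer3', 'Customer3', 'Customer3', 'Customer4'],
    'HighPriority': ['Yes', 'Yes', 'No', 'No', 'No', 'No', 'Yes', 'Yes', 'No'],
    'QualityIssue': ['User', 'User', 'User', 'Equipment', 'Neither',
                     'User', 'User', 'Equipment', 'User'],
})

# Look up each row's company frequency without storing it in the frame.
freq = df['CompanyName'].map(df['CompanyName'].value_counts())

# mergesort is stable, so rows with equal frequency keep their original order.
order = freq.sort_values(ascending=False, kind='mergesort').index
result = df.loc[order]
print(result)
```

Because the sort is stable, rows that share a frequency stay in file order, which matches the layout requested in the question.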
小智 6
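The function pd.Series.value_counts returns a Series containing the counts of unique values. But the Series we apply pd.Series.value_counts to contains only one unique value, because we previously applied groupby to the DataFrame and split the CompanyName Series into groups that each hold a single unique value. So the output of the function for one group looks like this: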
Customer3 4
dtype: int64
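This makes no sense as a per-row value: we cannot turn each value in the group into a whole Series. What we need is just the integer 4, not the entire Series.
However, we can make better use of groupby: count the number of values in each group, transform every row of the group into that count, and put the results together to form the final frequency Series.
We can replace pd.Series.value_counts with pd.Series.count, or simply pass the function name 'count':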
import pandas as pd
df = pd.DataFrame({'CompanyName': {0: 'Customer1', 1: 'Customer1', 2: 'Customer2', 3: 'Customer3', 4: 'Customer1', 5: 'Customer3', 6: 'Customer3', 7: 'Customer3', 8: 'Customer4'}, 'HighPriority': {0: 'Yes', 1: 'Yes', 2: 'No', 3: 'No', 4: 'No', 5: 'No', 6: 'Yes', 7: 'Yes', 8: 'No'}, 'QualityIssue': {0: 'User', 1: 'User', 2: 'User', 3: 'Equipment', 4: 'Neither', 5: 'User', 6: 'User', 7: 'Equipment', 8: 'User'}})
df['Frequency'] = df.groupby('CompanyName')['CompanyName'].transform('count')
df.sort_values('Frequency', inplace=True, ascending=False)
>>> df
CompanyName HighPriority QualityIssue Frequency
3 Customer3 No Equipment 4
5 Customer3 No User 4
6 Customer3 Yes User 4
7 Customer3 Yes Equipment 4
0 Customer1 Yes User 3
1 Customer1 Yes User 3
4 Customer1 No Neither 3
2 Customer2 No User 1
8 Customer4 No User 1
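The difference between the two functions can be checked on a single group. This is a small sketch rather than part of the original answer; the names group, per_value, scalar and frequency are illustrative only. It shows that value_counts on one group yields a one-row Series, while count yields a single number that transform can broadcast to every row.

```python
import pandas as pd

names = pd.Series(['Customer1', 'Customer1', 'Customer2', 'Customer3', 'Customer1',
                   'Customer3', 'Customer3', 'Customer3', 'Customer4'])

# Pull out one group, roughly as groupby would hand it to the function.
group = names[names == 'Customer3']

per_value = group.value_counts()  # a one-row Series indexed by 'Customer3'
scalar = group.count()            # a single number

print(per_value.to_dict())
print(scalar)

# transform('count') broadcasts each group's count back to its rows.
frequency = names.groupby(names).transform('count')
print(frequency.tolist())
```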
The top-voted answer needs one small update: sort was deprecated in favor of sort_values and sort_index.
sort_values works like this:
import pandas as pd
df = pd.DataFrame({'a': [1, 2, 1], 'b': [1, 2, 3]})
# 'count' yields one number per group, which transform broadcasts to each row
df['count'] = df.groupby('a')['a'].transform('count')
df.sort_values('count', inplace=True, ascending=False)
print('df sorted: \n{}'.format(df))
df sorted:
   a  b  count
0  1  1      2
2  1  3      2
1  2  2      1
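One case the answers do not cover is two companies with the same frequency: sorting on the count alone may leave their rows interleaved. A possible way around this, sketched here with hypothetical data (companies 'A', 'B', 'C' and a made-up Value column), is to pass the company name as a second sort key so that tied companies stay grouped.

```python
import pandas as pd

# Hypothetical data: 'A' and 'B' both occur twice and are interleaved.
df = pd.DataFrame({'CompanyName': ['A', 'B', 'A', 'B', 'C'],
                   'Value': [1, 2, 3, 4, 5]})

df['count'] = df.groupby('CompanyName')['CompanyName'].transform('count')

# Sort by frequency first, then by name so tied companies stay together.
df_sorted = df.sort_values(['count', 'CompanyName'], ascending=[False, True])
print(df_sorted.drop(columns='count'))
```

The second key only decides the order between companies that share a count; the frequency still drives the overall ordering.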