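I have a CSV file with 3 columns and 11 rows, shown below; the first row is the header. I created this simple file myself for learning. Each row is one fruit order.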
OrderNo Fruit Origin
1 Apple NY
2 Orange FL
3 Banana CA
4 Pear NJ
5 Grapes VA
6 Grapes VA
7 Grapes MD
8 Grapes MA
9 Pineapple HI
10 Grapes GA
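I am trying to parse this data in Python to do the following:
(1) determine which state produced the most orders for each fruit, (2) determine the highest number of orders from any single state for each fruit, and (3) output the result in alphabetical order, as shown below: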
Apple NY 1
Banana CA 1
Grapes VA 2
Orange FL 1
Pear NJ 1
Pineapple HI 1
After reading the CSV file with csv.reader, I tried to do the counting with a Counter and a for loop:
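import csv
from collections import Counter

cnt = Counter()
f = open("/test.csv")
reader = csv.reader(f, delimiter=",")
header = next(reader)  # skip the header row through the reader, not the raw file
for row in reader:
    cnt[row[2]] += 1  # counts orders per state only (third column)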
But is there a better way?
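One way to stay with Counter is to key it on the (fruit, state) pair instead of on the state alone, then keep the pairs whose count equals that fruit's maximum. The following is only a minimal sketch: it assumes the file is comma-separated with the header shown above, and `CSV_TEXT` and `top_states` are hypothetical names made up for illustration (the data is inlined so the example runs without `/test.csv`).

```python
import csv
import io
from collections import Counter

# Inline copy of the sample data; assumed to be comma-separated.
CSV_TEXT = """OrderNo,Fruit,Origin
1,Apple,NY
2,Orange,FL
3,Banana,CA
4,Pear,NJ
5,Grapes,VA
6,Grapes,VA
7,Grapes,MD
8,Grapes,MA
9,Pineapple,HI
10,Grapes,GA
"""


def top_states(lines):
    """Return sorted (fruit, state, count) tuples for each fruit's top state(s)."""
    reader = csv.DictReader(lines)
    # Count orders per (fruit, state) pair.
    counts = Counter((row["Fruit"], row["Origin"]) for row in reader)
    # Highest single-state count for each fruit.
    best = {}
    for (fruit, _state), n in counts.items():
        best[fruit] = max(best.get(fruit, 0), n)
    # Keep every pair that reaches the maximum, so ties are all reported.
    return sorted(
        (fruit, state, n)
        for (fruit, state), n in counts.items()
        if n == best[fruit]
    )


for fruit, state, n in top_states(io.StringIO(CSV_TEXT)):
    print(fruit, state, n)
```

With a real file, an open file object could be passed to `top_states` in place of the `io.StringIO` wrapper.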
I would actually use pandas, which is a combination of list, dictionary, spreadsheet, and database. It is designed specifically for manipulating data this way.
import pandas as pd
from collections import defaultdict

path_to_file = "/test.csv"
df = pd.read_csv(path_to_file)
groups = df.groupby(['Fruit', 'Origin'])

# First pass through the groups: store the maximum for each fruit, to handle ties.
max_for_fruit = defaultdict(int)
for g in groups:
    fruit, count = g[0][0], len(g[1])
    max_for_fruit[fruit] = max(max_for_fruit[fruit], count)

# Second pass: print every (fruit, state) whose count equals that fruit's maximum.
for g in groups:
    fruit, state, count = g[0][0], g[0][1], len(g[1])
    if count == max_for_fruit[fruit]:
        print("{} {} {}".format(fruit, state, count))
Here is the output.
Apple NY 1
Banana CA 1
Grapes VA 2
Orange FL 1
Pear NJ 1
Pineapple HI 1
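The two explicit loops can probably also be expressed with pandas operations alone. The sketch below is one possible variant, not the method used above: it counts rows per (Fruit, Origin) with `groupby(...).size()`, then uses `transform("max")` to compare each count with its fruit's maximum. The `Orders` column name, `CSV_TEXT`, and `top_states` are hypothetical names chosen for illustration, and the data is inlined as comma-separated text so the example runs without `/test.csv`.

```python
import io

import pandas as pd

# Inline copy of the sample data; assumed to be comma-separated.
CSV_TEXT = """OrderNo,Fruit,Origin
1,Apple,NY
2,Orange,FL
3,Banana,CA
4,Pear,NJ
5,Grapes,VA
6,Grapes,VA
7,Grapes,MD
8,Grapes,MA
9,Pineapple,HI
10,Grapes,GA
"""


def top_states(df):
    """Return the rows of (Fruit, Origin, Orders) that reach each fruit's maximum."""
    # Number of orders for every (Fruit, Origin) pair.
    counts = df.groupby(["Fruit", "Origin"]).size().reset_index(name="Orders")
    # Per-fruit maximum, aligned row by row with `counts`.
    best = counts.groupby("Fruit")["Orders"].transform("max")
    # Keep all rows equal to the maximum (ties included), in alphabetical order.
    return counts[counts["Orders"] == best].sort_values(["Fruit", "Origin"])


df = pd.read_csv(io.StringIO(CSV_TEXT))
result = top_states(df)
for row in result.itertuples(index=False):
    print(row.Fruit, row.Origin, row.Orders)
```

As in the loop version, a fruit with several states tied for the maximum would appear once per tied state.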
http://pandas.pydata.org/pandas-docs/stable/groupby.html
http://pandas.pydata.org/pandas-docs/stable/generated/pandas.io.parsers.read_csv.html
http://pandas.pydata.org/pandas-docs/stable/tutorials.html