Python: how best to parse a CSV and count values only for a subset

Question

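I have a CSV file with the following contents: 3 columns and 11 rows, the first row being the header. I created this simple file myself for learning purposes. Each line item is a fruit order.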

OrderNo      Fruit     Origin
1           Apple        NY
2           Orange       FL
3           Banana       CA
4           Pear         NJ
5           Grapes       VA
6           Grapes       VA
7           Grapes       MD
8           Grapes       MA
9           Pineapple    HI
10          Grapes       GA

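I am trying to parse this data in Python to do the following:

(1) determine which state produced the most orders for each fruit, (2) determine the highest number of orders from any single state for each fruit, and (3) output the result in alphabetical order, as shown below: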

Apple NY 1
Banana CA 1
Grapes VA 2
Orange FL 1
Pear NJ 1
Pineapple HI 1

After reading the CSV file with csv.reader, I tried to do the counting with a Counter and a for loop:

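import csv
from collections import Counter

cnt = Counter()

# newline="" is what the csv module recommends when opening files
with open("/test.csv", newline="") as f:
    reader = csv.reader(f, delimiter=",")
    header = next(reader)  # skip the header row

    for row in reader:
        cnt[row[2]] += 1  # row[2] is the Origin column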

But is there a better way?

Answer 1

Tom*_*m M 5

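I would actually use pandas, which is something like a combination of a list, a dictionary, a spreadsheet, and a database. It is designed specifically for manipulating data in this way.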

import pandas as pd
from collections import defaultdict

path_to_file = "/test.csv"
df = pd.read_csv(path_to_file)

groups = df.groupby(['Fruit', 'Origin'])

# First pass: store the maximum count for each fruit so that ties can be handled.
max_for_fruit = defaultdict(int)
for (fruit, state), rows in groups:
    max_for_fruit[fruit] = max(max_for_fruit[fruit], len(rows))

# Second pass: print every (fruit, state) pair whose count equals that maximum.
for (fruit, state), rows in groups:
    if len(rows) == max_for_fruit[fruit]:
        print("{} {} {}".format(fruit, state, len(rows)))

This is the output:

Apple NY 1
Banana CA 1
Grapes VA 2
Orange FL 1
Pear NJ 1
Pineapple HI 1

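The two loops above can probably also be expressed without explicit iteration over the groups. The following is only a sketch under some assumptions: the file is comma-separated with the header `OrderNo,Fruit,Origin` (the sample data is inlined as `CSV_TEXT` so the snippet runs on its own), and the column name `Orders` is a label chosen here, not something from the original data.

```python
import io

import pandas as pd

# Assumption: the sample data from the question, stored comma-separated.
CSV_TEXT = """OrderNo,Fruit,Origin
1,Apple,NY
2,Orange,FL
3,Banana,CA
4,Pear,NJ
5,Grapes,VA
6,Grapes,VA
7,Grapes,MD
8,Grapes,MA
9,Pineapple,HI
10,Grapes,GA
"""

df = pd.read_csv(io.StringIO(CSV_TEXT))

# Count the orders for each (Fruit, Origin) pair.
counts = df.groupby(["Fruit", "Origin"]).size().reset_index(name="Orders")

# Keep the rows whose count equals the per-fruit maximum (ties are all kept).
per_fruit_max = counts.groupby("Fruit")["Orders"].transform("max")
best = counts[counts["Orders"] == per_fruit_max].sort_values(["Fruit", "Origin"])

for fruit, origin, orders in best.itertuples(index=False):
    print(fruit, origin, orders)
```

Like the loop version, this keeps every state that ties for the maximum, so a fruit can appear on more than one line.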

http://pandas.pydata.org/pandas-docs/stable/groupby.html

http://pandas.pydata.org/pandas-docs/stable/generated/pandas.io.parsers.read_csv.html

http://pandas.pydata.org/pandas-docs/stable/tutorials.html
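
For comparison, the Counter idea from the question can be extended to do the same job with only the standard library, by counting (fruit, state) pairs instead of states alone. This is a minimal sketch, not part of the original answer: the function name `top_states_per_fruit` is hypothetical, and the data is assumed to be comma-separated with the header `OrderNo,Fruit,Origin` (inlined here as `CSV_TEXT`).

```python
import csv
import io
from collections import Counter

# Assumption: the sample data from the question, stored comma-separated.
CSV_TEXT = """OrderNo,Fruit,Origin
1,Apple,NY
2,Orange,FL
3,Banana,CA
4,Pear,NJ
5,Grapes,VA
6,Grapes,VA
7,Grapes,MD
8,Grapes,MA
9,Pineapple,HI
10,Grapes,GA
"""


def top_states_per_fruit(csv_file):
    """Return sorted (fruit, state, count) tuples for each fruit's busiest state(s)."""
    reader = csv.DictReader(csv_file)  # uses the header row for field names
    counts = Counter((row["Fruit"], row["Origin"]) for row in reader)

    # Highest count seen for each fruit, across all of its states.
    best = {}
    for (fruit, _state), n in counts.items():
        best[fruit] = max(best.get(fruit, 0), n)

    # Keep every pair that reaches its fruit's maximum, in alphabetical order.
    return sorted(
        (fruit, state, n)
        for (fruit, state), n in counts.items()
        if n == best[fruit]
    )


results = top_states_per_fruit(io.StringIO(CSV_TEXT))
for fruit, state, n in results:
    print(fruit, state, n)
```

To read from disk instead, the same function could be called with a file opened via `open(path, newline="")`.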
