Analyzing a CSV file with Python

Rai*_*noa 0 python csv

I am trying to find the three most populous districts of the city (BYDEL) in 1992.

I have a CSV file, available here: http://data.kk.dk/dataset/9070067f-ab57-41cd-913e-bc37bfaf9acd/resource/9fbab4aa-1ee0-4d25-b2b4-b7b63537d2ec/download/befkbhalderkoencivst.csv

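The columns of the CSV file can be read as follows:

AAR: the year in which the observation was made

BYDEL: the district of the city, encoded as an integer: 1 = Indre By, 2 = Østerbro, 3 = Nørrebro, 4 = Vesterbro/Kgs. Enghave, 5 = Valby, 6 = Vanløse, 7 = Brønshøj-Husum, 8 = Bispebjerg, 9 = Amager Øst, 10 = Amager Vest, 99 = Udenfor inddeling

ALDER: the age of the people observed

PERSONER: the number of people observed with the characteristics given in that row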

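I have a solution, but it is very repetitive. I think it could be done more cleverly, but I don't have enough Python experience. Can someone point me in the right direction?

My code/solution looks like this: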

import collections
import pandas as pd

df = pd.read_csv('befkbh.csv',quotechar='"',skipinitialspace=True, delimiter=',', encoding='latin1').fillna(0)
data = df.to_numpy()  # DataFrame.as_matrix() was removed in pandas 1.0
Q31 = collections.defaultdict(list)
Q32 = collections.defaultdict(list)
Q33 = collections.defaultdict(list)
Q34 = collections.defaultdict(list)
Q35 = collections.defaultdict(list)
Q36 = collections.defaultdict(list)
Q37 = collections.defaultdict(list)
Q38 = collections.defaultdict(list)
Q39 = collections.defaultdict(list)
Q310 = collections.defaultdict(list)
Q399 = collections.defaultdict(list)
for row in data:
    key = row[0]
    if key == "" or key == 0: continue
    if key == 1992:
        if row[2] == 1:
            val = 0 if(row[5]) ==""  else float(row[5])
            Q31.setdefault(key,[]).append(val)
        if row[2] == 2:
            val = 0 if(row[5]) ==""  else float(row[5])
            Q32.setdefault(key,[]).append(val)
        if row[2] == 3:
            val = 0 if(row[5]) ==""  else float(row[5])
            Q33.setdefault(key,[]).append(val)
        if row[2] == 4:
            val = 0 if(row[5]) ==""  else float(row[5])
            Q34.setdefault(key,[]).append(val)
        if row[2] == 5:
            val = 0 if(row[5]) ==""  else float(row[5])
            Q35.setdefault(key,[]).append(val)
        if row[2] == 6:
            val = 0 if(row[5]) ==""  else float(row[5])
            Q36.setdefault(key,[]).append(val)
        if row[2] == 7:
            val = 0 if(row[5]) ==""  else float(row[5])
            Q37.setdefault(key,[]).append(val)
        if row[2] == 8:
            val = 0 if(row[5]) ==""  else float(row[5])
            Q38.setdefault(key,[]).append(val)
        if row[2] == 9:
            val = 0 if(row[5]) ==""  else float(row[5])
            Q39.setdefault(key,[]).append(val)
        if row[2] == 10:
            val = 0 if(row[5]) ==""  else float(row[5])
            Q310.setdefault(key,[]).append(val)
        if row[2] == 99:
            val = 0 if(row[5]) ==""  else float(row[5])
            Q399.setdefault(key,[]).append(val)

Q312 = {}
for k, v in Q31.items(): Q312[k] = sum(v)
for k, v in Q312.items(): print ("{}:{}".format(k,v))
Q322 = {}
for k, v in Q32.items(): Q322[k] = sum(v)
for k, v in Q322.items(): print ("{}:{}".format(k,v))
Q332 = {}
for k, v in Q33.items(): Q332[k] = sum(v)
for k, v in Q332.items(): print ("{}:{}".format(k,v))
Q342 = {}
for k, v in Q34.items(): Q342[k] = sum(v)
for k, v in Q342.items(): print ("{}:{}".format(k,v))
Q352 = {}
for k, v in Q35.items(): Q352[k] = sum(v)
for k, v in Q352.items(): print ("{}:{}".format(k,v))
Q362 = {}
for k, v in Q36.items(): Q362[k] = sum(v)
for k, v in Q362.items(): print ("{}:{}".format(k,v))
Q372 = {}
for k, v in Q37.items(): Q372[k] = sum(v)
for k, v in Q372.items(): print ("{}:{}".format(k,v))
Q382 = {}
for k, v in Q38.items(): Q382[k] = sum(v)
for k, v in Q382.items(): print ("{}:{}".format(k,v))
Q392 = {}
for k, v in Q39.items(): Q392[k] = sum(v)
for k, v in Q392.items(): print ("{}:{}".format(k,v))
Q3102 = {}
for k, v in Q310.items(): Q3102[k] = sum(v)
for k, v in Q3102.items(): print ("{}:{}".format(k,v))
Q3992 = {}
for k, v in Q399.items(): Q3992[k] = sum(v)
for k, v in Q3992.items(): print ("{}:{}".format(k,v))

DSM*_*DSM 5

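It's actually a very good sign that you've recognized there must be a simpler way! Whenever you find yourself violating the DRY principle (Don't Repeat Yourself), you should ask yourself whether you've made a misstep.

While you could remove much of the duplication simply by using a dictionary of dictionaries instead of all those named variables, since you're already using pandas, I'd take advantage of groupby and nlargest instead, which gives me: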

In [47]: dg = df.groupby(["AAR", "BYDEL"], as_index=False)["PERSONER"].sum()

In [48]: dg[dg.AAR == 1992].nlargest(3, "PERSONER")
Out[48]: 
    AAR  BYDEL  PERSONER
2  1992      3     67251
1  1992      2     62221
3  1992      4     47854

First, we group on the AAR and BYDEL columns, and within each group we take the PERSONER values and sum them. This gives us a frame that starts like this:

In [51]: dg.head(15)
Out[51]: 
     AAR  BYDEL  PERSONER
0   1992      1     40595
1   1992      2     62221
2   1992      3     67251
3   1992      4     47854
4   1992      5     43688
5   1992      6     34303
6   1992      7     36746
7   1992      8     41668
8   1992      9     45305
9   1992     10     42748
10  1992     99      2187
11  1993      1     40925
12  1993      2     62583
13  1993      3     67783
14  1993      4     47589

Then we select the rows where AAR == 1992 and, among those, the rows with the 3 largest PERSONER values.
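To try the two steps without downloading the file, here is a minimal sketch on a small made-up frame. The numbers are invented for illustration, `top_districts` and `BYDEL_NAMES` are hypothetical names (not part of pandas), and the name mapping is only a subset of the codes listed in the question. It filters by year before grouping, which should give the same rows as grouping first and filtering afterwards.

```python
import pandas as pd

# Synthetic stand-in for the real CSV; the values are made up.
df = pd.DataFrame({
    "AAR":      [1992, 1992, 1992, 1992, 1992, 1993, 1993],
    "BYDEL":    [1,    1,    2,    3,    4,    3,    3],
    "ALDER":    [0,    1,    0,    0,    0,    0,    1],
    "PERSONER": [10,   5,    30,   40,   20,   99,   1],
})

# Subset of the district codes given in the question.
BYDEL_NAMES = {1: "Indre By", 2: "Østerbro", 3: "Nørrebro",
               4: "Vesterbro/Kgs. Enghave"}

def top_districts(frame, year, n=3):
    """Return the n districts with the largest summed PERSONER for one year."""
    totals = (
        frame[frame["AAR"] == year]            # keep only the requested year
        .groupby("BYDEL", as_index=False)["PERSONER"]
        .sum()                                 # one total per district
        .nlargest(n, "PERSONER")               # keep the n biggest totals
    )
    totals["NAME"] = totals["BYDEL"].map(BYDEL_NAMES)
    return totals.reset_index(drop=True)

result = top_districts(df, 1992)
print(result)
```

Mapping the integer codes to names at the end keeps the grouping itself on the compact BYDEL column and only decorates the three rows that survive.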

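If you plan to do this kind of data processing, I strongly recommend working through a pandas tutorial; otherwise you'll find yourself reinventing the wheel.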
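For comparison, the dictionary-of-dictionaries alternative mentioned earlier might look roughly like the sketch below, using only the standard library. The sample rows are invented, the column layout is an assumption (the real file has more columns), and `totals_by_year_and_district` is a hypothetical helper name.

```python
import csv
import io
from collections import defaultdict

# Tiny made-up sample; the real file is assumed to share these column names.
SAMPLE = """AAR,BYDEL,ALDER,PERSONER
1992,1,0,10
1992,1,1,5
1992,2,0,30
1992,3,0,40
1992,4,0,20
1993,3,0,99
"""

def totals_by_year_and_district(rows):
    """Sum PERSONER into totals[year][district] instead of one variable per district."""
    totals = defaultdict(lambda: defaultdict(int))
    for row in rows:
        totals[int(row["AAR"])][int(row["BYDEL"])] += int(row["PERSONER"])
    return totals

totals = totals_by_year_and_district(csv.DictReader(io.StringIO(SAMPLE)))

# Sort the 1992 districts by total, largest first, and keep three.
top3 = sorted(totals[1992].items(), key=lambda kv: kv[1], reverse=True)[:3]
print(top3)
```

One nested dictionary replaces the eleven named variables and the eleven near-identical branches, though the pandas version is still shorter.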