使用具有不同格式的csv中的数据更新csv

Oli*_*eph 5 python csv python-2.7 pandas

我正在尝试使用其他来源提供的一些学生数据来更新csv文件,但是他们已经将他们的csv数据格式化为与我们的略有不同.

它需要根据三个标准匹配学生的名字,他们的班级,最后是该地点的前几个字母,所以对于B班的前几个学生Dumpt来说实际​​上是Dumpton Park.

找到匹配项时

  • 如果CSV 2中学生的记分卡为0或空白,则不应更新CSV 1中的分数列
  • 如果学生在CSV 2中的编号为0或为空,则不应更新CSV 1中的"否"列
  • 否则,它应该将CSV 2中的数字导入CSV1

以下是一些示例数据:

CSV 1

Class,Local,Name,DPE,JJK,Score,No
Class A,York,Tom,x,x,32,
Class A,York,Jim,x,x,10,
Class A,York,Sam,x,x,32,
Class B,Dumpton Park,Sarah,x,x,,
Class B,Dumpton Park,Bob,x,x,,
Class B,Dumpton Park,Bill,x,x,,
Class A,Dover,Andy,x,x,,
Class A,Dover,Hannah,x,x,,
Class B,London,Jemma,x,x,,
Class B,London,James,x,x,,
Run Code Online (Sandbox Code Playgroud)

CSV 2

"Class","Location","Student","Scorecard","Number"
"Class A","York","Jim","0","742"
"Class A","York","Sam","0","931"
"Class A","York","Tom","0","653"
"Class B","Dumpt","Bob","23.1","299"
"Class B","Dumpt","Bill","23.4","198"
"Class B","Dumpt","Sarah","23.5","12"
"Class A","Dover","Andy","23","983"
"Class A","Dover","Hannah","1","293"
"Class B","Lond","Jemma","32.2","0"
"Class B","Lond","James","32.0","0"
Run Code Online (Sandbox Code Playgroud)

CSV 1 UPDATED(这是所需的输出)

Class,Local,Name,DPE,JJK,Score,No
Class A,York,Tom,x,x,32,653
Class A,York,Jim,x,x,10,742
Class A,York,Sam,x,x,32,653
Class B,Dumpton Park,Sarah,x,x,23.5,12
Class B,Dumpton Park,Bob,x,x,23.1,299
Class B,Dumpton Park,Bill,x,x,23.4,198
Class A,Dover,Andy,x,x,23,983
Class A,Dover,Hannah,x,x,1,293
Class B,London,Jemma,x,x,32.2,
Class B,London,James,x,x,32.0,
Run Code Online (Sandbox Code Playgroud)

我真的很感激这个问题的任何帮助.谢谢奥利弗

Chr*_* H. 9

这里有两个解决方案:一个熊猫解决方案和一个普通的python解决方案.首先是一个熊猫解决方案,不出所料看起来像其他熊猫解决方案一样......

首先加载数据

import pandas
import numpy as np

cdf1 = pandas.read_csv('csv1',dtype=object)  #dtype = object allows us to preserve the numeric formats
cdf2 = pandas.read_csv('csv2',dtype=object)

col_order = cdf1.columns  #pandas will shuffle the column order at some point---this allows us to reset ot original column order
Run Code Online (Sandbox Code Playgroud)

此时数据框将如下所示

In [6]: cdf1
Out[6]: 
     Class         Local    Name DPE JJK Score   No
0  Class A          York     Tom   x   x    32  NaN
1  Class A          York     Jim   x   x    10  NaN
2  Class A          York     Sam   x   x    32  NaN
3  Class B  Dumpton Park   Sarah   x   x   NaN  NaN
4  Class B  Dumpton Park     Bob   x   x   NaN  NaN
5  Class B  Dumpton Park    Bill   x   x   NaN  NaN
6  Class A         Dover    Andy   x   x   NaN  NaN
7  Class A         Dover  Hannah   x   x   NaN  NaN
8  Class B        London   Jemma   x   x   NaN  NaN
9  Class B        London   James   x   x   NaN  NaN

In [7]: cdf2
Out[7]: 
     Class Location Student Scorecard Number
0  Class A     York     Jim         0    742
1  Class A     York     Sam         0    931
2  Class A     York     Tom         0    653
3  Class B    Dumpt     Bob      23.1    299
4  Class B    Dumpt    Bill      23.4    198
5  Class B    Dumpt   Sarah      23.5     12
6  Class A    Dover    Andy        23    983
7  Class A    Dover  Hannah         1    293
8  Class B     Lond   Jemma      32.2      0
9  Class B     Lond   James      32.0      0
Run Code Online (Sandbox Code Playgroud)

接下来将数据帧处理为匹配格式.

dcol = cdf2.Location 
cdf2['Location'] = dcol.apply(lambda x: x[0:4])  #Replacement in cdf2 since we don't need original data

dcol = cdf1.Local
cdf1['Location'] = dcol.apply(lambda x: x[0:4])  #Here we add a column leaving 'Local' because we'll need it for the final output

cdf2 = cdf2.rename(columns={'Student': 'Name', 'Scorecard': 'Score', 'Number': 'No'})
cdf2 = cdf2.replace('0', np.nan)  #Replacing '0' by np.nan means zeros don't overwrite

cdf1 = cdf1.set_index(['Class', 'Location', 'Name'])
cdf2 = cdf2.set_index(['Class', 'Location', 'Name'])
Run Code Online (Sandbox Code Playgroud)

现在cdf1和cdf2看起来像

In [16]: cdf1
Out[16]: 
                                Local DPE JJK Score   No
Class   Location Name                                   
Class A York     Tom             York   x   x    32  NaN
                 Jim             York   x   x    10  NaN
                 Sam             York   x   x    32  NaN
Class B Dump     Sarah   Dumpton Park   x   x   NaN  NaN
                 Bob     Dumpton Park   x   x   NaN  NaN
                 Bill    Dumpton Park   x   x   NaN  NaN
Class A Dove     Andy           Dover   x   x   NaN  NaN
                 Hannah         Dover   x   x   NaN  NaN
Class B Lond     Jemma         London   x   x   NaN  NaN
                 James         London   x   x   NaN  NaN

In [17]: cdf2
Out[17]: 
                        Score   No
Class   Location Name             
Class A York     Jim      NaN  742
                 Sam      NaN  931
                 Tom      NaN  653
Class B Dump     Bob     23.1  299
                 Bill    23.4  198
                 Sarah   23.5   12
Class A Dove     Andy      23  983
                 Hannah     1  293
Class B Lond     Jemma   32.2  NaN
                 James   32.0  NaN
Run Code Online (Sandbox Code Playgroud)

使用cdf2中的数据更新cdf1中的数据

cdf1.update(cdf2, overwrite=False)
Run Code Online (Sandbox Code Playgroud)

结果是

In [19]: cdf1
Out[19]: 
                                Local DPE JJK Score   No
Class   Location Name                                   
Class A York     Tom             York   x   x    32  653
                 Jim             York   x   x    10  742
                 Sam             York   x   x    32  931
Class B Dump     Sarah   Dumpton Park   x   x  23.5   12
                 Bob     Dumpton Park   x   x  23.1  299
                 Bill    Dumpton Park   x   x  23.4  198
Class A Dove     Andy           Dover   x   x    23  983
                 Hannah         Dover   x   x     1  293
Class B Lond     Jemma         London   x   x  32.2  NaN
                 James         London   x   x  32.0  NaN
Run Code Online (Sandbox Code Playgroud)

最后将cdf1返回到它的原始格式并将其写入csv文件.

cdf1 = cdf1.reset_index()  #These two steps allow us to remove the 'Location' column
del cdf1['Location']    
cdf1 = cdf1[col_order]     #This will switch Local and Name back to their original order

cdf1.to_csv('temp.csv',index = False)
Run Code Online (Sandbox Code Playgroud)

两个注意事项:首先,考虑到使用cdf1.Local.value_counts()或len(cdf1.Local.value_counts())等是多么容易.我强烈建议添加一些检查求和以确保从Location转移到位置的前几个字母,你不是偶然消除了一个位置.其次,我真诚地希望你所需输出的第4行有一个拼写错误.

在一个普通的python解决方案上.在下面,根据需要调整文件名.

#Open all of the necessary files
csv1 = open('csv1','r')
csv2 = open('csv2','r')
csvout = open('csv_out','w')

#Read past both headers and write the header to the outfile
wstr = csv1.readline()
csvout.write(wstr)
csv2.readline()

#Read csv1 into a dictionary with keys of Class,Name,and first four digits of Local and keep a list of keys for line ordering
line_keys = []
line_dict = {}
for line in csv1:
    s = line.split(',')
    this_key = (s[0],s[1][0:4],s[2])
    line_dict[this_key] = s
    line_keys.append(this_key)

#Go through csv2 updating the data in csv1 as necessary
for line in csv2:
    s = line.replace('\"','').split(',')
    this_key = (s[0],s[1][0:4],s[2])
    if this_key in line_dict:   #Lowers the crash rate...
        #Check if need to replace Score...
        if len(s[3]) > 0 and float(s[3]) != 0:
            line_dict[this_key][5] = s[3]
        #Check if need to repace No...
        if len(s[4]) > 0 and float(s[4]) != 0:
            line_dict[this_key][6] = s[4]
    else:
        print "Line not in csv1: %s"%line

#Write the updated line_dict to csvout
for key in line_keys:
    wstr = ','.join(line_dict[key])
    csvout.write(wstr)
csvout.write('\n')

#Close all of the open filehandles
csv1.close()
csv2.close()
csvout.close()
Run Code Online (Sandbox Code Playgroud)


And*_*den 5

您可以使用fuzzywuzzy进行城镇名称的匹配,并将其作为列附加到df2:

df1 = pd.read_csv(csv1)
df2 = pd.read_csv(csv2)

towns = df1.Local.unique()  # assuming this is complete list of towns

from fuzzywuzzy.fuzz import partial_ratio

In [11]: df2['Local'] =  df2.Location.apply(lambda short_location: max(towns, key=lambda t: partial_ratio(short_location, t)))

In [12]: df2
Out[12]: 
     Class Location Student  Scorecard  Number         Local
0  Class A     York     Jim        0.0     742          York
1  Class A     York     Sam        0.0     931          York
2  Class A     York     Tom        0.0     653          York
3  Class B    Dumpt     Bob       23.1     299  Dumpton Park
4  Class B    Dumpt    Bill       23.4     198  Dumpton Park
5  Class B    Dumpt   Sarah       23.5      12  Dumpton Park
6  Class A    Dover    Andy       23.0     983         Dover
7  Class A    Dover  Hannah        1.0     293         Dover
8  Class B     Lond   Jemma       32.2       0        London
9  Class B     Lond   James       32.0       0        London
Run Code Online (Sandbox Code Playgroud)

使名称保持一致(目前学生和姓名被错误命名):

In [13]: df2.rename_axis({'Student': 'Name'}, axis=1, inplace=True)
Run Code Online (Sandbox Code Playgroud)

现在您可以合并(在重叠列上):

In [14]: res = df1.merge(df2, how='outer')

In [15]: res
Out[15]: 
     Class         Local    Name DPE JJK  Score  No Location  Scorecard  Number
0  Class A          York     Tom   x   x     32 NaN     York        0.0     653
1  Class A          York     Jim   x   x     10 NaN     York        0.0     742
2  Class A          York     Sam   x   x     32 NaN     York        0.0     931
3  Class B  Dumpton Park   Sarah   x   x    NaN NaN    Dumpt       23.5      12
4  Class B  Dumpton Park     Bob   x   x    NaN NaN    Dumpt       23.1     299
5  Class B  Dumpton Park    Bill   x   x    NaN NaN    Dumpt       23.4     198
6  Class A         Dover    Andy   x   x    NaN NaN    Dover       23.0     983
7  Class A         Dover  Hannah   x   x    NaN NaN    Dover        1.0     293
8  Class B        London   Jemma   x   x    NaN NaN     Lond       32.2       0
9  Class B        London   James   x   x    NaN NaN     Lond       32.0       0
Run Code Online (Sandbox Code Playgroud)

有一点要清理的是得分,我想我会把两者中的最大值:

In [16]: res['Score'] = res.loc[:, ['Score', 'Scorecard']].max(1)

In [17]: del res['Scorecard'] 
         del res['No']
         del res['Location']
Run Code Online (Sandbox Code Playgroud)

然后你留下你想要的列:

In [18]: res
Out[18]: 
     Class         Local    Name DPE JJK  Score  Number
0  Class A          York     Tom   x   x   32.0     653
1  Class A          York     Jim   x   x   10.0     742
2  Class A          York     Sam   x   x   32.0     931
3  Class B  Dumpton Park   Sarah   x   x   23.5      12
4  Class B  Dumpton Park     Bob   x   x   23.1     299
5  Class B  Dumpton Park    Bill   x   x   23.4     198
6  Class A         Dover    Andy   x   x   23.0     983
7  Class A         Dover  Hannah   x   x    1.0     293
8  Class B        London   Jemma   x   x   32.2       0
9  Class B        London   James   x   x   32.0       0

In [18]: res.to_csv('foo.csv')
Run Code Online (Sandbox Code Playgroud)

注意:要强制dtype为object(并且具有混合dtypes,int和浮点数,而不是所有浮点数),您可以使用apply.如果您正在进行任何分析,我会建议不要这样做!

res['Score'] = res['Score'].apply(lambda x: int(x) if int(x) == x else x, convert_dtype=False)
Run Code Online (Sandbox Code Playgroud)


Eth*_*man 5

希望这段代码更具可读性.;)Python的新Enum类型的后端在这里.

from enum import Enum       # see PyPI for the backport (enum34)

class Field(Enum):

    course = 0
    location = 1
    student = 2
    dpe = 3
    jjk = 4
    score = -2
    number = -1

    def __index__(self):
        return self._value_

def Float(text):
    if not text:
        return 0.0
    return float(text)

def load_our_data(filename):
    "return a dict using the first three fields as the key"
    data = dict()
    with open(filename) as input:
        next(input)  # throw away header
        for line in input:
            fields = line.strip('\n').split(',')
            fields[Field.score] = Float(fields[Field.score])
            fields[Field.number] = Float(fields[Field.number])
            key = (
                fields[Field.course].lower(),
                fields[Field.location][:4].lower(),
                fields[Field.student].lower(),
                )
            data[key] = fields
    return data

def load_their_data(filename):
    "return a dict using the first three fields as the key"
    data = dict()
    with open(filename) as input:
        next(input)  # throw away header
        for line in input:
            fields = line.strip('\n').split(',')
            fields = [f.strip('"') for f in fields]
            fields[Field.score] = Float(fields[Field.score])
            fields[Field.number] = Float(fields[Field.number])
            key = (
                fields[Field.course].lower(),
                fields[Field.location][:4].lower(),
                fields[Field.student].lower(),
                )
            data[key] = fields
    return data

def merge_data(ours, theirs):
    "their data is only used if not blank and non-zero"
    for key, our_data in ours.items():
        their_data = theirs[key]
        if their_data[Field.score]:
            our_data[Field.score] = their_data[Field.score]
        if their_data[Field.number]:
            our_data[Field.number] = their_data[Field.number]

def write_our_data(data, filename):
    with open(filename, 'w') as output:
        for record in sorted(data.values()):
            line = ','.join([str(f) for f in record])
            output.write(line + '\n')

if __name__ == '__main__':
    ours = load_our_data('one.csv')
    theirs = load_their_data('two.csv')
    merge_data(ours, theirs)
    write_our_data(ours, 'three.csv')
Run Code Online (Sandbox Code Playgroud)


met*_*oia 4

Python 字典是解决这个问题的方法:

studentDict = {}

with open(<csv1>, 'r') as f:
  for line in f:
    LL = line.rstrip('\n').replace('"','').split(',')
    studentDict[LL[0], LL[1], LL[2]] = LL[3:]

with open(<csv2>, 'r') as f:
  for line in f:
    LL = line.rstrip('\n').replace('"','').split(',')
    if LL[-2] not in ('0', ''): studentDict[LL[0], LL[1], LL[2]][-2] = LL[-2]
    if LL[-1] not in ('0', ''): studentDict[LL[0], LL[1], LL[2]][-1] = LL[-1]

with open(<outFile>, 'w') as f:
  for k in studentDict.keys():
    v = studentDict[k[0], k[1], k[2]]
    f.write(k[0] + ',' + k[1] + ',' + k[2] + ',' + v[0] + ',' + v[1] + ',' + v[2] + ',' + v[3] + '\n')
Run Code Online (Sandbox Code Playgroud)