使用具有不同格式的csv中的数据更新csv

Oli*_*eph 5 python csv python-2.7 pandas

我正在尝试使用其他来源提供的一些学生数据来更新csv文件,但是他们已经将他们的csv数据格式化为与我们的略有不同.

它需要根据三个标准匹配学生的名字,他们的班级,最后是该地点的前几个字母,所以对于B班的前几个学生Dumpt来说实际上是Dumpton Park.

找到匹配项时

如果CSV 2中学生的记分卡为0或空白,则不应更新CSV 1中的分数列
如果学生在CSV 2中的编号为0或为空,则不应更新CSV 1中的"否"列
否则,它应该将CSV 2中的数字导入CSV1

以下是一些示例数据:

CSV 1

Class,Local,Name,DPE,JJK,Score,No
Class A,York,Tom,x,x,32,
Class A,York,Jim,x,x,10,
Class A,York,Sam,x,x,32,
Class B,Dumpton Park,Sarah,x,x,,
Class B,Dumpton Park,Bob,x,x,,
Class B,Dumpton Park,Bill,x,x,,
Class A,Dover,Andy,x,x,,
Class A,Dover,Hannah,x,x,,
Class B,London,Jemma,x,x,,
Class B,London,James,x,x,,

Run Code Online (Sandbox Code Playgroud)

CSV 2

"Class","Location","Student","Scorecard","Number"
"Class A","York","Jim","0","742"
"Class A","York","Sam","0","931"
"Class A","York","Tom","0","653"
"Class B","Dumpt","Bob","23.1","299"
"Class B","Dumpt","Bill","23.4","198"
"Class B","Dumpt","Sarah","23.5","12"
"Class A","Dover","Andy","23","983"
"Class A","Dover","Hannah","1","293"
"Class B","Lond","Jemma","32.2","0"
"Class B","Lond","James","32.0","0"

Run Code Online (Sandbox Code Playgroud)

CSV 1 UPDATED(这是所需的输出)

Class,Local,Name,DPE,JJK,Score,No
Class A,York,Tom,x,x,32,653
Class A,York,Jim,x,x,10,742
Class A,York,Sam,x,x,32,653
Class B,Dumpton Park,Sarah,x,x,23.5,12
Class B,Dumpton Park,Bob,x,x,23.1,299
Class B,Dumpton Park,Bill,x,x,23.4,198
Class A,Dover,Andy,x,x,23,983
Class A,Dover,Hannah,x,x,1,293
Class B,London,Jemma,x,x,32.2,
Class B,London,James,x,x,32.0,

Run Code Online (Sandbox Code Playgroud)

我真的很感激这个问题的任何帮助.谢谢奥利弗

这里有两个解决方案:一个熊猫解决方案和一个普通的python解决方案.首先是一个熊猫解决方案,不出所料看起来像其他熊猫解决方案一样......

首先加载数据

import pandas
import numpy as np

cdf1 = pandas.read_csv('csv1',dtype=object)  #dtype = object allows us to preserve the numeric formats
cdf2 = pandas.read_csv('csv2',dtype=object)

col_order = cdf1.columns  #pandas will shuffle the column order at some point---this allows us to reset ot original column order

Run Code Online (Sandbox Code Playgroud)

此时数据框将如下所示

In [6]: cdf1
Out[6]: 
     Class         Local    Name DPE JJK Score   No
0  Class A          York     Tom   x   x    32  NaN
1  Class A          York     Jim   x   x    10  NaN
2  Class A          York     Sam   x   x    32  NaN
3  Class B  Dumpton Park   Sarah   x   x   NaN  NaN
4  Class B  Dumpton Park     Bob   x   x   NaN  NaN
5  Class B  Dumpton Park    Bill   x   x   NaN  NaN
6  Class A         Dover    Andy   x   x   NaN  NaN
7  Class A         Dover  Hannah   x   x   NaN  NaN
8  Class B        London   Jemma   x   x   NaN  NaN
9  Class B        London   James   x   x   NaN  NaN

In [7]: cdf2
Out[7]: 
     Class Location Student Scorecard Number
0  Class A     York     Jim         0    742
1  Class A     York     Sam         0    931
2  Class A     York     Tom         0    653
3  Class B    Dumpt     Bob      23.1    299
4  Class B    Dumpt    Bill      23.4    198
5  Class B    Dumpt   Sarah      23.5     12
6  Class A    Dover    Andy        23    983
7  Class A    Dover  Hannah         1    293
8  Class B     Lond   Jemma      32.2      0
9  Class B     Lond   James      32.0      0

Run Code Online (Sandbox Code Playgroud)

接下来将数据帧处理为匹配格式.

dcol = cdf2.Location 
cdf2['Location'] = dcol.apply(lambda x: x[0:4])  #Replacement in cdf2 since we don't need original data

dcol = cdf1.Local
cdf1['Location'] = dcol.apply(lambda x: x[0:4])  #Here we add a column leaving 'Local' because we'll need it for the final output

cdf2 = cdf2.rename(columns={'Student': 'Name', 'Scorecard': 'Score', 'Number': 'No'})
cdf2 = cdf2.replace('0', np.nan)  #Replacing '0' by np.nan means zeros don't overwrite

cdf1 = cdf1.set_index(['Class', 'Location', 'Name'])
cdf2 = cdf2.set_index(['Class', 'Location', 'Name'])

Run Code Online (Sandbox Code Playgroud)

现在cdf1和cdf2看起来像

In [16]: cdf1
Out[16]: 
                                Local DPE JJK Score   No
Class   Location Name                                   
Class A York     Tom             York   x   x    32  NaN
                 Jim             York   x   x    10  NaN
                 Sam             York   x   x    32  NaN
Class B Dump     Sarah   Dumpton Park   x   x   NaN  NaN
                 Bob     Dumpton Park   x   x   NaN  NaN
                 Bill    Dumpton Park   x   x   NaN  NaN
Class A Dove     Andy           Dover   x   x   NaN  NaN
                 Hannah         Dover   x   x   NaN  NaN
Class B Lond     Jemma         London   x   x   NaN  NaN
                 James         London   x   x   NaN  NaN

In [17]: cdf2
Out[17]: 
                        Score   No
Class   Location Name             
Class A York     Jim      NaN  742
                 Sam      NaN  931
                 Tom      NaN  653
Class B Dump     Bob     23.1  299
                 Bill    23.4  198
                 Sarah   23.5   12
Class A Dove     Andy      23  983
                 Hannah     1  293
Class B Lond     Jemma   32.2  NaN
                 James   32.0  NaN

Run Code Online (Sandbox Code Playgroud)

使用cdf2中的数据更新cdf1中的数据

cdf1.update(cdf2, overwrite=False)

Run Code Online (Sandbox Code Playgroud)

结果是

In [19]: cdf1
Out[19]: 
                                Local DPE JJK Score   No
Class   Location Name                                   
Class A York     Tom             York   x   x    32  653
                 Jim             York   x   x    10  742
                 Sam             York   x   x    32  931
Class B Dump     Sarah   Dumpton Park   x   x  23.5   12
                 Bob     Dumpton Park   x   x  23.1  299
                 Bill    Dumpton Park   x   x  23.4  198
Class A Dove     Andy           Dover   x   x    23  983
                 Hannah         Dover   x   x     1  293
Class B Lond     Jemma         London   x   x  32.2  NaN
                 James         London   x   x  32.0  NaN

Run Code Online (Sandbox Code Playgroud)

最后将cdf1返回到它的原始格式并将其写入csv文件.

cdf1 = cdf1.reset_index()  #These two steps allow us to remove the 'Location' column
del cdf1['Location']    
cdf1 = cdf1[col_order]     #This will switch Local and Name back to their original order

cdf1.to_csv('temp.csv',index = False)

Run Code Online (Sandbox Code Playgroud)

两个注意事项:首先,考虑到使用cdf1.Local.value_counts()或len(cdf1.Local.value_counts())等是多么容易.我强烈建议添加一些检查求和以确保从Location转移到位置的前几个字母,你不是偶然消除了一个位置.其次,我真诚地希望你所需输出的第4行有一个拼写错误.

在一个普通的python解决方案上.在下面,根据需要调整文件名.

#Open all of the necessary files
csv1 = open('csv1','r')
csv2 = open('csv2','r')
csvout = open('csv_out','w')

#Read past both headers and write the header to the outfile
wstr = csv1.readline()
csvout.write(wstr)
csv2.readline()

#Read csv1 into a dictionary with keys of Class,Name,and first four digits of Local and keep a list of keys for line ordering
line_keys = []
line_dict = {}
for line in csv1:
    s = line.split(',')
    this_key = (s[0],s[1][0:4],s[2])
    line_dict[this_key] = s
    line_keys.append(this_key)

#Go through csv2 updating the data in csv1 as necessary
for line in csv2:
    s = line.replace('\"','').split(',')
    this_key = (s[0],s[1][0:4],s[2])
    if this_key in line_dict:   #Lowers the crash rate...
        #Check if need to replace Score...
        if len(s[3]) > 0 and float(s[3]) != 0:
            line_dict[this_key][5] = s[3]
        #Check if need to repace No...
        if len(s[4]) > 0 and float(s[4]) != 0:
            line_dict[this_key][6] = s[4]
    else:
        print "Line not in csv1: %s"%line

#Write the updated line_dict to csvout
for key in line_keys:
    wstr = ','.join(line_dict[key])
    csvout.write(wstr)
csvout.write('\n')

#Close all of the open filehandles
csv1.close()
csv2.close()
csvout.close()

Run Code Online (Sandbox Code Playgroud)

您可以使用fuzzywuzzy进行城镇名称的匹配,并将其作为列附加到df2:

df1 = pd.read_csv(csv1)
df2 = pd.read_csv(csv2)

towns = df1.Local.unique()  # assuming this is complete list of towns

from fuzzywuzzy.fuzz import partial_ratio

In [11]: df2['Local'] =  df2.Location.apply(lambda short_location: max(towns, key=lambda t: partial_ratio(short_location, t)))

In [12]: df2
Out[12]: 
     Class Location Student  Scorecard  Number         Local
0  Class A     York     Jim        0.0     742          York
1  Class A     York     Sam        0.0     931          York
2  Class A     York     Tom        0.0     653          York
3  Class B    Dumpt     Bob       23.1     299  Dumpton Park
4  Class B    Dumpt    Bill       23.4     198  Dumpton Park
5  Class B    Dumpt   Sarah       23.5      12  Dumpton Park
6  Class A    Dover    Andy       23.0     983         Dover
7  Class A    Dover  Hannah        1.0     293         Dover
8  Class B     Lond   Jemma       32.2       0        London
9  Class B     Lond   James       32.0       0        London

Run Code Online (Sandbox Code Playgroud)

使名称保持一致(目前学生和姓名被错误命名):

In [13]: df2.rename_axis({'Student': 'Name'}, axis=1, inplace=True)

Run Code Online (Sandbox Code Playgroud)

现在您可以合并(在重叠列上):

In [14]: res = df1.merge(df2, how='outer')

In [15]: res
Out[15]: 
     Class         Local    Name DPE JJK  Score  No Location  Scorecard  Number
0  Class A          York     Tom   x   x     32 NaN     York        0.0     653
1  Class A          York     Jim   x   x     10 NaN     York        0.0     742
2  Class A          York     Sam   x   x     32 NaN     York        0.0     931
3  Class B  Dumpton Park   Sarah   x   x    NaN NaN    Dumpt       23.5      12
4  Class B  Dumpton Park     Bob   x   x    NaN NaN    Dumpt       23.1     299
5  Class B  Dumpton Park    Bill   x   x    NaN NaN    Dumpt       23.4     198
6  Class A         Dover    Andy   x   x    NaN NaN    Dover       23.0     983
7  Class A         Dover  Hannah   x   x    NaN NaN    Dover        1.0     293
8  Class B        London   Jemma   x   x    NaN NaN     Lond       32.2       0
9  Class B        London   James   x   x    NaN NaN     Lond       32.0       0

Run Code Online (Sandbox Code Playgroud)

有一点要清理的是得分,我想我会把两者中的最大值:

In [16]: res['Score'] = res.loc[:, ['Score', 'Scorecard']].max(1)

In [17]: del res['Scorecard'] 
         del res['No']
         del res['Location']

Run Code Online (Sandbox Code Playgroud)

然后你留下你想要的列:

In [18]: res
Out[18]: 
     Class         Local    Name DPE JJK  Score  Number
0  Class A          York     Tom   x   x   32.0     653
1  Class A          York     Jim   x   x   10.0     742
2  Class A          York     Sam   x   x   32.0     931
3  Class B  Dumpton Park   Sarah   x   x   23.5      12
4  Class B  Dumpton Park     Bob   x   x   23.1     299
5  Class B  Dumpton Park    Bill   x   x   23.4     198
6  Class A         Dover    Andy   x   x   23.0     983
7  Class A         Dover  Hannah   x   x    1.0     293
8  Class B        London   Jemma   x   x   32.2       0
9  Class B        London   James   x   x   32.0       0

In [18]: res.to_csv('foo.csv')

Run Code Online (Sandbox Code Playgroud)

注意:要强制dtype为object(并且具有混合dtypes,int和浮点数,而不是所有浮点数),您可以使用apply.如果您正在进行任何分析,我会建议不要这样做!

res['Score'] = res['Score'].apply(lambda x: int(x) if int(x) == x else x, convert_dtype=False)

Run Code Online (Sandbox Code Playgroud)

希望这段代码更具可读性.;)Python的新Enum类型的后端在这里.

from enum import Enum       # see PyPI for the backport (enum34)

class Field(Enum):

    course = 0
    location = 1
    student = 2
    dpe = 3
    jjk = 4
    score = -2
    number = -1

    def __index__(self):
        return self._value_

def Float(text):
    if not text:
        return 0.0
    return float(text)

def load_our_data(filename):
    "return a dict using the first three fields as the key"
    data = dict()
    with open(filename) as input:
        next(input)  # throw away header
        for line in input:
            fields = line.strip('\n').split(',')
            fields[Field.score] = Float(fields[Field.score])
            fields[Field.number] = Float(fields[Field.number])
            key = (
                fields[Field.course].lower(),
                fields[Field.location][:4].lower(),
                fields[Field.student].lower(),
                )
            data[key] = fields
    return data

def load_their_data(filename):
    "return a dict using the first three fields as the key"
    data = dict()
    with open(filename) as input:
        next(input)  # throw away header
        for line in input:
            fields = line.strip('\n').split(',')
            fields = [f.strip('"') for f in fields]
            fields[Field.score] = Float(fields[Field.score])
            fields[Field.number] = Float(fields[Field.number])
            key = (
                fields[Field.course].lower(),
                fields[Field.location][:4].lower(),
                fields[Field.student].lower(),
                )
            data[key] = fields
    return data

def merge_data(ours, theirs):
    "their data is only used if not blank and non-zero"
    for key, our_data in ours.items():
        their_data = theirs[key]
        if their_data[Field.score]:
            our_data[Field.score] = their_data[Field.score]
        if their_data[Field.number]:
            our_data[Field.number] = their_data[Field.number]

def write_our_data(data, filename):
    with open(filename, 'w') as output:
        for record in sorted(data.values()):
            line = ','.join([str(f) for f in record])
            output.write(line + '\n')

if __name__ == '__main__':
    ours = load_our_data('one.csv')
    theirs = load_their_data('two.csv')
    merge_data(ours, theirs)
    write_our_data(ours, 'three.csv')

Run Code Online (Sandbox Code Playgroud)

Python 字典是解决这个问题的方法：

studentDict = {}

with open(<csv1>, 'r') as f:
  for line in f:
    LL = line.rstrip('\n').replace('"','').split(',')
    studentDict[LL[0], LL[1], LL[2]] = LL[3:]

with open(<csv2>, 'r') as f:
  for line in f:
    LL = line.rstrip('\n').replace('"','').split(',')
    if LL[-2] not in ('0', ''): studentDict[LL[0], LL[1], LL[2]][-2] = LL[-2]
    if LL[-1] not in ('0', ''): studentDict[LL[0], LL[1], LL[2]][-1] = LL[-1]

with open(<outFile>, 'w') as f:
  for k in studentDict.keys():
    v = studentDict[k[0], k[1], k[2]]
    f.write(k[0] + ',' + k[1] + ',' + k[2] + ',' + v[0] + ',' + v[1] + ',' + v[2] + ',' + v[3] + '\n')

Run Code Online (Sandbox Code Playgroud)

归档时间：	12 年，7 月前
查看次数：	407 次
最近记录：	12 年，6 月前

如何规范化名称 6

更多相关链接

len()和.__ len __()之间的区别？ 84

在Python和sqlite中转义字符 44

如何在调试模式下调用pyspark？ 19

如何使用字典理解来计算文档中每个单词的出现次数 15

将Pandas DataFrame中的列与DataFrame中的列列相结合 11

Pandas 数据框：使用另外 2 列创建一个新列，该列是自定义函数 8

pandas对于带有基数为10的long()的文字无效 7

Celery-工人关闭后如何更新任务状态？ 6

如何删除非字母字符的每个单词 5

如何在 Pandas 数据框中获取列表的长度 5

为什么Android模拟器这么慢？我们如何加快Android模拟器的速度？ 3356

如何在JavaScript中将字符串转换为布尔值？ 2328

2048游戏的最佳算法是什么？ 1893

如何将多行中的文本连接成SQL Server中的单个文本字符串？ 1813

JavaScript检查变量是否存在(定义/初始化) 1642

什么是TypeScript,为什么我会用它代替JavaScript？ 1637

在Bash中循环遍历一串字符串？ 1371

你如何存放未跟踪的文件？ 1287

在Android Studio中重命名包 1252

如何随机化(shuffle)一个JavaScript数组？ 1138