Oli*_*eph 5 python csv python-2.7 pandas
我正在尝试使用其他来源提供的一些学生数据来更新csv文件,但是他们已经将他们的csv数据格式化为与我们的略有不同.
它需要根据三个标准匹配学生的名字,他们的班级,最后是该地点的前几个字母,所以对于B班的前几个学生Dumpt来说实际上是Dumpton Park.
找到匹配项时
以下是一些示例数据:
Class,Local,Name,DPE,JJK,Score,No
Class A,York,Tom,x,x,32,
Class A,York,Jim,x,x,10,
Class A,York,Sam,x,x,32,
Class B,Dumpton Park,Sarah,x,x,,
Class B,Dumpton Park,Bob,x,x,,
Class B,Dumpton Park,Bill,x,x,,
Class A,Dover,Andy,x,x,,
Class A,Dover,Hannah,x,x,,
Class B,London,Jemma,x,x,,
Class B,London,James,x,x,,
Run Code Online (Sandbox Code Playgroud)
"Class","Location","Student","Scorecard","Number"
"Class A","York","Jim","0","742"
"Class A","York","Sam","0","931"
"Class A","York","Tom","0","653"
"Class B","Dumpt","Bob","23.1","299"
"Class B","Dumpt","Bill","23.4","198"
"Class B","Dumpt","Sarah","23.5","12"
"Class A","Dover","Andy","23","983"
"Class A","Dover","Hannah","1","293"
"Class B","Lond","Jemma","32.2","0"
"Class B","Lond","James","32.0","0"
Run Code Online (Sandbox Code Playgroud)
Class,Local,Name,DPE,JJK,Score,No
Class A,York,Tom,x,x,32,653
Class A,York,Jim,x,x,10,742
Class A,York,Sam,x,x,32,653
Class B,Dumpton Park,Sarah,x,x,23.5,12
Class B,Dumpton Park,Bob,x,x,23.1,299
Class B,Dumpton Park,Bill,x,x,23.4,198
Class A,Dover,Andy,x,x,23,983
Class A,Dover,Hannah,x,x,1,293
Class B,London,Jemma,x,x,32.2,
Class B,London,James,x,x,32.0,
Run Code Online (Sandbox Code Playgroud)
我真的很感激这个问题的任何帮助.谢谢奥利弗
这里有两个解决方案:一个熊猫解决方案和一个普通的python解决方案.首先是一个熊猫解决方案,不出所料看起来像其他熊猫解决方案一样......
首先加载数据
import pandas
import numpy as np
cdf1 = pandas.read_csv('csv1',dtype=object) #dtype = object allows us to preserve the numeric formats
cdf2 = pandas.read_csv('csv2',dtype=object)
col_order = cdf1.columns #pandas will shuffle the column order at some point---this allows us to reset ot original column order
Run Code Online (Sandbox Code Playgroud)
此时数据框将如下所示
In [6]: cdf1
Out[6]:
Class Local Name DPE JJK Score No
0 Class A York Tom x x 32 NaN
1 Class A York Jim x x 10 NaN
2 Class A York Sam x x 32 NaN
3 Class B Dumpton Park Sarah x x NaN NaN
4 Class B Dumpton Park Bob x x NaN NaN
5 Class B Dumpton Park Bill x x NaN NaN
6 Class A Dover Andy x x NaN NaN
7 Class A Dover Hannah x x NaN NaN
8 Class B London Jemma x x NaN NaN
9 Class B London James x x NaN NaN
In [7]: cdf2
Out[7]:
Class Location Student Scorecard Number
0 Class A York Jim 0 742
1 Class A York Sam 0 931
2 Class A York Tom 0 653
3 Class B Dumpt Bob 23.1 299
4 Class B Dumpt Bill 23.4 198
5 Class B Dumpt Sarah 23.5 12
6 Class A Dover Andy 23 983
7 Class A Dover Hannah 1 293
8 Class B Lond Jemma 32.2 0
9 Class B Lond James 32.0 0
Run Code Online (Sandbox Code Playgroud)
接下来将数据帧处理为匹配格式.
dcol = cdf2.Location
cdf2['Location'] = dcol.apply(lambda x: x[0:4]) #Replacement in cdf2 since we don't need original data
dcol = cdf1.Local
cdf1['Location'] = dcol.apply(lambda x: x[0:4]) #Here we add a column leaving 'Local' because we'll need it for the final output
cdf2 = cdf2.rename(columns={'Student': 'Name', 'Scorecard': 'Score', 'Number': 'No'})
cdf2 = cdf2.replace('0', np.nan) #Replacing '0' by np.nan means zeros don't overwrite
cdf1 = cdf1.set_index(['Class', 'Location', 'Name'])
cdf2 = cdf2.set_index(['Class', 'Location', 'Name'])
Run Code Online (Sandbox Code Playgroud)
现在cdf1和cdf2看起来像
In [16]: cdf1
Out[16]:
Local DPE JJK Score No
Class Location Name
Class A York Tom York x x 32 NaN
Jim York x x 10 NaN
Sam York x x 32 NaN
Class B Dump Sarah Dumpton Park x x NaN NaN
Bob Dumpton Park x x NaN NaN
Bill Dumpton Park x x NaN NaN
Class A Dove Andy Dover x x NaN NaN
Hannah Dover x x NaN NaN
Class B Lond Jemma London x x NaN NaN
James London x x NaN NaN
In [17]: cdf2
Out[17]:
Score No
Class Location Name
Class A York Jim NaN 742
Sam NaN 931
Tom NaN 653
Class B Dump Bob 23.1 299
Bill 23.4 198
Sarah 23.5 12
Class A Dove Andy 23 983
Hannah 1 293
Class B Lond Jemma 32.2 NaN
James 32.0 NaN
Run Code Online (Sandbox Code Playgroud)
使用cdf2中的数据更新cdf1中的数据
cdf1.update(cdf2, overwrite=False)
Run Code Online (Sandbox Code Playgroud)
结果是
In [19]: cdf1
Out[19]:
Local DPE JJK Score No
Class Location Name
Class A York Tom York x x 32 653
Jim York x x 10 742
Sam York x x 32 931
Class B Dump Sarah Dumpton Park x x 23.5 12
Bob Dumpton Park x x 23.1 299
Bill Dumpton Park x x 23.4 198
Class A Dove Andy Dover x x 23 983
Hannah Dover x x 1 293
Class B Lond Jemma London x x 32.2 NaN
James London x x 32.0 NaN
Run Code Online (Sandbox Code Playgroud)
最后将cdf1返回到它的原始格式并将其写入csv文件.
cdf1 = cdf1.reset_index() #These two steps allow us to remove the 'Location' column
del cdf1['Location']
cdf1 = cdf1[col_order] #This will switch Local and Name back to their original order
cdf1.to_csv('temp.csv',index = False)
Run Code Online (Sandbox Code Playgroud)
两个注意事项:首先,考虑到使用cdf1.Local.value_counts()或len(cdf1.Local.value_counts())等是多么容易.我强烈建议添加一些检查求和以确保从Location转移到位置的前几个字母,你不是偶然消除了一个位置.其次,我真诚地希望你所需输出的第4行有一个拼写错误.
在一个普通的python解决方案上.在下面,根据需要调整文件名.
#Open all of the necessary files
csv1 = open('csv1','r')
csv2 = open('csv2','r')
csvout = open('csv_out','w')
#Read past both headers and write the header to the outfile
wstr = csv1.readline()
csvout.write(wstr)
csv2.readline()
#Read csv1 into a dictionary with keys of Class,Name,and first four digits of Local and keep a list of keys for line ordering
line_keys = []
line_dict = {}
for line in csv1:
s = line.split(',')
this_key = (s[0],s[1][0:4],s[2])
line_dict[this_key] = s
line_keys.append(this_key)
#Go through csv2 updating the data in csv1 as necessary
for line in csv2:
s = line.replace('\"','').split(',')
this_key = (s[0],s[1][0:4],s[2])
if this_key in line_dict: #Lowers the crash rate...
#Check if need to replace Score...
if len(s[3]) > 0 and float(s[3]) != 0:
line_dict[this_key][5] = s[3]
#Check if need to repace No...
if len(s[4]) > 0 and float(s[4]) != 0:
line_dict[this_key][6] = s[4]
else:
print "Line not in csv1: %s"%line
#Write the updated line_dict to csvout
for key in line_keys:
wstr = ','.join(line_dict[key])
csvout.write(wstr)
csvout.write('\n')
#Close all of the open filehandles
csv1.close()
csv2.close()
csvout.close()
Run Code Online (Sandbox Code Playgroud)
您可以使用fuzzywuzzy进行城镇名称的匹配,并将其作为列附加到df2:
df1 = pd.read_csv(csv1)
df2 = pd.read_csv(csv2)
towns = df1.Local.unique() # assuming this is complete list of towns
from fuzzywuzzy.fuzz import partial_ratio
In [11]: df2['Local'] = df2.Location.apply(lambda short_location: max(towns, key=lambda t: partial_ratio(short_location, t)))
In [12]: df2
Out[12]:
Class Location Student Scorecard Number Local
0 Class A York Jim 0.0 742 York
1 Class A York Sam 0.0 931 York
2 Class A York Tom 0.0 653 York
3 Class B Dumpt Bob 23.1 299 Dumpton Park
4 Class B Dumpt Bill 23.4 198 Dumpton Park
5 Class B Dumpt Sarah 23.5 12 Dumpton Park
6 Class A Dover Andy 23.0 983 Dover
7 Class A Dover Hannah 1.0 293 Dover
8 Class B Lond Jemma 32.2 0 London
9 Class B Lond James 32.0 0 London
Run Code Online (Sandbox Code Playgroud)
使名称保持一致(目前学生和姓名被错误命名):
In [13]: df2.rename_axis({'Student': 'Name'}, axis=1, inplace=True)
Run Code Online (Sandbox Code Playgroud)
现在您可以合并(在重叠列上):
In [14]: res = df1.merge(df2, how='outer')
In [15]: res
Out[15]:
Class Local Name DPE JJK Score No Location Scorecard Number
0 Class A York Tom x x 32 NaN York 0.0 653
1 Class A York Jim x x 10 NaN York 0.0 742
2 Class A York Sam x x 32 NaN York 0.0 931
3 Class B Dumpton Park Sarah x x NaN NaN Dumpt 23.5 12
4 Class B Dumpton Park Bob x x NaN NaN Dumpt 23.1 299
5 Class B Dumpton Park Bill x x NaN NaN Dumpt 23.4 198
6 Class A Dover Andy x x NaN NaN Dover 23.0 983
7 Class A Dover Hannah x x NaN NaN Dover 1.0 293
8 Class B London Jemma x x NaN NaN Lond 32.2 0
9 Class B London James x x NaN NaN Lond 32.0 0
Run Code Online (Sandbox Code Playgroud)
有一点要清理的是得分,我想我会把两者中的最大值:
In [16]: res['Score'] = res.loc[:, ['Score', 'Scorecard']].max(1)
In [17]: del res['Scorecard']
del res['No']
del res['Location']
Run Code Online (Sandbox Code Playgroud)
然后你留下你想要的列:
In [18]: res
Out[18]:
Class Local Name DPE JJK Score Number
0 Class A York Tom x x 32.0 653
1 Class A York Jim x x 10.0 742
2 Class A York Sam x x 32.0 931
3 Class B Dumpton Park Sarah x x 23.5 12
4 Class B Dumpton Park Bob x x 23.1 299
5 Class B Dumpton Park Bill x x 23.4 198
6 Class A Dover Andy x x 23.0 983
7 Class A Dover Hannah x x 1.0 293
8 Class B London Jemma x x 32.2 0
9 Class B London James x x 32.0 0
In [18]: res.to_csv('foo.csv')
Run Code Online (Sandbox Code Playgroud)
注意:要强制dtype为object(并且具有混合dtypes,int和浮点数,而不是所有浮点数),您可以使用apply.如果您正在进行任何分析,我会建议不要这样做!
res['Score'] = res['Score'].apply(lambda x: int(x) if int(x) == x else x, convert_dtype=False)
Run Code Online (Sandbox Code Playgroud)
希望这段代码更具可读性.;)Python的新Enum类型的后端在这里.
from enum import Enum # see PyPI for the backport (enum34)
class Field(Enum):
course = 0
location = 1
student = 2
dpe = 3
jjk = 4
score = -2
number = -1
def __index__(self):
return self._value_
def Float(text):
if not text:
return 0.0
return float(text)
def load_our_data(filename):
"return a dict using the first three fields as the key"
data = dict()
with open(filename) as input:
next(input) # throw away header
for line in input:
fields = line.strip('\n').split(',')
fields[Field.score] = Float(fields[Field.score])
fields[Field.number] = Float(fields[Field.number])
key = (
fields[Field.course].lower(),
fields[Field.location][:4].lower(),
fields[Field.student].lower(),
)
data[key] = fields
return data
def load_their_data(filename):
"return a dict using the first three fields as the key"
data = dict()
with open(filename) as input:
next(input) # throw away header
for line in input:
fields = line.strip('\n').split(',')
fields = [f.strip('"') for f in fields]
fields[Field.score] = Float(fields[Field.score])
fields[Field.number] = Float(fields[Field.number])
key = (
fields[Field.course].lower(),
fields[Field.location][:4].lower(),
fields[Field.student].lower(),
)
data[key] = fields
return data
def merge_data(ours, theirs):
"their data is only used if not blank and non-zero"
for key, our_data in ours.items():
their_data = theirs[key]
if their_data[Field.score]:
our_data[Field.score] = their_data[Field.score]
if their_data[Field.number]:
our_data[Field.number] = their_data[Field.number]
def write_our_data(data, filename):
with open(filename, 'w') as output:
for record in sorted(data.values()):
line = ','.join([str(f) for f in record])
output.write(line + '\n')
if __name__ == '__main__':
ours = load_our_data('one.csv')
theirs = load_their_data('two.csv')
merge_data(ours, theirs)
write_our_data(ours, 'three.csv')
Run Code Online (Sandbox Code Playgroud)
Python 字典是解决这个问题的方法:
studentDict = {}
with open(<csv1>, 'r') as f:
for line in f:
LL = line.rstrip('\n').replace('"','').split(',')
studentDict[LL[0], LL[1], LL[2]] = LL[3:]
with open(<csv2>, 'r') as f:
for line in f:
LL = line.rstrip('\n').replace('"','').split(',')
if LL[-2] not in ('0', ''): studentDict[LL[0], LL[1], LL[2]][-2] = LL[-2]
if LL[-1] not in ('0', ''): studentDict[LL[0], LL[1], LL[2]][-1] = LL[-1]
with open(<outFile>, 'w') as f:
for k in studentDict.keys():
v = studentDict[k[0], k[1], k[2]]
f.write(k[0] + ',' + k[1] + ',' + k[2] + ',' + v[0] + ',' + v[1] + ',' + v[2] + ',' + v[3] + '\n')
Run Code Online (Sandbox Code Playgroud)
| 归档时间: |
|
| 查看次数: |
407 次 |
| 最近记录: |