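I am trying to clean up some data in an Excel file. The file contains 7400 rows and 18 columns, including a list of customers with their addresses and other data. The problem is that some cities are misspelled, which distorts the information and makes further processing difficult.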
SURNAME | ADDRESS | CITY
0 Jenson | 252 Des Chênes | D.DO
1 Jean | 236 Gouin | DOLLARD
2 Denis | 993 Boul. Gouin | DOLLARD-DES-ORMEAUX
3 Bradford | 1690 Dollard #7 | DDO
4 Alisson | 115 Du Buisson | IL PERROT
5 Abdul | 9877 Boul. Gouin | Pierrefonds
6 O'Neil | 5 Du College | Ile Bizard
7 Bundy | 7345 Sherbrooke | ILLE Perot
8 Darcy | 8671 Anthony #2 | ILE Perrot
9 Adams | 845 Georges | Pierrefonds
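In the example above, D.DO, DOLLARD and DDO should be spelled DOLLARD-DES-ORMEAUX, and IL PERROT, ILLE PEROT and ILE PERROT should be spelled ILE-PERROT.
I have been able to replace the values using the following: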
df["CITY"].replace(to_replace={"D.DO", "DOLLARD", "DDO"}, value="DOLLARD-DES-ORMEAUX", regex=True)
df["CITY"].replace(to_replace={"IL PERROT", "ILLE PEROT", "ILE PERROT"}, value="ILE-PERROT", regex=True)
Is there a way to combine the two operations above into one? I tried:
df["CITY"].replace({to_replace={"D.DO", "DOLLARD", "DDO"}, value="DOLLARD-DES-ORMEAUX", to_replace={"IL PERROT", "ILLE PEROT", "ILE PERROT"}, value="ILE-PERROT"}, regex=True)
but without success.
Max*_*axU 15
# Map each regex to its canonical city name. The patterns are anchored
# with ^...$ so that only whole-cell matches are replaced.
replacements = {
    'CITY': {
        r'^(D\.?DO|DOLLARD.*)$': 'DOLLARD-DES-ORMEAUX',
        r'(?i)^IL+E?[ -]PER+OT$': 'ILE-PERROT',
    }
}
df.replace(replacements, regex=True, inplace=True)
print(df)
Output:
    SURNAME           ADDRESS                 CITY
0    Jenson    252 Des Chênes  DOLLARD-DES-ORMEAUX
1      Jean         236 Gouin  DOLLARD-DES-ORMEAUX
2     Denis   993 Boul. Gouin  DOLLARD-DES-ORMEAUX
3  Bradford   1690 Dollard #7  DOLLARD-DES-ORMEAUX
4   Alisson    115 Du Buisson           ILE-PERROT
5     Abdul  9877 Boul. Gouin          Pierrefonds
6    O'Neil      5 Du College           Ile Bizard
7     Bundy   7345 Sherbrooke           ILE-PERROT
8     Darcy   8671 Anthony #2           ILE-PERROT
9     Adams       845 Georges          Pierrefonds
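If the misspellings form a known, finite list, an exact-match lookup is an alternative that sidesteps regex over-matching (a loose pattern for ILE-PERROT can easily catch Ile Bizard as well). The following is only a minimal sketch under that assumption: `CANONICAL` and `lookup` are names made up for this illustration, and the sample frame reproduces just the CITY column from the question.

```python
import pandas as pd

# Sample of the CITY column from the question
df = pd.DataFrame({"CITY": [
    "D.DO", "DOLLARD", "DOLLARD-DES-ORMEAUX", "DDO", "IL PERROT",
    "Pierrefonds", "Ile Bizard", "ILLE Perot", "ILE Perrot", "Pierrefonds",
]})

# Hypothetical table: canonical name -> known misspellings (upper-cased)
CANONICAL = {
    "DOLLARD-DES-ORMEAUX": ["D.DO", "DOLLARD", "DDO"],
    "ILE-PERROT": ["IL PERROT", "ILLE PEROT", "ILE PERROT"],
}

# Invert into a flat {misspelling: canonical} lookup
lookup = {bad: good for good, variants in CANONICAL.items() for bad in variants}

# Compare case-insensitively; keep the original value where nothing matches
key = df["CITY"].str.strip().str.upper()
df["CITY"] = key.map(lookup).fillna(df["CITY"])
print(df["CITY"].tolist())
```

Cities that are not in the lookup (Pierrefonds, Ile Bizard) pass through unchanged, and new misspellings are handled by adding them to the table rather than by loosening a pattern.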