Jst*_*uff 10 python fuzzy-search pandas fuzzywuzzy
我有两个数据帧,每个数据帧都有不同的行数.下面是每个数据集的几行
df1 =
Company City State ZIP
FREDDIE LEES AMERICAN GOURMET SAUCE St. Louis MO 63101
CITYARCHRIVER 2015 FOUNDATION St. Louis MO 63102
GLAXOSMITHKLINE CONSUMER HEALTHCARE St. Louis MO 63102
LACKEY SHEET METAL St. Louis MO 63102
Run Code Online (Sandbox Code Playgroud)
和
df2 =
FDA Company FDA City FDA State FDA ZIP
LACKEY SHEET METAL St. Louis MO 63102
PRIMUS STERILIZER COMPANY LLC Great Bend KS 67530
HELGET GAS PRODUCTS INC Omaha NE 68127
ORTHOQUEST LLC La Vista NE 68128
Run Code Online (Sandbox Code Playgroud)
我和他们并肩使用combined_data = pandas.concat([df1, df2], axis = 1).我的下一个目标是使用模块中的几个不同匹配命令比较每个字符串下df1['Company']的每个字符串,并返回最佳匹配的值及其名称.我想将它存储在一个新列中.举例来说,如果我做了,并在中到它会返回最匹配的是一个得分,这将随后在新列下保存.结果看起来像df2['FDA Company']fuzzy wuzzyfuzz.ratiofuzz.token_sort_ratioLACKY SHEET METALdf1['Company']df2['FDA Company']LACKY SHEET METAL100combined data
combined_data =
Company City State ZIP FDA Company FDA City FDA State FDA ZIP fuzzy.token_sort_ratio match fuzzy.ratio match
FREDDIE LEES AMERICAN GOURMET SAUCE St. Louis MO 63101 LACKEY SHEET METAL St. Louis MO 63102 LACKEY SHEET METAL 100 LACKEY SHEET METAL 100
CITYARCHRIVER 2015 FOUNDATION St. Louis MO 63102 PRIMUS STERILIZER COMPANY LLC Great Bend KS 67530
GLAXOSMITHKLINE CONSUMER HEALTHCARE St. Louis MO 63102 HELGET GAS PRODUCTS INC Omaha NE 68127
LACKEY SHEET METAL St. Louis MO 63102 ORTHOQUEST LLC La Vista NE 68128
Run Code Online (Sandbox Code Playgroud)
我试过了
combined_data['name_ratio'] = combined_data.apply(lambda x: fuzz.ratio(x['Company'], x['FDA Company']), axis = 1)
Run Code Online (Sandbox Code Playgroud)
但由于列的长度不同而出现错误.
我很难过.我怎么能做到这一点?
piR*_*red 13
我不知道你在做什么.我就是这样做的.
from fuzzywuzzy import fuzz
from fuzzywuzzy import process
Run Code Online (Sandbox Code Playgroud)
创建一系列要比较的元组:
compare = pd.MultiIndex.from_product([df1['Company'],
df2['FDA Company']]).to_series()
Run Code Online (Sandbox Code Playgroud)
创建一个特殊函数来计算模糊度量并返回一个序列.
def metrics(tup):
return pd.Series([fuzz.ratio(*tup),
fuzz.token_sort_ratio(*tup)],
['ratio', 'token'])
Run Code Online (Sandbox Code Playgroud)
适用metrics于该compare系列
compare.apply(metrics)
Run Code Online (Sandbox Code Playgroud)
下一部分有很多方法可以做到这一点:
获得与每一行最接近的匹配 df1
compare.apply(metrics).unstack().idxmax().unstack(0)
Run Code Online (Sandbox Code Playgroud)
获得与每一行最接近的匹配 df2
compare.apply(metrics).unstack(0).idxmax().unstack(0)
Run Code Online (Sandbox Code Playgroud)
| 归档时间: |
|
| 查看次数: |
6101 次 |
| 最近记录: |