How can I merge two pandas DataFrames based on a similarity function?

Question

Pas*_*ten 7 python merge fuzzy-comparison pandas

Given dataset 1

name,x,y
st. peter,1,2
big university portland,3,4

and dataset 2

name,x,y
saint peter,3,4
uni portland,5,6

the goal is to merge

d1.merge(d2, on="name", how="left")

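However, there are no exactly matching names, so I would like to do some kind of fuzzy matching. The matching technique itself doesn't matter much here; what matters more is how to incorporate it efficiently into pandas.

For example, st. peter might match saint peter in the other dataset, but big university portland may deviate too much for us to match it with uni portland.

One way to think of it is to allow joining on the lowest Levenshtein distance, but only if the number of edits is below 5 (st. --> saint is 4).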
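To make the edit counts above concrete, here is a minimal sketch of the Levenshtein distance in plain Python. The function name `levenshtein` is only illustrative and not from any particular library; a dedicated package would likely be faster on large inputs.

```python
def levenshtein(a: str, b: str) -> int:
    """Return the minimum number of single-character edits turning a into b."""
    # prev[j] is the distance between the prefix of a handled so far and b[:j].
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]  # turning a[:i] into the empty string takes i deletions
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute (or keep) ca
        prev = curr
    return prev[-1]


print(levenshtein("st. peter", "saint peter"))                 # 4
print(levenshtein("big university portland", "uni portland"))  # 11
```

With a cutoff of fewer than 5 edits, the first pair would be joined and the second would not.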

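The resulting DataFrame should contain only the st. peter row, and include both "name" variants as well as the x and y variables.

Is there a way to do this kind of merge using pandas?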

Answer 1

maj*_*ajr 5

Have you looked at fuzzywuzzy?

You might do something like:

import pandas as pd
import fuzzywuzzy.process as fwp

# df1 and df2 are assumed to be the two DataFrames, each with a 'name' column
choices = list(df2['name'])

def fmatch(row):
    minscore = 95  # or whatever score works for you
    # use row['name']: inside apply(axis=1), row.name is the row's index label
    choice, score = fwp.extractOne(row['name'], choices)
    return choice if score > minscore else None

df1['df2_name'] = df1.apply(fmatch, axis=1)
merged = pd.merge(df1,
                  df2,
                  left_on='df2_name',
                  right_on='name',
                  suffixes=('_df1', '_df2'),
                  how='outer')  # assuming you want to keep unmatched records

Caveat emptor: I haven't tried running this.

Answer 2

Sto*_*ica 1

Suppose you have a function that returns the best match (if there is one) and None otherwise:

def best_match(s, candidates):
    ''' Return the item in candidates that best matches s.

    Will return None if a good enough match is not found.
    '''
    # Some code here.

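One possible way to fill in that stub, as a sketch only: the version below uses `difflib.get_close_matches` from the standard library as a stand-in similarity measure. This is not the Levenshtein distance, and the `cutoff` parameter and its default of 0.75 are assumptions to tune for your data.

```python
from difflib import get_close_matches


def best_match(s, candidates, cutoff=0.75):
    """Return the item in candidates that best matches s, or None."""
    # get_close_matches returns up to n candidates whose similarity ratio
    # is at least cutoff, best first; the list is empty if none qualifies.
    matches = get_close_matches(s, list(candidates), n=1, cutoff=cutoff)
    return matches[0] if matches else None


names = ["saint peter", "uni portland"]
print(best_match("st. peter", names))                # saint peter
print(best_match("big university portland", names))  # None
```

Because `candidates` is converted with `list(...)`, a pandas Series such as `df2['name']` should work as well as a plain list.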

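You can then join on the value it returns, but there are different ways to do so, and they lead to different output (or so I think; I haven't looked at this much):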

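# Option 1: replace each name in df1 with its best match from df2, then merge
(df1.assign(name=df1['name'].apply(lambda x: best_match(x, df2['name'])))
 .merge(df2, on='name', how='left'))

# Option 2: replace each name in df2 with its best match from df1, then merge
(df1.merge(df2.assign(name=df2['name'].apply(lambda x: best_match(x, df1['name']))),
           on='name', how='left'))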

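For the output the question asks for (only the st. peter row, with both name variants kept), another route is to score every pair of rows and filter on the score. The following is a rough sketch under several assumptions: it uses `difflib.SequenceMatcher` as a stand-in similarity function rather than Levenshtein distance, the `MIN_RATIO` cutoff and the `_d1`/`_d2` suffixes are arbitrary choices, and `how="cross"` needs pandas 1.2 or later.

```python
from difflib import SequenceMatcher

import pandas as pd

d1 = pd.DataFrame({"name": ["st. peter", "big university portland"],
                   "x": [1, 3], "y": [2, 4]})
d2 = pd.DataFrame({"name": ["saint peter", "uni portland"],
                   "x": [3, 5], "y": [4, 6]})

MIN_RATIO = 0.75  # assumed cutoff; tune for your data

# Pair every row of d1 with every row of d2 (requires pandas >= 1.2).
pairs = d1.merge(d2, how="cross", suffixes=("_d1", "_d2"))

# Score each pair with the similarity function of your choice.
pairs["score"] = [SequenceMatcher(None, a, b).ratio()
                  for a, b in zip(pairs["name_d1"], pairs["name_d2"])]

# Keep pairs at or above the cutoff, then the best candidate per d1 row.
good = pairs[pairs["score"] >= MIN_RATIO]
merged = (good.sort_values("score", ascending=False)
              .drop_duplicates("name_d1")
              .reset_index(drop=True))
print(merged)
```

A cross join builds len(d1) * len(d2) rows, so this is only practical for modest table sizes; for larger data, some form of pre-filtering of candidate pairs would likely be needed first.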
