Is it possible to do a fuzzy-match merge with Python pandas?

Question

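I have two DataFrames that I want to merge on a column. However, because of alternate spellings, differing numbers of spaces, and the presence or absence of diacritical marks, I would like the merge to succeed as long as the values are similar to one another.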

Any similarity algorithm will do (soundex, Levenshtein, difflib).

Say one DataFrame has the following data:

from pandas import DataFrame

df1 = DataFrame([[1],[2],[3],[4],[5]], index=['one','two','three','four','five'], columns=['number'])

       number
one         1
two         2
three       3
four        4
five        5

df2 = DataFrame([['a'],['b'],['c'],['d'],['e']], index=['one','too','three','fours','five'], columns=['letter'])

      letter
one        a
too        b
three      c
fours      d
five       e

Then I would like to get the resulting DataFrame:

       number letter
one         1      a
two         2      b
three       3      c
four        4      d
five        5      e

Answer 1

And*_*den 67

Similar to @locojay's suggestion, you can apply difflib's get_close_matches to df2's index and then apply a join:

In [23]: import difflib 

In [24]: difflib.get_close_matches
Out[24]: <function difflib.get_close_matches>

In [25]: df2.index = df2.index.map(lambda x: difflib.get_close_matches(x, df1.index)[0])

In [26]: df2
Out[26]: 
      letter
one        a
two        b
three      c
four       d
five       e

In [31]: df1.join(df2)
Out[31]: 
       number letter
one         1      a
two         2      b
three       3      c
four        4      d
five        5      e

If these were columns, you could apply the same idea to the column and then merge:

df1 = DataFrame([[1,'one'],[2,'two'],[3,'three'],[4,'four'],[5,'five']], columns=['number', 'name'])
df2 = DataFrame([['a','one'],['b','too'],['c','three'],['d','fours'],['e','five']], columns=['letter', 'name'])

df2['name'] = df2['name'].apply(lambda x: difflib.get_close_matches(x, df1['name'])[0])
df1.merge(df2)

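This does not work if there are several matches. (3 upvotes)
For those who say it fails, I think that is more a question of how you build this into your pipeline than a fault of the solution, which is simple and elegant. (3 upvotes)
You can use n=1 to limit the results to 1. https://docs.python.org/3/library/difflib.html#difflib.get_close_matches (2 upvotes)
How would you handle this if the two dataframes have different lengths? (2 upvotes)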
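
The concerns raised in these comments (no match at all, more than one match, key lists of different lengths) can be handled by building the key mapping separately before touching the index. The following is only a sketch using the standard library; `build_key_map` is a hypothetical helper name, not part of difflib or pandas:

```python
import difflib

def build_key_map(right_keys, left_keys, cutoff=0.6):
    """Map each right-hand key to its closest left-hand key, or None."""
    left_keys = list(left_keys)
    mapping = {}
    for key in right_keys:
        # n=1 asks difflib for at most one candidate; the list may still be empty
        matches = difflib.get_close_matches(key, left_keys, n=1, cutoff=cutoff)
        mapping[key] = matches[0] if matches else None
    return mapping

# The two key lists do not need to have the same length
left = ['one', 'two', 'three', 'four', 'five']
right = ['one', 'too', 'three', 'fours', 'five', 'a very different string']
key_map = build_key_map(right, left)
```

With pandas, one way to use the result would be `df2.index.map(key_map)` before the join; keys mapped to None would then need to be dropped or handled explicitly.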

Answer 2

Rob*_*inL 13

I have written a Python package that aims to solve this problem:

pip install fuzzymatcher

You can find the repo here and the docs here.

Basic usage:

Given two dataframes df_left and df_right that you want to fuzzy join, you can write the following:

import fuzzymatcher

# Columns to match on from df_left
left_on = ["fname", "mname", "lname",  "dob"]

# Columns to match on from df_right
right_on = ["name", "middlename", "surname", "date"]

# The link table potentially contains several matches for each record
fuzzymatcher.link_table(df_left, df_right, left_on, right_on)

Or, if you just want to link on the closest match:

fuzzymatcher.fuzzy_left_join(df_left, df_right, left_on, right_on)

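Honestly, it would be great if it did not have so many dependencies. First I had to install the Visual Studio build tools, and now I get the error: "no such module: fts4" (3 upvotes)
`name 'fuzzymatcher' is not defined` (2 upvotes)
@RobinL Could you elaborate on how to fix the "no such module: fts4" issue? I have been trying to get this working with zero success. (2 upvotes)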

Answer 3

yat*_*atu 11

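For a general approach: `fuzzy_merge`

For a more general scenario, in which we want to merge columns from two dataframes that contain slightly different strings, the following function uses difflib.get_close_matches together with merge to mimic the behaviour of pandas' merge, but with fuzzy matching: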

import difflib 

def fuzzy_merge(df1, df2, left_on, right_on, how='inner', cutoff=0.6):
    df_other= df2.copy()
    df_other[left_on] = [get_closest_match(x, df1[left_on], cutoff) 
                         for x in df_other[right_on]]
    return df1.merge(df_other, on=left_on, how=how)

def get_closest_match(x, other, cutoff):
    matches = difflib.get_close_matches(x, other, cutoff=cutoff)
    return matches[0] if matches else None

Here are some use cases with two sample dataframes:

print(df1)

     key   number
0    one       1
1    two       2
2  three       3
3   four       4
4   five       5

print(df2)

                 key_close  letter
0                    three      c
1                      one      a
2                      too      b
3                    fours      d
4  a very different string      e

With the above example, we would get:

fuzzy_merge(df1, df2, left_on='key', right_on='key_close')

     key  number key_close letter
0    one       1       one      a
1    two       2       too      b
2  three       3     three      c
3   four       4     fours      d

We could do a left join with:

fuzzy_merge(df1, df2, left_on='key', right_on='key_close', how='left')

     key  number key_close letter
0    one       1       one      a
1    two       2       too      b
2  three       3     three      c
3   four       4     fours      d
4   five       5       NaN    NaN

With a right join, all non-matching keys in the left dataframe are set to None:

fuzzy_merge(df1, df2, left_on='key', right_on='key_close', how='right')

     key  number                key_close letter
0    one     1.0                      one      a
1    two     2.0                      too      b
2  three     3.0                    three      c
3   four     4.0                    fours      d
4   None     NaN  a very different string      e

Also note that difflib.get_close_matches returns an empty list if no item matches within the cutoff. In the shared example, if we change the last index in df2 to, say:

print(df2)

                         letter
one                           a
too                           b
three                         c
fours                         d
a very different string       e

we would get an index out of range error:

df2.index.map(lambda x: difflib.get_close_matches(x, df1.index)[0])

IndexError: list index out of range

To solve this, the function get_closest_match above returns the closest match by indexing the list returned by difflib.get_close_matches only if it actually contains any matches.
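
When picking a value for cutoff, it can help to see the similarity ratio that difflib computes for each key. The sketch below uses difflib.SequenceMatcher directly; `closest_with_score` is a hypothetical helper introduced only for illustration and is not part of the fuzzy_merge function above:

```python
import difflib

def closest_with_score(x, candidates):
    """Return (best_candidate, ratio) according to difflib.SequenceMatcher."""
    best, best_ratio = None, 0.0
    for cand in candidates:
        # Same argument order that get_close_matches uses internally
        ratio = difflib.SequenceMatcher(None, cand, x).ratio()
        if ratio > best_ratio:
            best, best_ratio = cand, ratio
    return best, best_ratio

keys = ['one', 'two', 'three', 'four', 'five']
best, score = closest_with_score('fours', keys)            # a close match
_, low_score = closest_with_score('a very different string', keys)
```

Here `score` is well above the default cutoff of 0.6 while `low_score` falls below it, which is why the last row is left unmatched in the examples above.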

Answer 4

los*_*l29 9

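I would use Jaro-Winkler, because it is one of the most performant and accurate approximate string matching algorithms currently available [Cohen, et al.], [Winkler].

This is how I would do it with Jaro-Winkler from the jellyfish package: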

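import jellyfish
import pandas

def get_closest_match(x, list_strings):
  # Return the candidate with the highest Jaro-Winkler similarity to x

  best_match = None
  highest_jw = 0

  for current_string in list_strings:
    # Newer jellyfish releases expose this as jaro_winkler_similarity
    current_score = jellyfish.jaro_winkler(x, current_string)

    if current_score > highest_jw:
      highest_jw = current_score
      best_match = current_string

  return best_match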

df1 = pandas.DataFrame([[1],[2],[3],[4],[5]], index=['one','two','three','four','five'], columns=['number'])
df2 = pandas.DataFrame([['a'],['b'],['c'],['d'],['e']], index=['one','too','three','fours','five'], columns=['letter'])

df2.index = df2.index.map(lambda x: get_closest_match(x, df1.index))

df1.join(df2)

Output:

    number  letter
one     1   a
two     2   b
three   3   c
four    4   d
five    5   e
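
For readers who want to see what the score measures, here is a textbook formulation of Jaro-Winkler in plain Python. It is only an illustrative sketch: it is not jellyfish's implementation, and the two may differ in edge cases. The names `jaro`, `jaro_winkler` and `prefix_weight` are chosen for this example.

```python
def jaro(s1, s2):
    """Jaro similarity: shared characters within a window, minus transpositions."""
    if s1 == s2:
        return 1.0
    len1, len2 = len(s1), len(s2)
    if not len1 or not len2:
        return 0.0
    window = max(max(len1, len2) // 2 - 1, 0)
    matched1 = [False] * len1
    matched2 = [False] * len2
    matches = 0
    for i, ch in enumerate(s1):
        lo, hi = max(0, i - window), min(len2, i + window + 1)
        for j in range(lo, hi):
            if not matched2[j] and s2[j] == ch:
                matched1[i] = matched2[j] = True
                matches += 1
                break
    if matches == 0:
        return 0.0
    # Matched characters that line up in a different order count as transpositions
    seq1 = [c for c, m in zip(s1, matched1) if m]
    seq2 = [c for c, m in zip(s2, matched2) if m]
    transpositions = sum(a != b for a, b in zip(seq1, seq2)) // 2
    return (matches / len1 + matches / len2
            + (matches - transpositions) / matches) / 3

def jaro_winkler(s1, s2, prefix_weight=0.1):
    """Boost the Jaro score for a shared prefix of up to four characters."""
    score = jaro(s1, s2)
    prefix = 0
    for a, b in zip(s1[:4], s2[:4]):
        if a != b:
            break
        prefix += 1
    return score + prefix * prefix_weight * (1 - score)
```

The prefix boost is what makes the measure favour strings that start the same way, which tends to suit names and short keys like the ones in this question.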

Is there any way to speed this up? This code does not scale well. (2 upvotes)
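
Two cheap ways to reduce the work, sketched below under assumptions: score each distinct value only once, and skip scoring entirely when a value already matches a candidate exactly. `match_unique` is a hypothetical helper, and the default scorer uses difflib purely as a stand-in so the example runs without extra packages; a Jaro-Winkler function could be passed as `scorer` instead.

```python
import difflib

def ratio(a, b):
    # Stand-in scorer (an assumption for this sketch)
    return difflib.SequenceMatcher(None, a, b).ratio()

def match_unique(values, candidates, scorer=ratio):
    """Score each distinct value once and skip scoring for exact matches."""
    candidates = list(candidates)
    exact = set(candidates)
    cache = {}
    for value in set(values):
        if value in exact:
            cache[value] = value  # exact hit, no scoring needed
        else:
            cache[value] = max(candidates, key=lambda c: scorer(value, c))
    return [cache[v] for v in values]

result = match_unique(['one', 'too', 'too', 'fours', 'one'],
                      ['one', 'two', 'three', 'four', 'five'])
```

This does not change the quadratic worst case, but on real data with many repeated or already-clean keys it can remove most of the comparisons.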

Answer 5

Erf*_*fan 8

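Using `fuzzywuzzy`

2019 answer

Since there are no examples with the fuzzywuzzy package, here is a function I wrote that returns all matches based on a threshold you can set as a user:

Example dataframes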

import pandas as pd

df1 = pd.DataFrame({'Key':['Apple', 'Banana', 'Orange', 'Strawberry']})
df2 = pd.DataFrame({'Key':['Aple', 'Mango', 'Orag', 'Straw', 'Bannanna', 'Berry']})

# df1
          Key
0       Apple
1      Banana
2      Orange
3  Strawberry

# df2
        Key
0      Aple
1     Mango
2      Orag
3     Straw
4  Bannanna
5     Berry

Function for fuzzy matching

def fuzzy_merge(df_1, df_2, key1, key2, threshold=90, limit=2):
    """
    df_1 is the left table to join
    df_2 is the right table to join
    key1 is the key column of the left table
    key2 is the key column of the right table
    threshold is how close the matches should be to return a match, based on Levenshtein distance
    limit is the amount of matches that will get returned, these are sorted high to low
    """
    s = df_2[key2].tolist()

    m = df_1[key1].apply(lambda x: process.extract(x, s, limit=limit))    
    df_1['matches'] = m

    m2 = df_1['matches'].apply(lambda x: ', '.join([i[0] for i in x if i[1] >= threshold]))
    df_1['matches'] = m2

    return df_1
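
The docstring describes the threshold in terms of Levenshtein distance. As a rough illustration of that idea, here is the classic edit-distance computation with a simple 0 to 100 scaling. This is not fuzzywuzzy's own scoring code, and the `similarity` normalisation is an assumption made for the example:

```python
def levenshtein(a, b):
    """Classic dynamic-programming edit distance (insert, delete, substitute)."""
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]

def similarity(a, b):
    """Scale the distance to 0-100 (illustrative normalisation only)."""
    longest = max(len(a), len(b)) or 1
    return round(100 * (1 - levenshtein(a, b) / longest))
```

A single missing letter in a five-letter word costs one edit, so a higher threshold tolerates fewer such edits relative to the string length.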

Using our function on the dataframes: #1

from fuzzywuzzy import fuzz
from fuzzywuzzy import process

fuzzy_merge(df1, df2, 'Key', 'Key', threshold=80)

          Key       matches
0       Apple          Aple
1      Banana      Bannanna
2      Orange          Orag
3  Strawberry  Straw, Berry

Using our function on the dataframes: #2

df1 = pd.DataFrame({'Col1':['Microsoft', 'Google', 'Amazon', 'IBM']})
df2 = pd.DataFrame({'Col2':['Mcrsoft', 'gogle', 'Amason', 'BIM']})

fuzzy_merge(df1, df2, 'Col1', 'Col2', 80)

        Col1  matches
0  Microsoft  Mcrsoft
1     Google    gogle
2     Amazon   Amason
3        IBM

Installation:

Pip

pip install fuzzywuzzy

Anaconda

conda install -c conda-forge fuzzywuzzy

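Is there a way to carry all of df2's columns over to the matches? Say c is a primary or foreign key of table 2 (df2) that you would like to keep. (5 upvotes)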

Answer 6

loc*_*jay 5

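http://pandas.pydata.org/pandas-docs/dev/merging.html does not have a hook function to do this on the fly. It would be nice, though...

I would just do a separate step: use difflib's get_close_matches to create a new column in one of the two dataframes, then merge/join on the fuzzy-matched column.

Could you explain how to use `difflib.get_closest_matches` to create such a column and then merge on it? (2 upvotes)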
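
One possible way to build such a column, sketched with the standard library only. The values are first normalised for case, diacritics and whitespace (the differences mentioned in the question) and then matched with difflib.get_close_matches. `normalize` and `match_column` are hypothetical helper names, and the sample names are made up for illustration:

```python
import difflib
import unicodedata

def normalize(text):
    """Lowercase, strip diacritics and collapse runs of whitespace."""
    decomposed = unicodedata.normalize('NFKD', text)
    no_marks = ''.join(ch for ch in decomposed if not unicodedata.combining(ch))
    return ' '.join(no_marks.lower().split())

def match_column(values, candidates, cutoff=0.6):
    """Return, for each value, the closest original candidate (or None)."""
    by_norm = {normalize(c): c for c in candidates}
    result = []
    for value in values:
        hits = difflib.get_close_matches(normalize(value), list(by_norm),
                                         n=1, cutoff=cutoff)
        result.append(by_norm[hits[0]] if hits else None)
    return result

candidates = ['Jose García', 'Zoë Smith']
matched = match_column(['José  Garcia', 'Zoe Smyth', 'unrelated'], candidates)
```

The returned list has one entry per input value, so it could be assigned as a new column (for example `df2['match'] = match_column(df2['name'], df1['name'])`) and then used as the key for an ordinary merge.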

Answer 7

red*_*ddy 5

I use the fuzzymatcher package and it has worked well for me. Visit this link for more details.

Install it using the command below:

pip install fuzzymatcher

Below is the sample code (already submitted by RobinL above):

import fuzzymatcher

# Columns to match on from df_left
left_on = ["fname", "mname", "lname",  "dob"]

# Columns to match on from df_right
right_on = ["name", "middlename", "surname", "date"]

# The link table potentially contains several matches for each record
fuzzymatcher.link_table(df_left, df_right, left_on, right_on)

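Errors you may get

ZeroDivisionError: float division by zero ---> refer to this link to resolve it
OperationalError: no such module: fts4 --> download sqlite3.dll from here and replace the DLL file in your Python or Anaconda DLLs folder.

Pros:

Works faster. In my case, I compared one dataframe with 3,000 rows against another dataframe with 170,000 records. It also uses SQLite3 full-text search, so it is faster than many alternatives.
Can check across multiple columns and two dataframes. In my case, I was looking for the closest match based on address and company name. Sometimes the company name is the same, but it is worth checking the address too.
Gives you a score for all the closest matches for the same record. You choose the cutoff score.

Cons:

The original package installation is buggy
Required C++ and Visual Studio to be installed as well
Won't work on 64-bit Anaconda/Python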

Answer 8

cam*_*sia 5

There is a package called fuzzy_pandas that can use the levenshtein, jaro, metaphone and bilenco methods. There are some great examples here.

import pandas as pd
import fuzzy_pandas as fpd

df1 = pd.DataFrame({'Key':['Apple', 'Banana', 'Orange', 'Strawberry']})
df2 = pd.DataFrame({'Key':['Aple', 'Mango', 'Orag', 'Straw', 'Bannanna', 'Berry']})

results = fpd.fuzzy_merge(df1, df2,
            left_on='Key',
            right_on='Key',
            method='levenshtein',
            threshold=0.6)

results.head()

  Key    Key
0 Apple  Aple
1 Banana Bannanna
2 Orange Orag
