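I have two DataFrames that I want to merge on a column. However, because of alternate spellings, differing numbers of spaces, and the absence/presence of diacritical marks, I would like to be able to merge as long as the values are similar to one another.
Any similarity algorithm will do (soundex, Levenshtein, difflib).
Say one DataFrame has the following data: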
df1 = DataFrame([[1],[2],[3],[4],[5]], index=['one','two','three','four','five'], columns=['number'])
number
one 1
two 2
three 3
four 4
five 5
df2 = DataFrame([['a'],['b'],['c'],['d'],['e']], index=['one','too','three','fours','five'], columns=['letter'])
letter
one a
too b
three c
fours d
five e
Then I want to get the resulting DataFrame:
number letter
one 1 a
two 2 b
three 3 c
four 4 d
five 5 e
And*_*den 67
Similar to @locojay's suggestion, you can apply difflib's get_close_matches to df2's index and then apply a join:
In [23]: import difflib
In [24]: difflib.get_close_matches
Out[24]: <function difflib.get_close_matches>
In [25]: df2.index = df2.index.map(lambda x: difflib.get_close_matches(x, df1.index)[0])
In [26]: df2
Out[26]:
letter
one a
two b
three c
four d
five e
In [31]: df1.join(df2)
Out[31]:
number letter
one 1 a
two 2 b
three 3 c
four 4 d
five 5 e
If these were columns, you could apply the same approach to the column and then merge:
df1 = DataFrame([[1,'one'],[2,'two'],[3,'three'],[4,'four'],[5,'five']], columns=['number', 'name'])
df2 = DataFrame([['a','one'],['b','too'],['c','three'],['d','fours'],['e','five']], columns=['letter', 'name'])
df2['name'] = df2['name'].apply(lambda x: difflib.get_close_matches(x, df1['name'])[0])
df1.merge(df2)
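The question also mentions stray spaces and diacritics, which difflib on its own does not treat specially. Here is a minimal sketch of one way to handle that, assuming it is acceptable to compare normalized copies of the keys while keeping the original spelling for the join. `normalize` and `closest_key` are hypothetical helper names, not part of difflib or pandas:

```python
import difflib
import unicodedata


def normalize(text):
    """Strip diacritics, collapse whitespace and ignore case (illustrative helper)."""
    # NFKD splits accented characters into a base character plus combining marks
    decomposed = unicodedata.normalize("NFKD", text)
    stripped = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    # split()/join collapses runs of whitespace and trims both ends
    return " ".join(stripped.split()).casefold()


def closest_key(value, candidates, cutoff=0.6):
    """Return the original candidate whose normalized form is closest, or None."""
    lookup = {normalize(c): c for c in candidates}
    matches = difflib.get_close_matches(normalize(value), list(lookup), n=1, cutoff=cutoff)
    return lookup[matches[0]] if matches else None


print(closest_key("  Sao   Paulo ", ["São Paulo", "Santiago"]))  # São Paulo
```

The result of `closest_key` could then be used in place of the bare `get_close_matches(...)[0]` call when rewriting the index or column. Note that two candidates which normalize to the same string would overwrite each other in `lookup`.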
Rob*_*inL 13
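I have written a Python package that aims to solve this problem:
pip install fuzzymatcher
Basic usage:
Given two dataframes df_left and df_right that you want to fuzzy join, you can write the following: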
from fuzzymatcher import link_table, fuzzy_left_join
# Columns to match on from df_left
left_on = ["fname", "mname", "lname", "dob"]
# Columns to match on from df_right
right_on = ["name", "middlename", "surname", "date"]
# The link table potentially contains several matches for each record
link_table(df_left, df_right, left_on, right_on)
Or, if you just want to link on the closest match:
fuzzy_left_join(df_left, df_right, left_on, right_on)
yat*_*atu 11
fuzzy_merge
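For a more general scenario in which we want to merge columns from two dataframes that contain slightly different strings, the following function uses difflib.get_close_matches along with merge to mimic the behaviour of pandas' merge, but with fuzzy matching: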
import difflib

def fuzzy_merge(df1, df2, left_on, right_on, how='inner', cutoff=0.6):
    df_other = df2.copy()
    df_other[left_on] = [get_closest_match(x, df1[left_on], cutoff)
                         for x in df_other[right_on]]
    return df1.merge(df_other, on=left_on, how=how)

def get_closest_match(x, other, cutoff):
    matches = difflib.get_close_matches(x, other, cutoff=cutoff)
    return matches[0] if matches else None
Here are some use cases with two sample dataframes:
print(df1)
key number
0 one 1
1 two 2
2 three 3
3 four 4
4 five 5
print(df2)
key_close letter
0 three c
1 one a
2 too b
3 fours d
4 a very different string e
With the above example, we'd get:
fuzzy_merge(df1, df2, left_on='key', right_on='key_close')
key number key_close letter
0 one 1 one a
1 two 2 too b
2 three 3 three c
3 four 4 fours d
We could do a left join with:
fuzzy_merge(df1, df2, left_on='key', right_on='key_close', how='left')
key number key_close letter
0 one 1 one a
1 two 2 too b
2 three 3 three c
3 four 4 fours d
4 five 5 NaN NaN
For a right join, any row of the right dataframe that has no match gets None in the left dataframe's key column:
fuzzy_merge(df1, df2, left_on='key', right_on='key_close', how='right')
key number key_close letter
0 one 1.0 one a
1 two 2.0 too b
2 three 3.0 three c
3 four 4.0 fours d
4 None NaN a very different string e
Also note that difflib.get_close_matches will return an empty list if no item is matched within the cutoff. In the shared example, if we change the last index in df2 to, say:
print(df2)
letter
one a
too b
three c
fours d
a very different string e
We'd get an index out of range error:
df2.index.map(lambda x: difflib.get_close_matches(x, df1.index)[0])
IndexError: list index out of range
To solve this, the function get_closest_match above returns the closest match by indexing the list returned by difflib.get_close_matches only if that list actually contains any matches.
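One thing to watch for with this kind of mapping: nothing stops two different strings on the right from resolving to the same key on the left, and the merge will then produce several rows for that key. Here is a small sketch to spot such collisions before merging, using only difflib. `find_collisions` is a hypothetical helper, not part of the function above:

```python
import difflib
from collections import defaultdict


def find_collisions(right_keys, left_keys, cutoff=0.6):
    """Group right-hand keys by the left-hand key they fuzzy-match to,
    keeping only the left keys that attracted more than one right key."""
    groups = defaultdict(list)
    for key in right_keys:
        matches = difflib.get_close_matches(key, left_keys, n=1, cutoff=cutoff)
        if matches:
            groups[matches[0]].append(key)
    return {left: rights for left, rights in groups.items() if len(rights) > 1}


left = ['one', 'two', 'three', 'four', 'five']
right = ['too', 'fours', 'for', 'three']
print(find_collisions(right, left))  # {'four': ['fours', 'for']}
```

Raising the cutoff, or reviewing the reported groups by hand, are two possible ways to deal with the collisions it finds.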
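I would use Jaro-Winkler, because it is one of the most performant and accurate approximate string matching algorithms currently available [Cohen, et al.], [Winkler].
This is how I would do it with Jaro-Winkler from the jellyfish package: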
import jellyfish
import pandas

def get_closest_match(x, list_strings):
    # Return the candidate with the highest Jaro-Winkler similarity to x
    best_match = None
    highest_jw = 0
    for current_string in list_strings:
        # Older jellyfish releases exposed this function as jellyfish.jaro_winkler
        current_score = jellyfish.jaro_winkler_similarity(x, current_string)
        if current_score > highest_jw:
            highest_jw = current_score
            best_match = current_string
    return best_match

df1 = pandas.DataFrame([[1],[2],[3],[4],[5]], index=['one','two','three','four','five'], columns=['number'])
df2 = pandas.DataFrame([['a'],['b'],['c'],['d'],['e']], index=['one','too','three','fours','five'], columns=['letter'])
df2.index = df2.index.map(lambda x: get_closest_match(x, df1.index))
df1.join(df2)
Output:
number letter
one 1 a
two 2 b
three 3 c
four 4 d
five 5 e
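For anyone curious what the score measures, or who cannot install jellyfish, here is a rough pure-Python sketch of the textbook Jaro-Winkler similarity (prefix weight 0.1, prefix capped at 4 characters). It is an illustration under those assumptions, not jellyfish's implementation, and it will be much slower on large inputs. The names `jaro`, `jaro_winkler` and `closest` are my own:

```python
def jaro(s1, s2):
    """Jaro similarity in [0, 1]; 1.0 means identical strings."""
    if s1 == s2:
        return 1.0
    len1, len2 = len(s1), len(s2)
    if len1 == 0 or len2 == 0:
        return 0.0
    # Characters only count as matching if they are at most `window` positions apart
    window = max(max(len1, len2) // 2 - 1, 0)
    matched1 = [False] * len1
    matched2 = [False] * len2
    matches = 0
    for i, ch in enumerate(s1):
        lo = max(0, i - window)
        hi = min(len2, i + window + 1)
        for j in range(lo, hi):
            if not matched2[j] and s2[j] == ch:
                matched1[i] = matched2[j] = True
                matches += 1
                break
    if matches == 0:
        return 0.0
    # Transpositions: matched characters that appear in a different order
    seq1 = [c for c, m in zip(s1, matched1) if m]
    seq2 = [c for c, m in zip(s2, matched2) if m]
    transpositions = sum(a != b for a, b in zip(seq1, seq2)) // 2
    return (matches / len1 + matches / len2
            + (matches - transpositions) / matches) / 3


def jaro_winkler(s1, s2, prefix_weight=0.1, max_prefix=4):
    """Jaro similarity boosted for strings that share a common prefix."""
    score = jaro(s1, s2)
    prefix = 0
    for a, b in zip(s1[:max_prefix], s2[:max_prefix]):
        if a != b:
            break
        prefix += 1
    return score + prefix * prefix_weight * (1 - score)


def closest(x, candidates):
    """Candidate with the highest Jaro-Winkler score against x."""
    return max(candidates, key=lambda c: jaro_winkler(x, c))


keys = ['one', 'two', 'three', 'four', 'five']
print(closest('too', keys), closest('fours', keys))  # two four
```

Unlike get_close_matches, `closest` always returns some candidate, so in practice a minimum score check would be a sensible addition.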
fuzzywuzzy
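Since there are no examples here with the fuzzywuzzy package, here is a function I wrote that returns all matches based on a threshold you can set as a user:
Example dataframes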
df1 = pd.DataFrame({'Key':['Apple', 'Banana', 'Orange', 'Strawberry']})
df2 = pd.DataFrame({'Key':['Aple', 'Mango', 'Orag', 'Straw', 'Bannanna', 'Berry']})
# df1
Key
0 Apple
1 Banana
2 Orange
3 Strawberry
# df2
Key
0 Aple
1 Mango
2 Orag
3 Straw
4 Bannanna
5 Berry
Function for fuzzy matching
def fuzzy_merge(df_1, df_2, key1, key2, threshold=90, limit=2):
    """
    df_1 is the left table to join
    df_2 is the right table to join
    key1 is the key column of the left table
    key2 is the key column of the right table
    threshold is how close the matches should be to return a match, based on Levenshtein distance
    limit is the amount of matches that will get returned, these are sorted high to low
    """
    s = df_2[key2].tolist()
    m = df_1[key1].apply(lambda x: process.extract(x, s, limit=limit))
    df_1['matches'] = m
    m2 = df_1['matches'].apply(lambda x: ', '.join([i[0] for i in x if i[1] >= threshold]))
    df_1['matches'] = m2
    return df_1
Using our function on the dataframes: #1
from fuzzywuzzy import fuzz
from fuzzywuzzy import process
fuzzy_merge(df1, df2, 'Key', 'Key', threshold=80)
Key matches
0 Apple Aple
1 Banana Bannanna
2 Orange Orag
3 Strawberry Straw, Berry
Using our function on the dataframes: #2
df1 = pd.DataFrame({'Col1':['Microsoft', 'Google', 'Amazon', 'IBM']})
df2 = pd.DataFrame({'Col2':['Mcrsoft', 'gogle', 'Amason', 'BIM']})
fuzzy_merge(df1, df2, 'Col1', 'Col2', 80)
Col1 matches
0 Microsoft Mcrsoft
1 Google gogle
2 Amazon Amason
3 IBM
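The threshold in the function is a score from 0 to 100. As a rough illustration of the Levenshtein idea its docstring refers to, here is a dependency-free sketch that turns an edit distance into such a score. This is an assumption-laden stand-in: `levenshtein` and `similarity` are my own names, and fuzzywuzzy's actual scoring differs in its details, so the numbers will not necessarily agree with `process.extract`:

```python
def levenshtein(a, b):
    """Minimum number of single-character insertions, deletions and substitutions."""
    if len(a) < len(b):
        a, b = b, a  # keep the inner row as short as possible
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def similarity(a, b):
    """Edit distance rescaled to 0-100, where 100 means identical strings."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 100
    return round(100 * (1 - levenshtein(a, b) / longest))


print(levenshtein('kitten', 'sitting'))  # 3
print(similarity('Apple', 'Aple'))       # 80
```

A score like this makes it easier to reason about why a given threshold accepts or rejects a pair.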
Pip
pip install fuzzywuzzy
Anaconda
conda install -c conda-forge fuzzywuzzy
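http://pandas.pydata.org/pandas-docs/dev/merging.html does not have a hook function to do this on the fly. It would be nice, though...
I would just do a separate step: use difflib's get_close_matches to create a new column in one of the two dataframes, then merge/join on the fuzzy-matched column.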
I use the fuzzymatcher package and it has worked well for me. Visit this link for more details.
Install it with the following command:
pip install fuzzymatcher
Below is the sample code (already submitted by RobinL above):
from fuzzymatcher import link_table, fuzzy_left_join
# Columns to match on from df_left
left_on = ["fname", "mname", "lname", "dob"]
# Columns to match on from df_right
right_on = ["name", "middlename", "surname", "date"]
# The link table potentially contains several matches for each record
link_table(df_left, df_right, left_on, right_on)
Errors you may get
Pros:
Cons:
There is a package called fuzzy_pandas that can use the levenshtein, jaro, metaphone and bilenco methods. There are some great examples here:
import pandas as pd
import fuzzy_pandas as fpd
df1 = pd.DataFrame({'Key':['Apple', 'Banana', 'Orange', 'Strawberry']})
df2 = pd.DataFrame({'Key':['Aple', 'Mango', 'Orag', 'Straw', 'Bannanna', 'Berry']})
results = fpd.fuzzy_merge(df1, df2,
                          left_on='Key',
                          right_on='Key',
                          method='levenshtein',
                          threshold=0.6)
results.head()
Key Key
0 Apple Aple
1 Banana Bannanna
2 Orange Orag