CAR*_*ERL 5 python merge nearest-neighbor exact-match pandas
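I have two Pandas dataframes that I want to join/merge exactly on several columns (say 3) and approximately (i.e. nearest neighbor) on one (date) column. I would also like to return the difference (in days) between the matched dates. Each dataset is about 50,000 rows long. I'm most interested in an inner join, but the "leftovers" are also interesting if they aren't too hard to get. Most of the "exact match" observations will occur multiple times in each dataframe.
I've been trying difflib.get_close_matches after concatenating everything into strings (which is silly, I know!), but it doesn't always give exact matches. I think I need to loop over the exact matches first and then find the nearest match within each group, but I can't seem to get it right...
The dataframes look like: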
df1 = pd.DataFrame({
    'index': ['a1', 'a2', 'a3', 'a4'],
    'col1': ['1232', '432', '432', '123'],
    'col2': ['asd', 'dsa12', 'dsa12', 'asd2'],
    'col3': ['1', '2', '2', '3'],
    'date': ['2010-01-23', '2016-05-20', '2010-06-20', '2008-10-21'],
}).set_index('index')
df1
Out[430]:
       col1   col2 col3        date
index
a1     1232    asd    1  2010-01-23
a2      432  dsa12    2  2016-05-20
a3      432  dsa12    2  2010-06-20
a4      123   asd2    3  2008-10-21
df2 = pd.DataFrame({
    'index': ['b1', 'b2', 'b3', 'b4'],
    'col1': ['132', '432', '432', '123'],
    'col2': ['asd', 'dsa12', 'dsa12', 'sd2'],
    'col3': ['1', '2', '2', '3'],
    'date': ['2010-01-23', '2016-05-23', '2010-06-10', '2008-10-21'],
}).set_index('index')
df2
Out[434]:
      col1   col2 col3        date
index
b1     132    asd    1  2010-01-23
b2     432  dsa12    2  2016-05-23
b3     432  dsa12    2  2010-06-10
b4     123    sd2    3  2008-10-21
In the end, I'd like something like:
       col1   col2 col3        date  diff match_index
index
a1     1232    asd    1  2010-01-23   nan         nan
a2      432  dsa12    2  2016-05-20    -3          b2
a3      432  dsa12    2  2010-06-20    10          b3
a4      123   asd2    3  2008-10-21   nan         nan
a5      123    sd2    3  2008-10-21   nan          b4
Or, if it's easier with just an inner join, I'd like:
       col1   col2 col3        date  diff match_index
index
a2      432  dsa12    2  2016-05-20    -3          b2
a3      432  dsa12    2  2010-06-20    10          b3
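I'm not sure whether this is a good fit. It achieves more or less what you want, but it doesn't actually perform a merge. It follows the same idea as this question, except that instead of subsetting df1 on a single column, here we match on multiple columns using a groupby, and we do this on both dataframes. If you really want to include the merge command explicitly and are happy with an inner join, check the very bottom of the answer, which contains a snippet.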
import pandas as pd
from sklearn.neighbors import NearestNeighbors

def to_days(dates):
    # whole days since the epoch, as an (n, 1) numeric array for scikit-learn
    return dates.values.astype('datetime64[D]').astype('int64')[:, None]

def find_nearest(group, df2, groupname):
    try:
        # rows of df2 whose key columns exactly match this group
        match = df2.groupby(groupname).get_group(group.name)
    except KeyError:
        # no exact match on the key columns: return the group unchanged
        return group
    group = group.copy()
    # position of the nearest date in match for every date in group
    nbrs = NearestNeighbors(n_neighbors=1).fit(to_days(match['date']))
    dist, ind = nbrs.kneighbors(to_days(group['date']))
    group['date1'] = group['date']
    group['date'] = match['date'].values[ind.ravel()]
    group['diff'] = group['date1'] - group['date']
    group['match_index'] = match.index.values[ind.ravel()]
    return group

# change dates from string to datetime
df1['date'] = pd.to_datetime(df1['date'])
df2['date'] = pd.to_datetime(df2['date'])

# find closest dates and differences
keys = ['col1', 'col2', 'col3']
df1_mod = df1.groupby(keys, group_keys=False).apply(find_nearest, df2, keys)
df2_mod = df2.groupby(keys, group_keys=False).apply(find_nearest, df1, keys)

# fill unmatched dates (rows without a match have no date1 yet)
df1_mod['date1'] = df1_mod['date1'].fillna(df1_mod['date'])
df2_mod['date1'] = df2_mod['date1'].fillna(df2_mod['date'])

# drop the matched date and keep the original one under the name 'date'
df1_mod = df1_mod.drop('date', axis=1).rename(columns={'date1': 'date'})
df2_mod = df2_mod.drop('date', axis=1).rename(columns={'date1': 'date'})
# diff was computed from df2's side, so flip its sign
df2_mod['diff'] = -df2_mod['diff']

# drop redundant values (df2 rows that found a match in df1)
df2_mod = df2_mod[df2_mod['match_index'].isna()]

# merge the two
df_final = pd.merge(df1_mod, df2_mod, how='outer')
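To make the nearest-neighbor step inside find_nearest easier to follow, here is a minimal sketch of what it computes for a single group. It is not the code above: it swaps NearestNeighbors for a brute-force NumPy search, and the names group_dates, match_dates, delta and nearest are made up for the illustration.

```python
import numpy as np
import pandas as pd

# Dates of one df1 group and of the df2 rows that share its key columns
# (values taken from the example data; variable names are hypothetical).
group_dates = pd.to_datetime(pd.Series(['2016-05-20', '2010-06-20'], index=['a2', 'a3']))
match_dates = pd.to_datetime(pd.Series(['2016-05-23', '2010-06-10'], index=['b2', 'b3']))

# Convert both sides to whole days since the epoch.
left_days = group_dates.values.astype('datetime64[D]').astype('int64')
right_days = match_dates.values.astype('datetime64[D]').astype('int64')

# Pairwise signed differences in days: rows follow group_dates,
# columns follow match_dates.
delta = left_days[:, None] - right_days[None, :]

# Column position of the closest candidate for every row.
nearest = np.abs(delta).argmin(axis=1)

result = pd.DataFrame({
    'match_index': match_dates.index.to_numpy()[nearest],
    'diff': delta[np.arange(len(nearest)), nearest],
}, index=group_dates.index)
print(result)
```

The pairwise matrix grows with the product of the two group sizes, so this is only meant to show the idea on small groups.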
This produces the following result:
In [349]: df_final
Out[349]:
   col1   col2 col3       date    diff match_index
0  1232    asd    1 2010-01-23     NaT         NaN
1   432  dsa12    2 2016-05-20 -3 days          b2
2   432  dsa12    2 2010-06-20 10 days          b3
3   123   asd2    3 2008-10-21     NaT         NaN
4   132    asd    1 2010-01-23     NaT         NaN
5   123    sd2    3 2008-10-21     NaT         NaN
Using the merge command:
In [208]: pd.merge(df1_mod, df2.drop('date', axis=1), on=['col1', 'col2', 'col3']).drop_duplicates()
Out[208]:
  col1   col2 col3       date    diff match_index
0  432  dsa12    2 2016-05-20 -3 days          b2
2  432  dsa12    2 2010-06-20 10 days          b3
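If only the inner join is needed, the same table can probably be produced with pandas alone. The following is a hedged sketch rather than part of the approach above: it merges exactly on the key columns and then keeps, for each df1 row, the candidate with the smallest absolute date difference. The names left, right, pairs, best and match_date are assumptions introduced for the example.

```python
import pandas as pd

keys = ['col1', 'col2', 'col3']

# Same example data as in the question, with the index kept as a column.
left = pd.DataFrame({
    'index': ['a1', 'a2', 'a3', 'a4'],
    'col1': ['1232', '432', '432', '123'],
    'col2': ['asd', 'dsa12', 'dsa12', 'asd2'],
    'col3': ['1', '2', '2', '3'],
    'date': pd.to_datetime(['2010-01-23', '2016-05-20', '2010-06-20', '2008-10-21']),
})
right = pd.DataFrame({
    'match_index': ['b1', 'b2', 'b3', 'b4'],
    'col1': ['132', '432', '432', '123'],
    'col2': ['asd', 'dsa12', 'dsa12', 'sd2'],
    'col3': ['1', '2', '2', '3'],
    'match_date': pd.to_datetime(['2010-01-23', '2016-05-23', '2010-06-10', '2008-10-21']),
})

# Exact match on the key columns: every left row is paired with every
# right row that has the same keys.
pairs = left.merge(right, on=keys, how='inner')

# Signed difference in days (left date minus right date).
pairs['diff'] = (pairs['date'] - pairs['match_date']).dt.days

# For each left row, keep the candidate with the smallest absolute difference.
best = pairs.loc[pairs['diff'].abs().groupby(pairs['index']).idxmin()]
result = best.set_index('index')[keys + ['date', 'diff', 'match_index']]
print(result)
```

Because the merge pairs every left row with every right row that shares its keys, the intermediate table can become large when keys repeat many times, which the question says is the case for the real data.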
The case considered in the comments, i.e.:
df1 = pd.DataFrame({
    'index': ['a1', 'a2', 'a3', 'a4'],
    'col1': ['1232', '1432', '432', '123'],
    'col2': ['asd', 'dsa12', 'dsa12', 'asd2'],
    'col3': ['1', '2', '2', '3'],
    'date': ['2010-01-23', '2016-05-20', '2010-06-20', '2008-10-21'],
}).set_index('index')
gives the following result:
In [351]: df_final
Out[351]:
   col1   col2 col3       date    diff match_index
0  1232    asd    1 2010-01-23     NaT         NaN
1  1432  dsa12    2 2016-05-20     NaT         NaN
2   432  dsa12    2 2010-06-20 10 days          b3
3   123   asd2    3 2008-10-21     NaT         NaN
4   132    asd    1 2010-01-23     NaT         NaN
5   123    sd2    3 2008-10-21     NaT         NaN
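As a possible alternative, pandas also offers pd.merge_asof, which joins exactly on the by columns and approximately on one sorted column. This is a different technique from the groupby approach in this answer, and the sketch below assumes a pandas version in which merge_asof accepts direction='nearest'. The names left, right, nearest and match_date are introduced only for the example.

```python
import pandas as pd

keys = ['col1', 'col2', 'col3']

df1 = pd.DataFrame({
    'index': ['a1', 'a2', 'a3', 'a4'],
    'col1': ['1232', '432', '432', '123'],
    'col2': ['asd', 'dsa12', 'dsa12', 'asd2'],
    'col3': ['1', '2', '2', '3'],
    'date': ['2010-01-23', '2016-05-20', '2010-06-20', '2008-10-21'],
}).set_index('index')
df2 = pd.DataFrame({
    'index': ['b1', 'b2', 'b3', 'b4'],
    'col1': ['132', '432', '432', '123'],
    'col2': ['asd', 'dsa12', 'dsa12', 'sd2'],
    'col3': ['1', '2', '2', '3'],
    'date': ['2010-01-23', '2016-05-23', '2010-06-10', '2008-10-21'],
}).set_index('index')

for df in (df1, df2):
    df['date'] = pd.to_datetime(df['date'])

# merge_asof requires both frames to be sorted by the approximate-match column.
left = df1.reset_index().sort_values('date')
right = (df2.reset_index()
            .rename(columns={'index': 'match_index', 'date': 'match_date'})
            .sort_values('match_date'))

# Exact match on the key columns, nearest match on the date.
nearest = pd.merge_asof(left, right, left_on='date', right_on='match_date',
                        by=keys, direction='nearest')

# Signed difference in days; rows without a match get NaN.
nearest['diff'] = (nearest['date'] - nearest['match_date']).dt.days
nearest = nearest.set_index('index').sort_index()
print(nearest[keys + ['date', 'diff', 'match_index']])
```

This behaves like a left join: every df1 row is kept, and rows without an exact key match get missing values in diff and match_index. Dropping those rows with dropna(subset=['match_index']) should give the inner-join version, and the tolerance parameter can cap how far apart two dates may be.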