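I have two data frames (logs and failures) that I would like to merge, so that logs gains a column whose value is the closest date found in failures.
The code to generate logs, failures, and the desired output is below: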
import pandas as pd
# Log records; the timestamps are day-first strings (DD/MM/YYYY)
logs=pd.DataFrame({'date-time':pd.Series(['23/10/2015 10:20:54','22/10/2015 09:51:32','21/10/2015 06:51:32','28/10/2015 16:59:32','25/10/2015 04:41:32','24/10/2015 11:50:11']),'var1':pd.Series([0,1,3,1,2,4])})
logs['date-time']=pd.to_datetime(logs['date-time'], dayfirst=True)
# Known failure dates
failures=pd.DataFrame({'date':pd.Series(['23/10/2015 00:00:00','22/10/2015 00:00:00','21/10/2015 00:00:00']),'failure':pd.Series([1,1,1])})
failures['date']=pd.to_datetime(failures['date'], dayfirst=True)
# Desired result: logs plus the closest failure date for each row
output=pd.DataFrame({'date-time':pd.Series(['23/10/2015 10:20:54','22/10/2015 09:51:32','21/10/2015 06:51:32','28/10/2015 16:59:32','25/10/2015 04:41:32','24/10/2015 11:50:11']),'var1':pd.Series([0,1,3,1,2,4]),'closest_failure':pd.Series(['23/10/2015 00:00:00','22/10/2015 00:00:00','21/10/2015 00:00:00','23/10/2015 00:00:00','23/10/2015 00:00:00','23/10/2015 00:00:00'])})
output['date-time']=pd.to_datetime(output['date-time'], dayfirst=True)
Any ideas? The real dataset is quite large, so efficiency is also a concern.
You can use reindex with method="nearest". There may be a neater way, but a Series that holds the failure dates as both its index and its values works:
In [11]: failures_dt = pd.Series(failures["date"].values, failures["date"])
In [12]: failures_dt.reindex(logs["date-time"], method="nearest")
Out[12]:
date-time
2015-10-23 10:20:54 2015-10-23
2015-10-22 09:51:32 2015-10-22
2015-10-21 06:51:32 2015-10-21
2015-10-28 16:59:32 2015-10-23
2015-10-25 04:41:32 2015-10-23
2015-10-24 11:50:11 2015-10-23
dtype: datetime64[ns]
In [13]: logs["nearest"] = failures_dt.reindex(logs["date-time"], method="nearest").values
In [14]: logs
Out[14]:
date-time var1 nearest
0 2015-10-23 10:20:54 0 2015-10-23
1 2015-10-22 09:51:32 1 2015-10-22
2 2015-10-21 06:51:32 3 2015-10-21
3 2015-10-28 16:59:32 1 2015-10-23
4 2015-10-25 04:41:32 2 2015-10-23
5 2015-10-24 11:50:11 4 2015-10-23
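For reference, here is the same reindex idea as a self-contained script. It is only a sketch: `nearest_failure` is a hypothetical helper name, and the de-duplicating and sorting step is my assumption about what larger real data might need, since reindex with a `method` expects a unique, monotonic index.

```python
import pandas as pd

logs = pd.DataFrame({
    "date-time": pd.to_datetime(
        ["23/10/2015 10:20:54", "22/10/2015 09:51:32", "21/10/2015 06:51:32",
         "28/10/2015 16:59:32", "25/10/2015 04:41:32", "24/10/2015 11:50:11"],
        dayfirst=True),
    "var1": [0, 1, 3, 1, 2, 4],
})
failures = pd.DataFrame({
    "date": pd.to_datetime(
        ["23/10/2015 00:00:00", "22/10/2015 00:00:00", "21/10/2015 00:00:00"],
        dayfirst=True),
    "failure": [1, 1, 1],
})


def nearest_failure(log_times, failure_dates):
    """Hypothetical helper: nearest failure date for each log timestamp."""
    # Assumption: real data may contain repeated or unsorted failure dates,
    # so make the lookup index unique and increasing first.
    dates = pd.DatetimeIndex(failure_dates).drop_duplicates().sort_values()
    # Series whose index and values are both the failure dates.
    lookup = pd.Series(dates, index=dates)
    # For every log timestamp, pick the entry with the nearest index label.
    return lookup.reindex(pd.DatetimeIndex(log_times), method="nearest").to_numpy()


# Positional assignment, so the original row order of logs is kept.
logs["nearest"] = nearest_failure(logs["date-time"], failures["date"])
print(logs)
```

Because the result is assigned as a plain array, logs does not need to be sorted beforehand.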
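In pandas >= 0.19.0 you can now use pandas.merge_asof to get close matches. With 0.19 you can only get the most recent failure value at or before the log value. With 0.20, however, you can get the nearest value in either direction. From the documentation:
Perform an asof merge. This is similar to a left-join except that we match on nearest key rather than equal keys.
For each row in the left DataFrame, we select the last row in the right DataFrame whose 'on' key is less than or equal to the left's key. Both DataFrames must be sorted by the key.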
In [3]: failures.sort_values("date", inplace=True)
In [6]: logs2=pd.DataFrame({'date-time':pd.Series(['23/10/2015 10:20:54','22/10/2015 09:51:32','21/10/2015 06:51:32','28/10/2015 16:59:32','25/10/2015 04:41:32','24/10/2015 11:50:11', "20/10/2015 01:02:03"]),'var1':pd.Series([0,1,3,1,2,4, 99])})
In [7]: logs2['date-time']=pd.to_datetime(logs2['date-time'])
In [8]: logs2.sort_values("date-time", inplace=True)
In [9]: logs2
Out[9]:
date-time var1
6 2015-10-20 01:02:03 99
2 2015-10-21 06:51:32 3
1 2015-10-22 09:51:32 1
0 2015-10-23 10:20:54 0
5 2015-10-24 11:50:11 4
4 2015-10-25 04:41:32 2
3 2015-10-28 16:59:32 1
In [10]: pd.merge_asof(logs2, failures, left_on="date-time", right_on="date")
Out[10]:
date-time var1 date failure
0 2015-10-20 01:02:03 99 NaT NaN
1 2015-10-21 06:51:32 3 2015-10-21 1.0
2 2015-10-22 09:51:32 1 2015-10-22 1.0
3 2015-10-23 10:20:54 0 2015-10-23 1.0
4 2015-10-24 11:50:11 4 2015-10-23 1.0
5 2015-10-25 04:41:32 2 2015-10-23 1.0
6 2015-10-28 16:59:32 1 2015-10-23 1.0
In [11]: pd.merge_asof(logs2, failures, left_on="date-time", right_on="date", direction="nearest")
Out[11]:
date-time var1 date failure
0 2015-10-20 01:02:03 99 2015-10-21 1
1 2015-10-21 06:51:32 3 2015-10-21 1
2 2015-10-22 09:51:32 1 2015-10-22 1
3 2015-10-23 10:20:54 0 2015-10-23 1
4 2015-10-24 11:50:11 4 2015-10-23 1
5 2015-10-25 04:41:32 2 2015-10-23 1
6 2015-10-28 16:59:32 1 2015-10-23 1
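Note that merge_asof returns the rows in sorted key order, while the desired output in the question keeps the original order of logs. One way to get that back is sketched below. `add_closest_failure` and the temporary `_row` column are hypothetical names of mine, and the sketch assumes pandas >= 0.20 (for direction="nearest") and no missing timestamps in either frame.

```python
import pandas as pd

logs = pd.DataFrame({
    "date-time": pd.to_datetime(
        ["23/10/2015 10:20:54", "22/10/2015 09:51:32", "21/10/2015 06:51:32",
         "28/10/2015 16:59:32", "25/10/2015 04:41:32", "24/10/2015 11:50:11"],
        dayfirst=True),
    "var1": [0, 1, 3, 1, 2, 4],
})
failures = pd.DataFrame({
    "date": pd.to_datetime(
        ["23/10/2015 00:00:00", "22/10/2015 00:00:00", "21/10/2015 00:00:00"],
        dayfirst=True),
    "failure": [1, 1, 1],
})


def add_closest_failure(logs, failures):
    """Hypothetical helper: add a closest_failure column, keeping row order."""
    # Remember each row's original position, then sort by the merge key
    # (merge_asof requires both sides to be sorted on their keys).
    left = logs.assign(_row=range(len(logs))).sort_values("date-time")
    right = (failures[["date"]]
             .sort_values("date")
             .rename(columns={"date": "closest_failure"}))
    merged = pd.merge_asof(left, right,
                           left_on="date-time", right_on="closest_failure",
                           direction="nearest")
    # Restore the original order and index labels, drop the helper column.
    merged = merged.sort_values("_row").drop(columns="_row")
    merged.index = logs.index
    return merged


result = add_closest_failure(logs, failures)
print(result)
```

Sorting copies of the frames instead of using inplace=True also leaves the caller's logs and failures untouched.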