dav*_*avy 5 python dataframe pandas
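I have two dataframes. Each has a timestamp index representing the start time and a duration value (in seconds) that can be used to compute the end time. The time interval and duration differ between the two dataframes, and may also vary within each dataframe.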
                     duration  param1
Start Time (UTC)
2017-10-14 02:00:31        60      95
2017-10-14 02:01:31        60      34
2017-10-14 02:02:31        60      10
2017-10-14 02:03:31        60      44
2017-10-14 02:04:31        60      63
2017-10-14 02:05:31        60      52
...
                     duration  param2
Start Time (UTC)
2017-10-14 02:00:00       300      93
2017-10-14 02:05:00       300      95
2017-10-14 02:10:00       300      91
...
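For anyone who wants to reproduce the problem, the visible rows of the two frames can be rebuilt roughly as follows. This is only a sketch: the names `df1` and `df2` match the call used later in the question, while `df1_end` and `df2_end` are illustrative names for the derived end times, and the rows hidden behind `...` are not included.

```python
import pandas as pd

# First frame: 60-second intervals starting at 02:00:31.
df1 = pd.DataFrame(
    {"duration": [60] * 6, "param1": [95, 34, 10, 44, 63, 52]},
    index=pd.date_range(
        "2017-10-14 02:00:31",
        periods=6,
        freq=pd.Timedelta(seconds=60),
        name="Start Time (UTC)",
    ),
)

# Second frame: 300-second intervals starting at 02:00:00.
df2 = pd.DataFrame(
    {"duration": [300] * 3, "param2": [93, 95, 91]},
    index=pd.date_range(
        "2017-10-14 02:00:00",
        periods=3,
        freq=pd.Timedelta(seconds=300),
        name="Start Time (UTC)",
    ),
)

# End times are not stored; they follow from start + duration.
df1_end = df1.index + pd.to_timedelta(df1["duration"].to_numpy(), unit="s")
df2_end = df2.index + pd.to_timedelta(df2["duration"].to_numpy(), unit="s")
```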
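I want to join these two dataframes so that the index and columns of the first are kept, but the parameter values of the second are copied into it using the following scheme:
For each row in the first dataframe, assign the param2 value from the first row of the (sorted) second dataframe that contains 50% or more of that row's time range.
Example output below: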
                     duration  param1  param2
Start Time (UTC)
2017-10-14 02:00:31        60      95      93
2017-10-14 02:01:31        60      34      93
2017-10-14 02:02:31        60      10      93
2017-10-14 02:03:31        60      44      93
2017-10-14 02:04:31        60      63      95
2017-10-14 02:05:31        60      52      95
...
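The 50% rule can be checked by hand for the one row where the match switches, 02:04:31. The sketch below uses a hypothetical helper, `overlap_fraction`, that is not part of pandas; it computes how much of one interval is covered by another as the earlier end minus the later start.

```python
import pandas as pd

def overlap_fraction(start_a, dur_a, start_b, dur_b):
    """Fraction of interval A (start_a, dur_a seconds) covered by interval B."""
    end_a = start_a + pd.Timedelta(seconds=dur_a)
    end_b = start_b + pd.Timedelta(seconds=dur_b)
    overlap = min(end_a, end_b) - max(start_a, start_b)
    # A negative difference means the intervals do not overlap at all.
    overlap = max(overlap, pd.Timedelta(0))
    return overlap / pd.Timedelta(seconds=dur_a)

row = pd.Timestamp("2017-10-14 02:04:31")
first = overlap_fraction(row, 60, pd.Timestamp("2017-10-14 02:00:00"), 300)
second = overlap_fraction(row, 60, pd.Timestamp("2017-10-14 02:05:00"), 300)
print(first, second)
```

The row shares 29 of its 60 seconds with the interval starting at 02:00:00 and 31 seconds with the one starting at 02:05:00, so only the second interval reaches 50%, which is why param2 is 95 for that row in the example output.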
This seems to work:
import pandas as pd

def join_on_fifty_pct_overlap(s, df):
    # s is one row of df1 (its name is the start time); df is the whole of df2.
    df = df.copy()
    s_duration_delta = pd.Timedelta(seconds=s["duration"])
    df_duration_delta = pd.to_timedelta(df["duration"], unit="s")
    s_end_time = s.name + s_duration_delta
    df_end_time = df.index + df_duration_delta
    # The overlap starts at the later of the two start times ...
    df.loc[df.index > s.name, "larger start time"] = df.loc[df.index > s.name].index
    df.loc[df.index <= s.name, "larger start time"] = s.name
    # ... and ends at the earlier of the two end times.
    df.loc[df_end_time < s_end_time, "smaller end time"] = df_end_time
    df.loc[df_end_time >= s_end_time, "smaller end time"] = s_end_time
    delta = df["smaller end time"] - df["larger start time"]
    df = df.drop(["smaller end time", "larger start time", "duration"], axis=1)
    # Keep the first df2 row that covers at least half of this row's duration.
    acceptable_overlap = delta / s_duration_delta >= 0.5
    matched_df = df[acceptable_overlap].iloc[0]
    df_final = pd.concat([s, matched_df])
    return df_final

df1.apply(join_on_fifty_pct_overlap, axis=1, args=[df2])
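The row-wise `apply` above copies the second dataframe once per row of the first. If that becomes slow, one possible alternative is to compute every pairwise overlap at once with NumPy broadcasting. The following is a sketch under assumptions, not a drop-in replacement: `join_on_overlap` and `min_fraction` are hypothetical names, the sample frames contain only the rows visible in the question, and rows with no qualifying match get NaN (which also turns `param2` into a float column) rather than raising as `.iloc[0]` would.

```python
import numpy as np
import pandas as pd

def join_on_overlap(df1, df2, min_fraction=0.5):
    """Copy param2 from the first df2 row covering >= min_fraction of each df1 row."""
    start1 = df1.index.to_numpy()[:, None]  # shape (n1, 1)
    start2 = df2.index.to_numpy()[None, :]  # shape (1, n2)
    dur1 = pd.to_timedelta(df1["duration"].to_numpy(), unit="s").to_numpy()[:, None]
    dur2 = pd.to_timedelta(df2["duration"].to_numpy(), unit="s").to_numpy()[None, :]
    # Overlap = earlier end minus later start, for every (df1 row, df2 row) pair.
    overlap = np.minimum(start1 + dur1, start2 + dur2) - np.maximum(start1, start2)
    covered = overlap / dur1 >= min_fraction  # boolean matrix, shape (n1, n2)
    first_match = covered.argmax(axis=1)      # position of the first True per row
    has_match = covered.any(axis=1)           # argmax returns 0 when nothing matches
    out = df1.copy()
    out["param2"] = np.where(has_match, df2["param2"].to_numpy()[first_match], np.nan)
    return out

# Sample data: only the rows shown in the question.
df1 = pd.DataFrame(
    {"duration": [60] * 6, "param1": [95, 34, 10, 44, 63, 52]},
    index=pd.date_range("2017-10-14 02:00:31", periods=6, freq=pd.Timedelta(seconds=60)),
)
df2 = pd.DataFrame(
    {"duration": [300] * 3, "param2": [93, 95, 91]},
    index=pd.date_range("2017-10-14 02:00:00", periods=3, freq=pd.Timedelta(seconds=300)),
)

result = join_on_overlap(df1, df2)
print(result)
```

The trade-off is memory: the intermediate arrays have one entry per pair of rows, so this only suits cases where the product of the two row counts fits comfortably in memory.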