What methods can be used to merge on columns whose timestamps do not match exactly?
DF1:
date start_time employee_id session_id
01/01/2016 01/01/2016 06:03:13 7261824 871631182
DF2:
date start_time employee_id session_id
01/01/2016 01/01/2016 06:03:37 7261824 871631182
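I can join on ['date', 'employee_id', 'session_id'], but sometimes the same employee has several identical sessions on the same date, which produces duplicates. I could drop the rows where this happens, but then I would lose valid sessions.
Is there an efficient way to join if the DF1 timestamp is less than 5 minutes from the DF2 timestamp and session_id and employee_id also match? If a matching record exists, its timestamp will always be slightly later than DF1's, because the event is triggered at some later point in time.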
['employee_id', 'session_id', 'timestamp<5minutes']
Edit: I assumed someone would have run into this problem before.
I was thinking of doing this:
Create a 10-minute interval string to join the files on
df1['low_time'] = df1['start_time'] - timedelta(minutes=5)
df1['high_time'] = df1['start_time'] + timedelta(minutes=5)
df1['interval_string'] = df1['low_time'].astype(str) + df1['high_time'].astype(str)
Does anyone know how to round the ends of this interval to the nearest 5-minute mark?
02:59:37 - 5 min = 02:55:00
02:59:37 + 5 min = 03:05:00
interval_string = '02:55:00-03:05:00'
pd.merge(df1, df2, how = 'left', on = ['employee_id', 'session_id', 'date', 'interval_string'])
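Does anyone know how to round the time like this? It seems like it could work. You would still match on date, employee, and session, and then look for times that fall in essentially the same 10-minute interval or range.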
oso*_*uyi 17
I would try using this method in pandas: merge_asof
The parameters of interest to you would be direction, tolerance, left_on, and right_on.
Building on @Igor's answer:
import pandas as pd
from pandas import read_csv
from io import StringIO
# datetime column (combination of date + start_time)
dtc = [['date', 'start_time']]
# index column (above combination)
ixc = 'date_start_time'
df1 = read_csv(StringIO(u'''
date,start_time,employee_id,session_id
01/01/2016,02:03:00,7261824,871631182
01/01/2016,06:03:00,7261824,871631183
01/01/2016,11:01:00,7261824,871631184
01/01/2016,14:01:00,7261824,871631185
'''), parse_dates=dtc)
df2 = read_csv(StringIO(u'''
date,start_time,employee_id,session_id
01/01/2016,02:03:00,7261824,871631182
01/01/2016,06:05:00,7261824,871631183
01/01/2016,11:04:00,7261824,871631184
01/01/2016,14:10:00,7261824,871631185
'''), parse_dates=dtc)
df1['date_start_time'] = pd.to_datetime(df1['date_start_time'])
df2['date_start_time'] = pd.to_datetime(df2['date_start_time'])
# converting this to the index so we can preserve the date_start_time columns so you can validate the merging logic
df1.index = df1['date_start_time']
df2.index = df2['date_start_time']
# the magic happens below, check the direction and tolerance arguments
tol = pd.Timedelta('5 minute')
pd.merge_asof(left=df1,right=df2,right_index=True,left_index=True,direction='nearest',tolerance=tol)
date_start_time date_start_time_x employee_id_x session_id_x date_start_time_y employee_id_y session_id_y
2016-01-01 02:03:00 2016-01-01 02:03:00 7261824 871631182 2016-01-01 02:03:00 7261824.0 871631182.0
2016-01-01 06:03:00 2016-01-01 06:03:00 7261824 871631183 2016-01-01 06:05:00 7261824.0 871631183.0
2016-01-01 11:01:00 2016-01-01 11:01:00 7261824 871631184 2016-01-01 11:04:00 7261824.0 871631184.0
2016-01-01 14:01:00 2016-01-01 14:01:00 7261824 871631185 NaT NaN NaN
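The call above matches on the timestamp index only. As a minimal sketch, assuming the same sample data rebuilt by hand, `merge_asof` can also be asked to require equal `employee_id` and `session_id` through its `by` parameter, and `direction='forward'` reflects the question's statement that the DF2 timestamp is never earlier than the DF1 one. The column name `start_time_df2` is made up for this illustration.

```python
import pandas as pd

ids = {"employee_id": [7261824] * 4,
       "session_id": [871631182, 871631183, 871631184, 871631185]}

df1 = pd.DataFrame({"start_time": pd.to_datetime([
    "2016-01-01 02:03:00", "2016-01-01 06:03:00",
    "2016-01-01 11:01:00", "2016-01-01 14:01:00"]), **ids})

# df2's timestamp gets its own (hypothetical) column name so both survive the merge
df2 = pd.DataFrame({"start_time_df2": pd.to_datetime([
    "2016-01-01 02:03:00", "2016-01-01 06:05:00",
    "2016-01-01 11:04:00", "2016-01-01 14:10:00"]), **ids})

merged = pd.merge_asof(
    df1.sort_values("start_time"),        # both frames must be sorted on the time key
    df2.sort_values("start_time_df2"),
    left_on="start_time",
    right_on="start_time_df2",
    by=["employee_id", "session_id"],     # exact match required on these keys
    direction="forward",                  # only look at df2 rows at or after the df1 time
    tolerance=pd.Timedelta(minutes=5),    # and no more than 5 minutes later
)
print(merged)
```

With this data the first three rows find a partner, while the last row (9 minutes apart) is left with `NaT` in `start_time_df2`.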
Consider the following mini-version of your problem:
from io import StringIO
from pandas import read_csv, to_datetime
# how close do sessions have to be to be considered equal? (in minutes)
threshold = 5
# datetime column (combination of date + start_time)
dtc = [['date', 'start_time']]
# index column (above combination)
ixc = 'date_start_time'
df1 = read_csv(StringIO(u'''
date,start_time,employee_id,session_id
01/01/2016,02:03:00,7261824,871631182
01/01/2016,06:03:00,7261824,871631183
01/01/2016,11:01:00,7261824,871631184
01/01/2016,14:01:00,7261824,871631185
'''), parse_dates=dtc)
df2 = read_csv(StringIO(u'''
date,start_time,employee_id,session_id
01/01/2016,02:03:00,7261824,871631182
01/01/2016,06:05:00,7261824,871631183
01/01/2016,11:04:00,7261824,871631184
01/01/2016,14:10:00,7261824,871631185
'''), parse_dates=dtc)
This gives
>>> df1
date_start_time employee_id session_id
0 2016-01-01 02:03:00 7261824 871631182
1 2016-01-01 06:03:00 7261824 871631183
2 2016-01-01 11:01:00 7261824 871631184
3 2016-01-01 14:01:00 7261824 871631185
>>> df2
date_start_time employee_id session_id
0 2016-01-01 02:03:00 7261824 871631182
1 2016-01-01 06:05:00 7261824 871631183
2 2016-01-01 11:04:00 7261824 871631184
3 2016-01-01 14:10:00 7261824 871631185
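You would like df2[0:3] to be regarded as duplicates of df1[0:3] when merging (since each pair is less than 5 minutes apart), but df1[3] and df2[3] to be regarded as separate sessions.
This is essentially what you suggest in your edit. You want to map the timestamps in both tables to a 10-minute interval centered on the timestamp rounded to the nearest 5 minutes.
Each interval can be represented uniquely by its midpoint, so you can merge the data frames on the timestamp rounded to the nearest 5 minutes. For example: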
import numpy as np
# threshold (half the width of the 10-minute interval) in nanoseconds
threshold_ns = threshold * 60 * 1e9
# compute "interval" to which each session belongs
df1['interval'] = to_datetime(np.round(df1.date_start_time.astype(np.int64) / threshold_ns) * threshold_ns)
df2['interval'] = to_datetime(np.round(df2.date_start_time.astype(np.int64) / threshold_ns) * threshold_ns)
# join
cols = ['interval', 'employee_id', 'session_id']
print(df1.merge(df2, on=cols, how='outer')[cols])
which prints
interval employee_id session_id
0 2016-01-01 02:05:00 7261824 871631182
1 2016-01-01 06:05:00 7261824 871631183
2 2016-01-01 11:00:00 7261824 871631184
3 2016-01-01 14:00:00 7261824 871631185
4 2016-01-01 11:05:00 7261824 871631184
5 2016-01-01 14:10:00 7261824 871631185
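Note that this is not entirely correct. The sessions df1[2] and df2[2] are not regarded as duplicates even though they are only 3 minutes apart. This is because they lie on different sides of an interval boundary.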
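The boundary effect can be checked in isolation. The sketch below takes the two timestamps of df1[2] and df2[2] from the sample data and rounds them with `Timestamp.round`, which is a substitute for the nanosecond arithmetic above chosen for brevity, not the code used in this answer.

```python
import pandas as pd

a = pd.Timestamp("2016-01-01 11:01:00")  # df1[2]
b = pd.Timestamp("2016-01-01 11:04:00")  # df2[2]

gap = b - a                  # actual distance between the two sessions
bin_a = a.round("5min")      # nearest 5-minute mark for a
bin_b = b.round("5min")      # nearest 5-minute mark for b

print(gap, bin_a, bin_b)
```

The two timestamps are 3 minutes apart, yet one rounds down to 11:00 and the other up to 11:05, so an equality join on the rounded value cannot pair them.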
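Here is another approach, which depends on the condition that each session in df1 has either zero or one duplicate in df2.
We replace each timestamp in df1 with the closest timestamp in df2 that matches on employee_id and session_id and is less than 5 minutes away.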
from datetime import timedelta
# get closest match from "df2" to row from "df1" (as long as it's within the threshold)
def closest(row):
    # candidate rows: same employee and same session
    matches = df2.loc[(df2.employee_id == row.employee_id) &
                      (df2.session_id == row.session_id)]
    # absolute differences, so a much earlier df2 timestamp cannot pass the filter
    deltas = (matches.date_start_time - row.date_start_time).abs()
    deltas = deltas.loc[deltas <= timedelta(minutes=threshold)]
    try:
        return matches.loc[deltas.idxmin()]
    except ValueError:  # no items
        return row
# replace timestamps in "df1" with closest timestamps in "df2"
df1 = df1.apply(closest, axis=1)
# join
cols = ['date_start_time', 'employee_id', 'session_id']
print(df1.merge(df2, on=cols, how='outer')[cols])
which prints
date_start_time employee_id session_id
0 2016-01-01 02:03:00 7261824 871631182
1 2016-01-01 06:05:00 7261824 871631183
2 2016-01-01 11:04:00 7261824 871631184
3 2016-01-01 14:01:00 7261824 871631185
4 2016-01-01 14:10:00 7261824 871631185
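This approach is significantly slower, since you have to search all of df2 for every row in df1. What I have written can probably be optimized further, but it will still take a long time on large datasets.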
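One way to avoid the row-by-row search, sketched here under the assumption that only a handful of rows share each (employee_id, session_id) pair, is to join on the two keys first and filter on the time difference afterwards. This is a different technique from the `closest` function above, not an optimization of it, and the names `pairs` and `close` are made up for the example. The intermediate frame grows with the number of rows per key pair, so it may be unsuitable when many sessions share the same keys.

```python
import pandas as pd

threshold = pd.Timedelta(minutes=5)

ids = {"employee_id": [7261824] * 4,
       "session_id": [871631182, 871631183, 871631184, 871631185]}
df1 = pd.DataFrame({"date_start_time": pd.to_datetime([
    "2016-01-01 02:03:00", "2016-01-01 06:03:00",
    "2016-01-01 11:01:00", "2016-01-01 14:01:00"]), **ids})
df2 = pd.DataFrame({"date_start_time": pd.to_datetime([
    "2016-01-01 02:03:00", "2016-01-01 06:05:00",
    "2016-01-01 11:04:00", "2016-01-01 14:10:00"]), **ids})

# pair every df1 row with every df2 row that has the same keys
pairs = df1.merge(df2, on=["employee_id", "session_id"], suffixes=("_1", "_2"))

# keep only the pairs whose timestamps are within the threshold of each other
time_gap = (pairs["date_start_time_2"] - pairs["date_start_time_1"]).abs()
close = pairs[time_gap <= threshold]
print(close)
```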