Lengths of overlapping time ranges listed by rows

Mad*_*dan 5 python numpy dataframe python-datetime pandas

I am using pandas version 1.0.5.

The example dataframe below lists time ranges recorded over three days, and I am looking for the time ranges that overlap on every day.

[image: temp, for illustration]

For example, one of the time ranges that overlap on all three dates (highlighted in yellow) is 1:16 - 2:13. The other one (highlighted in blue) is 18:45 - 19:00.

So my expected output would be [57, 15], because:

  • 57 minutes between 1:16 and 2:13,
  • 15 minutes between 18:45 and 19:00.

Please use this generator for the input dataframe:

import pandas as pd
dat1 = [
    ['2023-12-27','2023-12-27 00:00:00','2023-12-27 02:14:00'],
    ['2023-12-27','2023-12-27 03:16:00','2023-12-27 04:19:00'],
    ['2023-12-27','2023-12-27 18:11:00','2023-12-27 20:13:00'],
    ['2023-12-28','2023-12-28 01:16:00','2023-12-28 02:14:00'],
    ['2023-12-28','2023-12-28 02:16:00','2023-12-28 02:28:00'],
    ['2023-12-28','2023-12-28 02:30:00','2023-12-28 02:56:00'],
    ['2023-12-28','2023-12-28 18:45:00','2023-12-28 19:00:00'],
    ['2023-12-29','2023-12-29 01:16:00','2023-12-29 02:13:00'],
    ['2023-12-29','2023-12-29 04:16:00','2023-12-29 05:09:00'],
    ['2023-12-29','2023-12-29 05:11:00','2023-12-29 05:14:00'],
    ['2023-12-29','2023-12-29 18:00:00','2023-12-29 19:00:00']
       ]
df = pd.DataFrame(dat1,columns = ['date','Start_tmp','End_tmp'])
df["Start_tmp"] = pd.to_datetime(df["Start_tmp"])
df["End_tmp"] = pd.to_datetime(df["End_tmp"])

OCa*_*OCa 3

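This solution uses:

  • numpy and no uncommon Python modules, so with pandas 1.0.5 you should, hopefully, be in the clear,
  • no nested loops, to address speed concerns as the dataset grows.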

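Method:

  • Draw the landscape of overlaps,
  • then select the overlaps whose height equals the number of days recorded,
  • finally describe the overlaps in terms of their lengths.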

Number of days recorded (as in "Python: Convert timedelta to int in a dataframe"):

n = 1 + ( max(df['End_tmp']) - min(df['Start_tmp']) ).days
n
3
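The subtraction above gives 3 for the question's data, but it relies on every calendar day between the first and last timestamp having at least one record. If some days may be missing, counting the distinct values of the `date` column may be safer. A minimal sketch on a small made-up frame (`demo` is hypothetical, not the question's data):

```python
import pandas as pd

# Hypothetical frame with a gap: nothing was recorded on 2023-12-28.
demo = pd.DataFrame({
    "date": ["2023-12-27", "2023-12-27", "2023-12-29"],
    "Start_tmp": pd.to_datetime(
        ["2023-12-27 00:00", "2023-12-27 18:11", "2023-12-29 01:16"]),
    "End_tmp": pd.to_datetime(
        ["2023-12-27 02:14", "2023-12-27 20:13", "2023-12-29 02:13"]),
})

# Calendar span between the first start and the last end (the formula above).
n_span = 1 + (demo["End_tmp"].max() - demo["Start_tmp"].min()).days

# Number of distinct days that actually have records.
n_days = demo["date"].nunique()
```

Here `n_span` is 3 while `n_days` is 2, so thresholding the landscape at `n_span` would find no overlap at all even where both recorded days agree.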

Additive landscape:

import numpy as np
import matplotlib.pyplot as plt

# initial flat whole-day landscape (height: 0), one slot per minute of the day
L = np.zeros(24*60, dtype='int')
# add up ranges: (reused @sammywemmy's perfect formula for time of day in minutes)
for start, end in zip(df['Start_tmp'].dt.hour.mul(60) + df['Start_tmp'].dt.minute,  # Start_tmp timestamps expressed in minutes
                      df['End_tmp'].dt.hour.mul(60)   + df['End_tmp'].dt.minute):   # End_tmp timestamps expressed in minutes
    L[start:end+1] += 1

plt.plot(L)
plt.hlines(y=[2,3],xmin=0,xmax=24*60,colors=['green','red'], linestyles='dashed')
plt.xlabel('time of day (minutes)')
plt.ylabel('time range overlaps')

[image: additive landscape] (Please excuse the typo: these are obviously minutes, not seconds.)
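If the single Python loop over rows ever becomes a bottleneck, the same landscape can probably be built without it by using a difference array: add +1 where each range begins, -1 just past its inclusive end, then take the running sum. This is a sketch of that swapped-in technique, not the code used above; the names `rows`, `delta` and `landscape` are made up for the example.

```python
import numpy as np
import pandas as pd

# The question's intervals as (start, end) timestamp strings.
rows = [
    ("2023-12-27 00:00", "2023-12-27 02:14"),
    ("2023-12-27 03:16", "2023-12-27 04:19"),
    ("2023-12-27 18:11", "2023-12-27 20:13"),
    ("2023-12-28 01:16", "2023-12-28 02:14"),
    ("2023-12-28 02:16", "2023-12-28 02:28"),
    ("2023-12-28 02:30", "2023-12-28 02:56"),
    ("2023-12-28 18:45", "2023-12-28 19:00"),
    ("2023-12-29 01:16", "2023-12-29 02:13"),
    ("2023-12-29 04:16", "2023-12-29 05:09"),
    ("2023-12-29 05:11", "2023-12-29 05:14"),
    ("2023-12-29 18:00", "2023-12-29 19:00"),
]
starts = pd.to_datetime([r[0] for r in rows])
ends = pd.to_datetime([r[1] for r in rows])

# Time of day in minutes, as plain NumPy integer arrays.
s = np.asarray(starts.hour * 60 + starts.minute)
e = np.asarray(ends.hour * 60 + ends.minute)

# Difference array: +1 where a range begins, -1 just past its inclusive end.
# One extra slot so that e + 1 stays in bounds for a range ending at 23:59.
delta = np.zeros(24 * 60 + 1, dtype=int)
np.add.at(delta, s, 1)       # np.add.at accumulates correctly on repeated indices
np.add.at(delta, e + 1, -1)

# The running sum rebuilds the landscape; drop the extra trailing slot.
landscape = np.cumsum(delta)[:-1]
```

On this data `landscape` matches the array `L` produced by the loop, with a maximum height of 3.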

Keep only the overlaps present on all days (red line, n=3):

# Reduce heights <n to 0 because those minutes do not overlap on every day
L[L<n]=0
# Simplify all greater values to 1 because only their presence matters
L[L>0]=1
# Now only the overlaps are highlighted
# (Actually the last assignment is disposable, provided we filter out all but the overlaps of rank n. Useful only if you were to include lower overlaps)
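One caveat worth keeping in mind: the additive landscape counts ranges, not days. In the question's data no two ranges of the same day touch or overlap, so a height of n does mean "present on every day". If that cannot be guaranteed, a per-day boolean mask may be more robust. A sketch under that assumption, on made-up data (`demo`, `start_min` and `end_min` are hypothetical names):

```python
import numpy as np
import pandas as pd

# Hypothetical data: day d1 has two ranges that overlap each other.
demo = pd.DataFrame({
    "date": ["d1", "d1", "d2"],
    "start_min": [60, 90, 100],   # minutes since midnight
    "end_min": [120, 150, 110],
})

# One row per recorded day, one column per minute of the day.
day_codes, days = pd.factorize(demo["date"])
covered = np.zeros((len(days), 24 * 60), dtype=bool)
for d, s, e in zip(day_codes, demo["start_min"], demo["end_min"]):
    covered[d, s:e + 1] = True    # same inclusive end as in the loop above

# A minute qualifies only if every day covers it.
every_day = covered.all(axis=0)
```

With this data, summing ranges would reach a height of 2 on minutes 90 to 120 from day d1 alone, whereas `every_day` is true only on minutes 100 to 110, where both days agree.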

Extract the overlap ranges and their lengths:

# Highlight edges of overlapping intervals
D = np.diff(L)
# Describe overlaps as ranges
R = list(zip([a[0]   for a in np.argwhere(D>0)],  # indices where overlaps *begin*, with scalar indices instead of arrays
             [a[0]-1 for a in np.argwhere(D<0)])) # indices where overlaps *end*, with scalar indices instead of arrays
R
[(75, 132), (1124, 1139)]
# Finally their lengths
[b-a for a,b in R]

Final output: [57, 15]
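Note that `np.diff` on the raw array cannot detect an overlap that begins at minute 0 or ends at minute 1439, because there is no edge to see at the array boundary. Padding with a zero on each side is one way around this. The helper names `runs_of_ones` and `hhmm` below are made up for this sketch, and the mask is hard-coded to mimic the thresholded landscape for the question's data:

```python
import numpy as np

def runs_of_ones(mask):
    """Return inclusive (first, last) index pairs for each run of 1s."""
    padded = np.concatenate(([0], np.asarray(mask, dtype=int), [0]))
    d = np.diff(padded)
    first = np.flatnonzero(d == 1)        # index where each run begins
    last = np.flatnonzero(d == -1) - 1    # index where each run ends
    return list(zip(first.tolist(), last.tolist()))

def hhmm(minute):
    """Format minutes since midnight as HH:MM."""
    return f"{minute // 60:02d}:{minute % 60:02d}"

# Stand-in for the thresholded landscape: 1 on 01:16-02:13 and 18:45-19:00.
mask = np.zeros(24 * 60, dtype=int)
mask[76:134] = 1
mask[1125:1141] = 1

runs = runs_of_ones(mask)
lengths = [last - first for first, last in runs]
labels = [(hhmm(first), hhmm(last)) for first, last in runs]
```

Here `runs` holds the actual first and last minute of each overlap, so the indices are one higher than in `R` above, while `lengths` is still `[57, 15]` and `labels` reads `[('01:16', '02:13'), ('18:45', '19:00')]`.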