Calculating the time difference between Pandas DataFrame index values

ghp*_*uru 47 python dataframe pandas

I am trying to add a column deltaT to a dataframe, where deltaT is the time difference between successive rows (as indexed in the time series).

time                 value

2012-03-16 23:50:00      1
2012-03-16 23:56:00      2
2012-03-17 00:08:00      3
2012-03-17 00:10:00      4
2012-03-17 00:12:00      5
2012-03-17 00:20:00      6
2012-03-20 00:43:00      7

The desired result is something like the following (deltaT shown in minutes):

time                 value  deltaT

2012-03-16 23:50:00      1       0
2012-03-16 23:56:00      2       6
2012-03-17 00:08:00      3      12
2012-03-17 00:10:00      4       2
2012-03-17 00:12:00      5       2
2012-03-17 00:20:00      6       8
2012-03-20 00:43:00      7      23

Jef*_*eff 58

Note that this uses numpy >= 1.7; for numpy < 1.7, see the conversion described here: http://pandas.pydata.org/pandas-docs/dev/timeseries.html#time-deltas

Your original frame, with a datetime index:

In [196]: df
Out[196]: 
                     value
2012-03-16 23:50:00      1
2012-03-16 23:56:00      2
2012-03-17 00:08:00      3
2012-03-17 00:10:00      4
2012-03-17 00:12:00      5
2012-03-17 00:20:00      6
2012-03-20 00:43:00      7

In [199]: df.index
Out[199]: 
<class 'pandas.tseries.index.DatetimeIndex'>
[2012-03-16 23:50:00, ..., 2012-03-20 00:43:00]
Length: 7, Freq: None, Timezone: None
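For anyone who wants to reproduce the session, here is a minimal sketch that rebuilds the frame from the question. The variable name `times` is an assumption chosen for illustration; the timestamps and values are copied from the question's table.

```python
import pandas as pd

# Timestamps copied from the question; `times` is only an illustrative name.
times = pd.to_datetime([
    "2012-03-16 23:50:00",
    "2012-03-16 23:56:00",
    "2012-03-17 00:08:00",
    "2012-03-17 00:10:00",
    "2012-03-17 00:12:00",
    "2012-03-17 00:20:00",
    "2012-03-20 00:43:00",
])

# Build the frame with the timestamps as a DatetimeIndex.
df = pd.DataFrame({"value": list(range(1, 8))}, index=times)
df.index.name = "time"
print(df)
```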

Here is the timedelta64 you want:

In [200]: df['tvalue'] = df.index

In [201]: df['delta'] = (df['tvalue']-df['tvalue'].shift()).fillna(0)

In [202]: df
Out[202]: 
                     value              tvalue            delta
2012-03-16 23:50:00      1 2012-03-16 23:50:00         00:00:00
2012-03-16 23:56:00      2 2012-03-16 23:56:00         00:06:00
2012-03-17 00:08:00      3 2012-03-17 00:08:00         00:12:00
2012-03-17 00:10:00      4 2012-03-17 00:10:00         00:02:00
2012-03-17 00:12:00      5 2012-03-17 00:12:00         00:02:00
2012-03-17 00:20:00      6 2012-03-17 00:20:00         00:08:00
2012-03-20 00:43:00      7 2012-03-20 00:43:00 3 days, 00:23:00

Getting the answer while disregarding the day difference (your last day is 3/20, the previous one is 3/17) is actually tricky:

In [204]: df['ans'] = df['delta'].apply(lambda x: x  / np.timedelta64(1,'m')).astype('int64') % (24*60)

In [205]: df
Out[205]: 
                     value              tvalue            delta  ans
2012-03-16 23:50:00      1 2012-03-16 23:50:00         00:00:00    0
2012-03-16 23:56:00      2 2012-03-16 23:56:00         00:06:00    6
2012-03-17 00:08:00      3 2012-03-17 00:08:00         00:12:00   12
2012-03-17 00:10:00      4 2012-03-17 00:10:00         00:02:00    2
2012-03-17 00:12:00      5 2012-03-17 00:12:00         00:02:00    2
2012-03-17 00:20:00      6 2012-03-17 00:20:00         00:08:00    8
2012-03-20 00:43:00      7 2012-03-20 00:43:00 3 days, 00:23:00   23
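The same idea can be sketched without the helper column and without `apply`. This is only an illustration under stated assumptions: it fills the first gap with a zero-length `pd.Timedelta` instead of the integer `0`, and it goes through `.dt.total_seconds()` rather than dividing by `np.timedelta64`. The name `total_minutes` is hypothetical.

```python
import pandas as pd

times = pd.to_datetime([
    "2012-03-16 23:50:00", "2012-03-16 23:56:00", "2012-03-17 00:08:00",
    "2012-03-17 00:10:00", "2012-03-17 00:12:00", "2012-03-17 00:20:00",
    "2012-03-20 00:43:00",
])
df = pd.DataFrame({"value": list(range(1, 8))}, index=times)

# Gap to the previous row; the first row has no predecessor (NaT),
# so fill it with a zero-length Timedelta.
delta = df.index.to_series().diff().fillna(pd.Timedelta(0))

# Whole minutes elapsed, days included.
total_minutes = (delta.dt.total_seconds() // 60).astype("int64")

# Drop whole days with a modulo (24 * 60 minutes per day).
df["ans"] = total_minutes % (24 * 60)
print(df)
```

The modulo is what turns the last gap of 3 days and 23 minutes into 23.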

  • Not sure in which version this changed, but in newer versions of pandas you need to change `.fillna(0)` to `.fillna(pd.Timedelta('0 days'))`. (3 upvotes)

Nic*_*eli 28

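We can use to_series to create a Series whose index and values both equal the index keys, then compute the difference between successive rows, which results in a timedelta64[ns] dtype. From there, the .dt accessor gives access to the seconds attribute of the time portion, and dividing each element by 60 yields the output in minutes (optionally filling the first value with 0).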

In [13]: df['deltaT'] = df.index.to_series().diff().dt.seconds.div(60, fill_value=0)
    ...: df                                 # use .astype(int) to obtain integer values
Out[13]: 
                     value  deltaT
time                              
2012-03-16 23:50:00      1     0.0
2012-03-16 23:56:00      2     6.0
2012-03-17 00:08:00      3    12.0
2012-03-17 00:10:00      4     2.0
2012-03-17 00:12:00      5     2.0
2012-03-17 00:20:00      6     8.0
2012-03-20 00:43:00      7    23.0

Breakdown:

When we perform diff:

In [8]: ser_diff = df.index.to_series().diff()

In [9]: ser_diff
Out[9]: 
time
2012-03-16 23:50:00               NaT
2012-03-16 23:56:00   0 days 00:06:00
2012-03-17 00:08:00   0 days 00:12:00
2012-03-17 00:10:00   0 days 00:02:00
2012-03-17 00:12:00   0 days 00:02:00
2012-03-17 00:20:00   0 days 00:08:00
2012-03-20 00:43:00   3 days 00:23:00
Name: time, dtype: timedelta64[ns]

Seconds to minutes conversion:

In [10]: ser_diff.dt.seconds.div(60, fill_value=0)
Out[10]: 
time
2012-03-16 23:50:00     0.0
2012-03-16 23:56:00     6.0
2012-03-17 00:08:00    12.0
2012-03-17 00:10:00     2.0
2012-03-17 00:12:00     2.0
2012-03-17 00:20:00     8.0
2012-03-20 00:43:00    23.0
Name: time, dtype: float64

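If you also want to include the date portion, which was excluded above (only the time portion was considered), dt.total_seconds gives the full elapsed duration in seconds, from which the minutes can again be computed by division.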

In [12]: ser_diff.dt.total_seconds().div(60, fill_value=0)
Out[12]: 
time
2012-03-16 23:50:00       0.0
2012-03-16 23:56:00       6.0
2012-03-17 00:08:00      12.0
2012-03-17 00:10:00       2.0
2012-03-17 00:12:00       2.0
2012-03-17 00:20:00       8.0
2012-03-20 00:43:00    4343.0    # <-- number of minutes in 3 days 23 minutes
Name: time, dtype: float64
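The difference between the two accessors is easiest to see on a single value. The sketch below uses a scalar `pd.Timedelta` equal to the last gap in the data; the names `gap`, `partial_minutes` and `full_minutes` are illustrative only.

```python
import pandas as pd

# The last gap in the data: 3 days and 23 minutes.
gap = pd.Timedelta(days=3, minutes=23)

# .seconds is only the seconds component left over after whole days are removed.
partial_minutes = gap.seconds / 60

# .total_seconds() covers the entire duration, days included.
full_minutes = gap.total_seconds() / 60

print(partial_minutes)  # 23.0
print(full_minutes)     # 4343.0
```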


Shi*_*ith 8

Requires NumPy version >= 1.7.0.

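You can also typecast df.index.to_series().diff() from timedelta64[ns] (nanoseconds, the default dtype) to timedelta64[m] (minutes). This is a frequency conversion: astyping is equivalent to floor division.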

df['ΔT'] = df.index.to_series().diff().astype('timedelta64[m]')

                     value      ΔT
time                              
2012-03-16 23:50:00      1     NaN
2012-03-16 23:56:00      2     6.0
2012-03-17 00:08:00      3    12.0
2012-03-17 00:10:00      4     2.0
2012-03-17 00:12:00      5     2.0
2012-03-17 00:20:00      6     8.0
2012-03-20 00:43:00      7  4343.0

ΔT dtype: float64

If you want to convert to int, fill the NA values with 0 before converting:

>>> df.index.to_series().diff().fillna(0).astype('timedelta64[m]').astype('int')

time
2012-03-16 23:50:00       0
2012-03-16 23:56:00       6
2012-03-17 00:08:00      12
2012-03-17 00:10:00       2
2012-03-17 00:12:00       2
2012-03-17 00:20:00       8
2012-03-20 00:43:00    4343
Name: time, dtype: int64


With pandas 0.24.0 and later, you can also convert to the pandas nullable integer dtype (Int64):

>>> df.index.to_series().diff().astype('timedelta64[m]').astype('Int64')

time
2012-03-16 23:50:00    <NA>
2012-03-16 23:56:00       6
2012-03-17 00:08:00      12
2012-03-17 00:10:00       2
2012-03-17 00:12:00       2
2012-03-17 00:20:00       8
2012-03-20 00:43:00    4343
Name: time, dtype: Int64

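Some newer pandas releases appear to restrict `astype` on timedelta data to second-or-finer resolutions, so the `'timedelta64[m]'` cast may raise there. As a hedged alternative, the sketch below floors to whole minutes through `.dt.total_seconds()` instead; it is a substitute technique, not the cast used above, and the names `minutes_float`, `minutes_nullable` and `minutes_int` are illustrative.

```python
import pandas as pd

times = pd.to_datetime([
    "2012-03-16 23:50:00", "2012-03-16 23:56:00", "2012-03-17 00:08:00",
    "2012-03-17 00:10:00", "2012-03-17 00:12:00", "2012-03-17 00:20:00",
    "2012-03-20 00:43:00",
])
ser_diff = times.to_series().diff()

# Floor to whole minutes via float seconds; the leading NaT becomes NaN.
minutes_float = ser_diff.dt.total_seconds() // 60

# Nullable integer dtype keeps the missing first value as <NA>.
minutes_nullable = minutes_float.astype("Int64")

# A plain int64 result needs the missing value filled first.
minutes_int = minutes_float.fillna(0).astype("int64")

print(minutes_nullable)
print(minutes_int)
```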

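The timedelta data type supports a large number of time units, as well as generic units that can be coerced into any of the other units.

Below are the date units: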

Y   year
M   month
W   week
D   day

Below are the time units:

h   hour
m   minute
s   second
ms  millisecond
us  microsecond
ns  nanosecond
ps  picosecond
fs  femtosecond
as  attosecond
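To illustrate how these unit codes behave in plain NumPy, here is a small sketch. The variable names are illustrative, and only positive durations are shown.

```python
import numpy as np

one_hour = np.timedelta64(1, "h")

# Dividing one timedelta by another gives a plain float ratio.
minutes_per_hour = one_hour / np.timedelta64(1, "m")

# Casting an array to a coarser unit drops the remainder (90 s becomes 1 min).
ninety_seconds = np.array([90], dtype="timedelta64[s]")
as_minutes = ninety_seconds.astype("timedelta64[m]")

print(minutes_per_hour)  # 60.0
print(as_minutes)
```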

If you want the difference as a fractional value, use true division, i.e., divide by np.timedelta64(1, 'm').
For example, if df is as follows,

                     value
time                      
2012-03-16 23:50:21      1
2012-03-16 23:56:28      2
2012-03-17 00:08:08      3
2012-03-17 00:10:56      4
2012-03-17 00:12:12      5
2012-03-17 00:20:00      6
2012-03-20 00:43:43      7


Compare astyping (floor division) with true division below.

>>> df.index.to_series().diff().astype('timedelta64[m]')
time
2012-03-16 23:50:21       NaN
2012-03-16 23:56:28       6.0
2012-03-17 00:08:08      11.0
2012-03-17 00:10:56       2.0
2012-03-17 00:12:12       1.0
2012-03-17 00:20:00       7.0
2012-03-20 00:43:43    4343.0
Name: time, dtype: float64

>>> df.index.to_series().diff()/np.timedelta64(1, 'm')
time
2012-03-16 23:50:21            NaN
2012-03-16 23:56:28       6.116667
2012-03-17 00:08:08      11.666667
2012-03-17 00:10:56       2.800000
2012-03-17 00:12:12       1.266667
2012-03-17 00:20:00       7.800000
2012-03-20 00:43:43    4343.716667
Name: time, dtype: float64

