ghp*_*uru 47 python dataframe pandas
I am trying to add a column deltaT to a dataframe, where deltaT is the time difference between successive rows (as indexed in the time series).
time value
2012-03-16 23:50:00 1
2012-03-16 23:56:00 2
2012-03-17 00:08:00 3
2012-03-17 00:10:00 4
2012-03-17 00:12:00 5
2012-03-17 00:20:00 6
2012-03-20 00:43:00 7
The desired result is something like the following (deltaT shown in minutes):
time value deltaT
2012-03-16 23:50:00 1 0
2012-03-16 23:56:00 2 6
2012-03-17 00:08:00 3 12
2012-03-17 00:10:00 4 2
2012-03-17 00:12:00 5 2
2012-03-17 00:20:00 6 8
2012-03-20 00:43:00 7 23
Jef*_*eff 58
Note that this uses numpy >= 1.7; for numpy < 1.7, see the conversion described here: http://pandas.pydata.org/pandas-docs/dev/timeseries.html#time-deltas
Your original frame, with a datetime index:
In [196]: df
Out[196]:
value
2012-03-16 23:50:00 1
2012-03-16 23:56:00 2
2012-03-17 00:08:00 3
2012-03-17 00:10:00 4
2012-03-17 00:12:00 5
2012-03-17 00:20:00 6
2012-03-20 00:43:00 7
In [199]: df.index
Out[199]:
<class 'pandas.tseries.index.DatetimeIndex'>
[2012-03-16 23:50:00, ..., 2012-03-20 00:43:00]
Length: 7, Freq: None, Timezone: None
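To reproduce the frame shown above, something like the following should work. This is a minimal sketch: the timestamps and values are copied from the question, while the variable name `times` is only illustrative.

```python
import pandas as pd

# Timestamps copied from the question; `times` is an illustrative name.
times = pd.to_datetime([
    "2012-03-16 23:50:00",
    "2012-03-16 23:56:00",
    "2012-03-17 00:08:00",
    "2012-03-17 00:10:00",
    "2012-03-17 00:12:00",
    "2012-03-17 00:20:00",
    "2012-03-20 00:43:00",
])

# One integer value per timestamp, indexed by the DatetimeIndex.
df = pd.DataFrame({"value": list(range(1, 8))}, index=times)
df.index.name = "time"
print(df)
```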
Here is the timedelta64 of what you want:
In [200]: df['tvalue'] = df.index
In [201]: df['delta'] = (df['tvalue']-df['tvalue'].shift()).fillna(0)
In [202]: df
Out[202]:
value tvalue delta
2012-03-16 23:50:00 1 2012-03-16 23:50:00 00:00:00
2012-03-16 23:56:00 2 2012-03-16 23:56:00 00:06:00
2012-03-17 00:08:00 3 2012-03-17 00:08:00 00:12:00
2012-03-17 00:10:00 4 2012-03-17 00:10:00 00:02:00
2012-03-17 00:12:00 5 2012-03-17 00:12:00 00:02:00
2012-03-17 00:20:00 6 2012-03-17 00:20:00 00:08:00
2012-03-20 00:43:00 7 2012-03-20 00:43:00 3 days, 00:23:00
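Newer pandas releases may be stricter about filling a timedelta column with the plain integer 0 (this is an assumption about your installed version, not something the answer states). A hedged variant of the same step that fills the first row with an explicit zero-length `pd.Timedelta` instead:

```python
import pandas as pd

times = pd.to_datetime([
    "2012-03-16 23:50:00", "2012-03-16 23:56:00", "2012-03-17 00:08:00",
    "2012-03-17 00:10:00", "2012-03-17 00:12:00", "2012-03-17 00:20:00",
    "2012-03-20 00:43:00",
])
df = pd.DataFrame({"value": list(range(1, 8))}, index=times)

# Copy the index into a column so it can be shifted like ordinary data.
df["tvalue"] = df.index

# Row minus previous row; the first row has no predecessor (NaT),
# so it is filled with a zero-length timedelta rather than the integer 0.
df["delta"] = (df["tvalue"] - df["tvalue"].shift()).fillna(pd.Timedelta(0))
print(df)
```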
Working out the answer while ignoring the day difference (your last day is 3/20, the previous one is 3/17) is actually tricky:
In [204]: df['ans'] = df['delta'].apply(lambda x: x / np.timedelta64(1,'m')).astype('int64') % (24*60)
In [205]: df
Out[205]:
value tvalue delta ans
2012-03-16 23:50:00 1 2012-03-16 23:50:00 00:00:00 0
2012-03-16 23:56:00 2 2012-03-16 23:56:00 00:06:00 6
2012-03-17 00:08:00 3 2012-03-17 00:08:00 00:12:00 12
2012-03-17 00:10:00 4 2012-03-17 00:10:00 00:02:00 2
2012-03-17 00:12:00 5 2012-03-17 00:12:00 00:02:00 2
2012-03-17 00:20:00 6 2012-03-17 00:20:00 00:08:00 8
2012-03-20 00:43:00 7 2012-03-20 00:43:00 3 days, 00:23:00 23
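The modulo step above can be sketched on its own: convert each delta to whole minutes, then take the remainder after dividing by the number of minutes in a day (24*60 = 1440), which discards any whole days. This is an illustrative rewrite, not the exact code from the answer; the names `delta`, `minutes` and `ans` are assumptions.

```python
import pandas as pd

times = pd.to_datetime([
    "2012-03-16 23:50:00", "2012-03-16 23:56:00", "2012-03-17 00:08:00",
    "2012-03-17 00:10:00", "2012-03-17 00:12:00", "2012-03-17 00:20:00",
    "2012-03-20 00:43:00",
])

# Differences between successive timestamps, first row filled with zero.
delta = times.to_series().diff().fillna(pd.Timedelta(0))

# Total elapsed minutes per row (the last row is 3 days 23 minutes = 4343).
minutes = (delta / pd.Timedelta(minutes=1)).astype("int64")

# Drop whole days: 4343 % 1440 leaves 23.
ans = minutes % (24 * 60)
print(ans.tolist())
```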
Nic*_*eli 28
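We can use to_series to create a series whose index and values are both equal to the index keys, and then compute the differences between successive rows, which results in a timedelta64[ns] dtype. After obtaining this, the .dt accessor gives access to the seconds attribute of the time portion, and finally each element is divided by 60 to express it in minutes (optionally filling the first value with 0).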
In [13]: df['deltaT'] = df.index.to_series().diff().dt.seconds.div(60, fill_value=0)
...: df # use .astype(int) to obtain integer values
Out[13]:
value deltaT
time
2012-03-16 23:50:00 1 0.0
2012-03-16 23:56:00 2 6.0
2012-03-17 00:08:00 3 12.0
2012-03-17 00:10:00 4 2.0
2012-03-17 00:12:00 5 2.0
2012-03-17 00:20:00 6 8.0
2012-03-20 00:43:00 7 23.0
Breaking it down:
When we perform the diff:
In [8]: ser_diff = df.index.to_series().diff()
In [9]: ser_diff
Out[9]:
time
2012-03-16 23:50:00 NaT
2012-03-16 23:56:00 0 days 00:06:00
2012-03-17 00:08:00 0 days 00:12:00
2012-03-17 00:10:00 0 days 00:02:00
2012-03-17 00:12:00 0 days 00:02:00
2012-03-17 00:20:00 0 days 00:08:00
2012-03-20 00:43:00 3 days 00:23:00
Name: time, dtype: timedelta64[ns]
Seconds to minutes conversion:
In [10]: ser_diff.dt.seconds.div(60, fill_value=0)
Out[10]:
time
2012-03-16 23:50:00 0.0
2012-03-16 23:56:00 6.0
2012-03-17 00:08:00 12.0
2012-03-17 00:10:00 2.0
2012-03-17 00:12:00 2.0
2012-03-17 00:20:00 8.0
2012-03-20 00:43:00 23.0
Name: time, dtype: float64
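If you also want to include the date part, which was excluded above (only the time portion was considered), dt.total_seconds gives the elapsed duration in seconds, from which the minutes can again be computed by division.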
In [12]: ser_diff.dt.total_seconds().div(60, fill_value=0)
Out[12]:
time
2012-03-16 23:50:00 0.0
2012-03-16 23:56:00 6.0
2012-03-17 00:08:00 12.0
2012-03-17 00:10:00 2.0
2012-03-17 00:12:00 2.0
2012-03-17 00:20:00 8.0
2012-03-20 00:43:00 4343.0 # <-- number of minutes in 3 days 23 minutes
Name: time, dtype: float64
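The difference between the two accessors is easiest to see on a single value. A small sketch using the last gap from the data (the variable name `gap` is illustrative):

```python
import pandas as pd

# The final gap in the example data: 3 days and 23 minutes.
gap = pd.Timedelta("3 days 00:23:00")

print(gap.days)                  # whole-day component: 3
print(gap.seconds)               # seconds within the last partial day: 1380
print(gap.seconds / 60)          # 23.0 minutes, days ignored
print(gap.total_seconds())       # full duration in seconds: 260580.0
print(gap.total_seconds() / 60)  # 4343.0 minutes, days included
```

So `.seconds` reports only the sub-day remainder, while `.total_seconds()` covers the whole duration.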
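This requires NumPy version >= 1.7.0.
It is also possible to typecast df.index.to_series().diff() from timedelta64[ns] (nanoseconds, the default dtype) to timedelta64[m] (minutes) [a frequency conversion; astyping is equivalent to floor division]:
df['ΔT'] = df.index.to_series().diff().astype('timedelta64[m]')
value ΔT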
time
2012-03-16 23:50:00 1 NaN
2012-03-16 23:56:00 2 6.0
2012-03-17 00:08:00 3 12.0
2012-03-17 00:10:00 4 2.0
2012-03-17 00:12:00 5 2.0
2012-03-17 00:20:00 6 8.0
2012-03-20 00:43:00 7 4343.0
(ΔT dtype: float64)
If you want to convert to int, fill the NA values with 0 before converting:
>>> df.index.to_series().diff().fillna(0).astype('timedelta64[m]').astype('int')
time
2012-03-16 23:50:00 0
2012-03-16 23:56:00 6
2012-03-17 00:08:00 12
2012-03-17 00:10:00 2
2012-03-17 00:12:00 2
2012-03-17 00:20:00 8
2012-03-20 00:43:00 4343
Name: time, dtype: int64
For pandas versions > 0.24.0, it is also possible to convert to the pandas nullable integer data type (Int64):
>>> df.index.to_series().diff().astype('timedelta64[m]').astype('Int64')
time
2012-03-16 23:50:00 <NA>
2012-03-16 23:56:00 6
2012-03-17 00:08:00 12
2012-03-17 00:10:00 2
2012-03-17 00:12:00 2
2012-03-17 00:20:00 8
2012-03-20 00:43:00 4343
Name: time, dtype: Int64
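Depending on the pandas version, the cast to `timedelta64[m]` may be rejected (this is an assumption about newer releases, not something the answer states). A hedged alternative that reaches the same whole-minute results through `dt.total_seconds()` and floor division, with illustrative names `floor_minutes`, `as_int` and `as_nullable`:

```python
import pandas as pd

times = pd.to_datetime([
    "2012-03-16 23:50:00", "2012-03-16 23:56:00", "2012-03-17 00:08:00",
    "2012-03-17 00:10:00", "2012-03-17 00:12:00", "2012-03-17 00:20:00",
    "2012-03-20 00:43:00",
])
diff = times.to_series().diff()

# Whole minutes as floats; the first row stays NaN because it has no predecessor.
floor_minutes = diff.dt.total_seconds() // 60

# Plain int64: the missing first value must be filled before casting.
as_int = floor_minutes.fillna(0).astype("int64")

# Nullable Int64: the missing first value is kept as <NA>.
as_nullable = floor_minutes.astype("Int64")

print(as_int.tolist())
print(as_nullable)
```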
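The timedelta data type supports a large number of time units, as well as generic units that can be coerced into any of the other units.
Below are the date units: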
Y year
M month
W week
D day
And here are the time units:
h hour
m minute
s second
ms millisecond
us microsecond
ns nanosecond
ps picosecond
fs femtosecond
as attosecond
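A short sketch of how these unit codes behave in NumPy itself. The values are chosen only for illustration: dividing two timedeltas gives a plain ratio, while casting to a coarser unit drops the remainder.

```python
import numpy as np

one_hour = np.timedelta64(1, "h")
ninety_min = np.timedelta64(90, "m")

# True division between two timedeltas yields a float ratio.
print(one_hour / np.timedelta64(1, "m"))     # 60.0

# Casting to a coarser unit discards the remainder (90 minutes -> 1 hour).
print(ninety_min.astype("timedelta64[h]"))   # 1 hours

# Values in different units compare by the duration they represent.
print(np.timedelta64(1, "D") == np.timedelta64(24, "h"))  # True
```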
If you want the difference as a decimal, use true division, i.e. divide by np.timedelta64(1, 'm').
For example, if df is as below,
value
time
2012-03-16 23:50:21 1
2012-03-16 23:56:28 2
2012-03-17 00:08:08 3
2012-03-17 00:10:56 4
2012-03-17 00:12:12 5
2012-03-17 00:20:00 6
2012-03-20 00:43:43 7
compare astyping (floor division) with true division below.
>>> df.index.to_series().diff().astype('timedelta64[m]')
time
2012-03-16 23:50:21 NaN
2012-03-16 23:56:28 6.0
2012-03-17 00:08:08 11.0
2012-03-17 00:10:56 2.0
2012-03-17 00:12:12 1.0
2012-03-17 00:20:00 7.0
2012-03-20 00:43:43 4343.0
Name: time, dtype: float64
>>> df.index.to_series().diff()/np.timedelta64(1, 'm')
time
2012-03-16 23:50:21 NaN
2012-03-16 23:56:28 6.116667
2012-03-17 00:08:08 11.666667
2012-03-17 00:10:56 2.800000
2012-03-17 00:12:12 1.266667
2012-03-17 00:20:00 7.800000
2012-03-20 00:43:43 4343.716667
Name: time, dtype: float64
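The same comparison can be sketched without the `timedelta64[m]` cast, in case that cast is not available in your pandas version (an assumption on my part). Here the true division uses `pd.Timedelta(minutes=1)` and the floor is taken with `np.floor`; the names `true_minutes` and `floor_minutes` are illustrative.

```python
import numpy as np
import pandas as pd

# Timestamps with non-zero seconds, copied from the example above.
times = pd.to_datetime([
    "2012-03-16 23:50:21", "2012-03-16 23:56:28", "2012-03-17 00:08:08",
    "2012-03-17 00:10:56", "2012-03-17 00:12:12", "2012-03-17 00:20:00",
    "2012-03-20 00:43:43",
])
diff = times.to_series().diff()

# True division keeps the fractional minutes (e.g. 367 s -> 6.116667).
true_minutes = diff / pd.Timedelta(minutes=1)

# Flooring drops the fractional part (e.g. 6.116667 -> 6.0).
floor_minutes = np.floor(true_minutes)

print(pd.DataFrame({"true": true_minutes, "floor": floor_minutes}))
```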