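I'm analyzing a time series and, based on some criteria, I can pick out the rows where an event starts or ends. At this point, my series looks like this (I've left out some repeated values for brevity):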
import numpy as np
import pandas
from pandas import Timestamp
datadict = {'event': {
Timestamp('2010-01-01 00:20:00', tz=None): 'event start',
Timestamp('2010-01-01 00:30:00', tz=None): '--',
Timestamp('2010-01-01 00:40:00', tz=None): '--',
Timestamp('2010-01-01 00:50:00', tz=None): '--',
Timestamp('2010-01-01 01:00:00', tz=None): '--',
Timestamp('2010-01-01 01:10:00', tz=None): 'event end',
Timestamp('2010-01-01 01:20:00', tz=None): '--',
Timestamp('2010-01-01 02:20:00', tz=None): '--',
Timestamp('2010-01-01 02:30:00', tz=None): 'event start',
Timestamp('2010-01-01 02:40:00', tz=None): '--',
Timestamp('2010-01-01 02:50:00', tz=None): '--',
Timestamp('2010-01-01 03:00:00', tz=None): '--',
Timestamp('2010-01-01 03:10:00', tz=None): '--',
Timestamp('2010-01-01 03:20:00', tz=None): '--',
Timestamp('2010-01-01 03:30:00', tz=None): 'event end',
}}
data = pandas.DataFrame.from_dict(datadict)
event
2010-01-01 00:20:00 event start
2010-01-01 00:30:00 --
2010-01-01 00:40:00 --
2010-01-01 00:50:00 --
2010-01-01 01:00:00 --
2010-01-01 01:10:00 event end
2010-01-01 01:20:00 --
2010-01-01 02:20:00 --
2010-01-01 02:30:00 event start
2010-01-01 02:40:00 --
2010-01-01 02:50:00 --
2010-01-01 03:00:00 --
2010-01-01 03:10:00 --
2010-01-01 03:20:00 --
2010-01-01 03:30:00 event end
What I'd like to end up with is an event number column like this (ideally without a `for` loop):
event event number
2010-01-01 00:20:00 event start 1
2010-01-01 00:30:00 -- 1
2010-01-01 00:40:00 -- 1
2010-01-01 00:50:00 -- 1
2010-01-01 01:00:00 -- 1
2010-01-01 01:10:00 event end 1
2010-01-01 01:20:00 -- NA
2010-01-01 02:20:00 -- NA
2010-01-01 02:30:00 event start 2
2010-01-01 02:40:00 -- 2
2010-01-01 02:50:00 -- 2
2010-01-01 03:00:00 -- 2
2010-01-01 03:10:00 -- 2
2010-01-01 03:20:00 -- 2
2010-01-01 03:30:00 event end 2
2010-01-01 03:40:00 -- NA
2010-01-01 03:50:00 -- NA
With some optimistic assumptions about the quality of my data, I can get the event numbers like this:
table = data[data.event != '--'].reset_index()
table['event number'] = 1 + np.floor(table.index / 2)
table = table.set_index('index')
event event number
index
2010-01-01 00:20:00 event start 1
2010-01-01 01:10:00 event end 1
2010-01-01 02:30:00 event start 2
2010-01-01 03:30:00 event end 2
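The "optimistic assumptions" here amount to the markers strictly alternating: every `event start` is followed by exactly one `event end` before the next start. A minimal sketch of how that could be checked before numbering; the helper name `markers_alternate` and the toy series are made up for illustration and are not part of the original data:

```python
import pandas as pd

def markers_alternate(events, start="event start", end="event end"):
    """Return True if the markers go start, end, start, end, ... with no gaps."""
    # Keep only the marker rows, dropping the '--' filler rows.
    markers = [e for e in events if e in (start, end)]
    # A well-formed sequence is just (start, end) repeated.
    expected = [start, end] * (len(markers) // 2)
    return markers == expected

good = pd.Series(["event start", "--", "event end", "--", "event start", "event end"])
bad = pd.Series(["event start", "--", "event start", "event end"])

print(markers_alternate(good))  # True
print(markers_alternate(bad))   # False
```

If the check fails, the `1 + np.floor(table.index / 2)` numbering would pair up the wrong rows.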
I can then `join` this with the original dataframe and `fillna` using `method='ffill'`:
data2 = data.join(table[['event number']])
data2['filled'] = data2['event number'].fillna(method='ffill')
event event number filled
2010-01-01 00:20:00 event start 1 1
2010-01-01 00:30:00 -- NaN 1
2010-01-01 00:40:00 -- NaN 1
2010-01-01 00:50:00 -- NaN 1
2010-01-01 01:00:00 -- NaN 1
2010-01-01 01:10:00 event end 1 1
2010-01-01 01:20:00 -- NaN 1 # <- d'oh
2010-01-01 02:20:00 -- NaN 1 # <- d'oh
2010-01-01 02:30:00 event start 2 2
2010-01-01 02:40:00 -- NaN 2
2010-01-01 02:50:00 -- NaN 2
2010-01-01 03:00:00 -- NaN 2
2010-01-01 03:10:00 -- NaN 2
2010-01-01 03:20:00 -- NaN 2
2010-01-01 03:30:00 event end 2 2
As you can see, the time between events (01:20 to 02:20) gets associated with event #1.
Is there any way to skip these parts without looping?
You can do this by looking at the cumulative sum of the number of `event start` rows and the number of `event end` rows:
>>> data['event number'] = (data.event == 'event start').cumsum()
>>> data
event event number
2010-01-01 00:20:00 event start 1
2010-01-01 00:30:00 -- 1
2010-01-01 00:40:00 -- 1
2010-01-01 00:50:00 -- 1
2010-01-01 01:00:00 -- 1
2010-01-01 01:10:00 event end 1
2010-01-01 01:20:00 -- 1
2010-01-01 02:20:00 -- 1
2010-01-01 02:30:00 event start 2
2010-01-01 02:40:00 -- 2
2010-01-01 02:50:00 -- 2
2010-01-01 03:00:00 -- 2
2010-01-01 03:10:00 -- 2
2010-01-01 03:20:00 -- 2
2010-01-01 03:30:00 event end 2
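To see why this works, it may help to look at the boolean mask on its own: `True` counts as 1 and `False` as 0, so the running total goes up by one at every `event start`. A small sketch on a made-up series (not the data from the question):

```python
import pandas as pd

# Toy series, for illustration only.
events = pd.Series(["--", "event start", "--", "event end", "--", "event start", "--"])

is_start = events == "event start"   # boolean mask, True on start rows
counter = is_start.cumsum()          # running count of starts seen so far

print(counter.tolist())  # [0, 1, 1, 1, 1, 2, 2]
```

Note that rows before the first `event start` get 0, and rows after an `event end` keep the previous event's number until they are masked out.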
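Now you just need to set NaN where there is no event; those are the rows where the cumulative sum of `event start` equals the cumulative sum of `event end` shifted by 1 row: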
>>> idx = data['event number'] == (data.event.shift(1) == 'event end').cumsum()
>>> data.loc[idx, 'event number'] = np.nan
>>> data
event event number
2010-01-01 00:20:00 event start 1
2010-01-01 00:30:00 -- 1
2010-01-01 00:40:00 -- 1
2010-01-01 00:50:00 -- 1
2010-01-01 01:00:00 -- 1
2010-01-01 01:10:00 event end 1
2010-01-01 01:20:00 -- NaN
2010-01-01 02:20:00 -- NaN
2010-01-01 02:30:00 event start 2
2010-01-01 02:40:00 -- 2
2010-01-01 02:50:00 -- 2
2010-01-01 03:00:00 -- 2
2010-01-01 03:10:00 -- 2
2010-01-01 03:20:00 -- 2
2010-01-01 03:30:00 event end 2
[15 rows x 2 columns]
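The two steps can also be folded into one helper. This is only a sketch: the function name `number_events` and the toy data are my own, and it uses `Series.where` with `started > ended` instead of assigning NaN through `.loc`, which gives the same result as the equality test above as long as the start/end markers alternate cleanly:

```python
import pandas as pd

def number_events(events, start="event start", end="event end"):
    """Number each start..end span 1, 2, ...; rows outside any span become NaN."""
    # Running count of events that have started so far.
    started = (events == start).cumsum()
    # Running count of events that ended on an earlier row; the shift keeps
    # the 'event end' row itself inside its event.
    ended = (events.shift(1) == end).cumsum()
    # Keep the number only while more events have started than have ended.
    return started.where(started > ended)

index = pd.date_range("2010-01-01 00:20", periods=8, freq="10min")
events = pd.Series(
    ["event start", "--", "event end", "--", "--", "event start", "event end", "--"],
    index=index,
)
numbers = number_events(events)
print(numbers.tolist())  # [1.0, 1.0, 1.0, nan, nan, 2.0, 2.0, nan]
```

With the dataframe from the question this would be used as `data['event number'] = number_events(data.event)`. The column comes back as float because NaN cannot be stored in an integer column.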