I'm working with financial data that is recorded at irregular intervals. Some of the timestamps are duplicates, which is making analysis tricky. Here is a sample of the data (note that there are four 2016-08-23 00:00:17.664193 timestamps):
In [167]: ts
Out[167]:
                               last  last_sz      bid      ask
datetime
2016-08-23 00:00:14.161128  2170.75        1  2170.75  2171.00
2016-08-23 00:00:14.901180  2171.00        1  2170.75  2171.00
2016-08-23 00:00:17.196639  2170.75        1  2170.75  2171.00
2016-08-23 00:00:17.664193  2171.00        1  2170.75  2171.00
2016-08-23 00:00:17.664193  2171.00        1  2170.75  2171.00
2016-08-23 00:00:17.664193  2171.00        2  2170.75  2171.00
2016-08-23 00:00:17.664193  2171.00        1  2170.75  2171.00
2016-08-23 00:00:26.206108  2170.75        2  2170.75  2171.00
2016-08-23 00:00:28.322456  2170.75        7  2170.75  2171.00
2016-08-23 00:00:28.322456  2170.75        1  2170.75  2171.00
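For reference, the sample frame above can be rebuilt with a snippet like the following (the values are copied straight from the listing):

import pandas as pd

# Rebuild the sample frame shown above
idx = pd.DatetimeIndex(['2016-08-23 00:00:14.161128',
                        '2016-08-23 00:00:14.901180',
                        '2016-08-23 00:00:17.196639'] +
                       ['2016-08-23 00:00:17.664193'] * 4 +
                       ['2016-08-23 00:00:26.206108'] +
                       ['2016-08-23 00:00:28.322456'] * 2,
                       name='datetime')
ts = pd.DataFrame({'last': [2170.75, 2171.00, 2170.75, 2171.00, 2171.00,
                            2171.00, 2171.00, 2170.75, 2170.75, 2170.75],
                   'last_sz': [1, 1, 1, 1, 1, 2, 1, 2, 7, 1],
                   'bid': 2170.75,    # constant over this sample
                   'ask': 2171.00},
                  index=idx, columns=['last', 'last_sz', 'bid', 'ask'])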
In this example there are only a few duplicates, but in some cases there are hundreds of consecutive rows that all share the same timestamp. My goal is to fix this by adding 1 extra nanosecond per duplicate: with a run of 4 identical timestamps, I would add 1 ns to the second, 2 ns to the third, and 3 ns to the fourth. For example, the data above would be transformed into:
In [169]: make_timestamps_unique(ts)
Out[169]:
                                  last  last_sz      bid     ask
newindex
2016-08-23 00:00:14.161128000  2170.75        1  2170.75  2171.0
2016-08-23 00:00:14.901180000  2171.00        1  2170.75  2171.0
2016-08-23 00:00:17.196639000  2170.75        1  2170.75  2171.0
2016-08-23 00:00:17.664193000  2171.00        1  2170.75  2171.0
2016-08-23 00:00:17.664193001  2171.00        1  2170.75  2171.0
2016-08-23 00:00:17.664193002  2171.00        2  2170.75  2171.0
2016-08-23 00:00:17.664193003  2171.00        1  2170.75  2171.0
2016-08-23 00:00:26.206108000  2170.75        2  2170.75  2171.0
2016-08-23 00:00:28.322456000  2170.75        7  2170.75  2171.0
2016-08-23 00:00:28.322456001  2170.75        1  2170.75  2171.0
I've been struggling to come up with a good way to do this. My current solution is to make multiple passes, checking for duplicates on each pass and adding 1 ns to every timestamp in a run of identical timestamps except the first. Here is the code:
import numpy as np
import pandas as pd

def make_timestamps_unique(ts):
    mask = ts.index.duplicated(keep='first')
    duplicate_count = np.sum(mask)
    passes = 0
    while duplicate_count > 0:
        # Shift every duplicate (all but the first in each run) by 1 ns,
        # then re-check; each run shrinks by one element per pass.
        ts.loc[:, 'newindex'] = ts.index
        ts.loc[mask, 'newindex'] += pd.Timedelta('1ns')
        ts = ts.set_index('newindex')
        mask = ts.index.duplicated(keep='first')
        duplicate_count = np.sum(mask)
        passes += 1
    print('%d passes of duplication loop' % passes)
    return ts
This is obviously very inefficient: it often takes hundreds of passes, and when I try it on a 2-million-row DataFrame I get a MemoryError. Any ideas for a better way to accomplish this?
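For what it's worth, the transformation described in the question can also be written as a single vectorized pass. Here is a minimal sketch, assuming a pandas version that supports DataFrame.groupby(level=0).cumcount(); it reproduces the example output above:

import pandas as pd

def make_timestamps_unique(ts):
    # Number each row within its run of identical timestamps:
    # 0 for the first occurrence, 1 for the second, and so on
    offsets = ts.groupby(level=0).cumcount().values
    # Shift the k-th occurrence of a timestamp forward by k nanoseconds
    out = ts.copy()
    out.index = out.index + pd.to_timedelta(offsets, unit='ns')
    out.index.name = 'newindex'
    return out

Because cumcount starts at 0, the first row of each run keeps its original timestamp, exactly as in the desired output.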
Here is a faster (but slightly less readable) numpy version, inspired by this post. The idea is to take a cumulative sum over the duplicated-timestamp flags while resetting the sum to zero every time a np.NaN is encountered:
import numpy as np

# Flag every duplicated timestamp (all occurrences) as 1.0 and every
# unique timestamp as NaN
values = df.index.duplicated(keep=False).astype(float)
values[values == 0] = np.NaN

# At each NaN, inject a negative jump that cancels the running total, so
# the cumulative sum restarts from zero after every unique timestamp
missings = np.isnan(values)
cumsum = np.cumsum(~missings)
diff = np.diff(np.concatenate(([0.], cumsum[missings])))
values[missings] = -diff

# The cumulative sum now counts 1, 2, 3, ... within each run of duplicates
# (note that, unlike the question's example, the first timestamp of a run
# is shifted by 1 ns as well); add that many nanoseconds to the index
result = df.index + np.cumsum(values).astype(np.timedelta64)
print(result)
DatetimeIndex([   '2016-08-23 00:00:14.161128',
                  '2016-08-23 00:00:14.901180',
                  '2016-08-23 00:00:17.196639',
               '2016-08-23 00:00:17.664193001',
               '2016-08-23 00:00:17.664193002',
               '2016-08-23 00:00:17.664193003',
               '2016-08-23 00:00:17.664193004',
                  '2016-08-23 00:00:26.206108',
               '2016-08-23 00:00:28.322456001',
               '2016-08-23 00:00:28.322456002'],
              dtype='datetime64[ns]', name='datetime', freq=None)
Timing this solution gives 10000 loops, best of 3: 107 µs per loop, while @DYZ's groupby/apply approach (more readable, though) is about 50x slower on the dummy data at 100 loops, best of 3: 5.3 ms per loop.
Of course, you have to reset the index at the end:
df.index = result
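If the reset-on-NaN trick looks opaque, here is the same idiom on a tiny hand-made array (the values are arbitrary, purely for illustration):

import numpy as np

# NaN marks positions where the running counter should reset to zero
values = np.array([np.nan, 1., 1., np.nan, 1., np.nan])
missings = np.isnan(values)
cumsum = np.cumsum(~missings)                 # running count of 1-flags
diff = np.diff(np.concatenate(([0.], cumsum[missings])))
values[missings] = -diff                      # negative jumps cancel the sum
print(np.cumsum(values))                      # [ 0.  1.  2.  0.  1.  0.]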
You can group the rows by index and then add a sequence of consecutive timedeltas to each group's index. I'm not sure whether this can be done directly on the index, but you can first convert the index to an ordinary column, operate on the column, and then set that column as the index again:
import numpy as np

# Within each group of identical timestamps, add 0 ns to the first row,
# 1 ns to the second, and so on
newindex = ts.reset_index()\
             .groupby('datetime')['datetime']\
             .apply(lambda x: x + np.arange(x.size).astype(np.timedelta64))
ts.index = newindex
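Whichever approach you pick, a quick sanity check confirms the duplicates are gone:

# After assigning the new index, no timestamp should repeat
assert ts.index.is_unique
print(ts.index.duplicated().sum())   # expected: 0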