Aggregating timestamps in pandas

chr*_*ock 1 python pandas

I want to find the maximum bid-ask spread for each second. Suppose I have this quotes file:

In [1]: !head quotes.txt
exchtime|bid|ask
1389178814.587758|520.0000|541.0000
1389178830.462050|540.4300|540.8700
1389178830.462050|540.4300|540.8700
1389178830.468602|540.4300|540.8600
1389178830.468602|540.4300|540.8600
1389178847.67500|540.4300|540.8500
1389178847.67500|540.4300|540.8500
1389178847.73541|540.4300|540.8400
1389178847.73541|540.4300|540.8400

The timestamps are just seconds since the UTC epoch. With a bit of trickery on the first column, I can read the file like this:

import pandas as pd
import numpy as np
from datetime import datetime

def convert(x): return np.datetime64(datetime.fromtimestamp(float(x)).isoformat())

df = pd.read_csv('quotes.txt', sep='|', parse_dates=True, converters={0:convert})

This gives me what I want:

In [10]: df.head()
Out[10]:
                    exchtime     bid     ask
0 2014-01-08 11:00:14.587758  520.00  541.00
1 2014-01-08 11:00:30.462050  540.43  540.87
2 2014-01-08 11:00:30.462050  540.43  540.87
3 2014-01-08 11:00:30.468602  540.43  540.86
4 2014-01-08 11:00:30.468602  540.43  540.86

I'm stuck on the aggregation. In q/kdb+, I would simply do the following:

select spread:max ask-bid by exchtime.second from df

What I came up with in pandas is:

df['spread'] = df.ask - df.bid
df['exchtime_sec'] = [e.replace(microsecond=0) for e in df.exchtime]
df.groupby('exchtime_sec')['spread'].agg(np.max)

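This seems to work, but the exchtime_sec line takes about three orders of magnitude longer to run than I expected! Is there a faster (and more concise) way to express this aggregation?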

Jef*_*eff 5

Read the file in like this, without using a converter for the timestamps:

In [11]: df = read_csv(StringIO(data),sep='|')

Converting the whole column at once is much faster:

In [12]: df['exchtime'] = pd.to_datetime(df['exchtime'],unit='s')

In [13]: df
Out[13]: 
                    exchtime     bid     ask
0 2014-01-08 11:00:14.587758  520.00  541.00
1 2014-01-08 11:00:30.462050  540.43  540.87
2 2014-01-08 11:00:30.462050  540.43  540.87
3 2014-01-08 11:00:30.468602  540.43  540.86
4 2014-01-08 11:00:30.468602  540.43  540.86
5 2014-01-08 11:00:47.675000  540.43  540.85
6 2014-01-08 11:00:47.675000  540.43  540.85
7 2014-01-08 11:00:47.735410  540.43  540.84
8 2014-01-08 11:00:47.735410  540.43  540.84

[9 rows x 3 columns]
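
For anyone who wants to reproduce the two steps above without the original file, here is a minimal sketch. It assumes a handful of rows from the question's quotes.txt inlined as a string (the `data` variable is my stand-in for the file contents), and it relies on `unit="s"` to tell `pd.to_datetime` that the floats are seconds since the epoch.

```python
import io

import pandas as pd

# A few rows from the question's quotes.txt, inlined so the example is self-contained.
data = """exchtime|bid|ask
1389178814.587758|520.0000|541.0000
1389178830.462050|540.4300|540.8700
1389178830.468602|540.4300|540.8600
1389178847.67500|540.4300|540.8500
1389178847.73541|540.4300|540.8400
"""

# Read the epoch column as plain floats; no per-row converter is needed.
df = pd.read_csv(io.StringIO(data), sep="|")

# Vectorized conversion of the whole column in one call.
df["exchtime"] = pd.to_datetime(df["exchtime"], unit="s")

print(df.dtypes)
print(df.head())
```

Note that the epoch values pass through a float64, so the sub-microsecond digits of the resulting timestamps may not be exact; that does not matter for per-second grouping.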

Create the spread column:

In [15]: df['spread'] = df.ask-df.bid

Set the index to exchtime, resample at 1-second intervals, and use max as the aggregator:

In [16]: df.set_index('exchtime').resample('1s',how=np.max)
Out[16]: 
                        bid     ask  spread
exchtime                                   
2014-01-08 11:00:14  520.00  541.00   21.00
2014-01-08 11:00:15     NaN     NaN     NaN
2014-01-08 11:00:16     NaN     NaN     NaN
2014-01-08 11:00:17     NaN     NaN     NaN
2014-01-08 11:00:18     NaN     NaN     NaN
2014-01-08 11:00:19     NaN     NaN     NaN
2014-01-08 11:00:20     NaN     NaN     NaN
2014-01-08 11:00:21     NaN     NaN     NaN
2014-01-08 11:00:22     NaN     NaN     NaN
2014-01-08 11:00:23     NaN     NaN     NaN
2014-01-08 11:00:24     NaN     NaN     NaN
2014-01-08 11:00:25     NaN     NaN     NaN
2014-01-08 11:00:26     NaN     NaN     NaN
2014-01-08 11:00:27     NaN     NaN     NaN
2014-01-08 11:00:28     NaN     NaN     NaN
2014-01-08 11:00:29     NaN     NaN     NaN
2014-01-08 11:00:30  540.43  540.87    0.44
2014-01-08 11:00:31     NaN     NaN     NaN
2014-01-08 11:00:32     NaN     NaN     NaN
2014-01-08 11:00:33     NaN     NaN     NaN
2014-01-08 11:00:34     NaN     NaN     NaN
2014-01-08 11:00:35     NaN     NaN     NaN
2014-01-08 11:00:36     NaN     NaN     NaN
2014-01-08 11:00:37     NaN     NaN     NaN
2014-01-08 11:00:38     NaN     NaN     NaN
2014-01-08 11:00:39     NaN     NaN     NaN
2014-01-08 11:00:40     NaN     NaN     NaN
2014-01-08 11:00:41     NaN     NaN     NaN
2014-01-08 11:00:42     NaN     NaN     NaN
2014-01-08 11:00:43     NaN     NaN     NaN
2014-01-08 11:00:44     NaN     NaN     NaN
2014-01-08 11:00:45     NaN     NaN     NaN
2014-01-08 11:00:46     NaN     NaN     NaN
2014-01-08 11:00:47  540.43  540.85    0.42

[34 rows x 3 columns]
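
As far as I know, newer pandas releases no longer accept the `how=` argument to `resample`, so the call above may need the method-chained form there. The sketch below shows that form on a hypothetical three-row miniature of the quotes table (the values are made up for illustration), and also drops the empty seconds, which the resample otherwise fills with NaN.

```python
import pandas as pd

# Hypothetical miniature of the quotes table: two quotes in second :14, one in :16.
df = pd.DataFrame(
    {
        "exchtime": pd.to_datetime(
            [
                "2014-01-08 11:00:14.100",
                "2014-01-08 11:00:14.900",
                "2014-01-08 11:00:16.500",
            ]
        ),
        "bid": [520.00, 540.43, 540.43],
        "ask": [541.00, 540.87, 540.86],
    }
)
df["spread"] = df["ask"] - df["bid"]

# Method-chained form: .resample(...).max() instead of resample(..., how=np.max).
per_second = df.set_index("exchtime")["spread"].resample("1s").max()

# Seconds with no quotes come back as NaN; dropna() keeps only the observed seconds.
observed = per_second.dropna()

print(per_second)
print(observed)
```

Whether to keep the NaN rows depends on what you do next: a regular one-second grid is convenient for plotting or joining, while the dropped version matches what the groupby in the question returns.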

Performance comparison

In [10]: df = DataFrame(np.random.randn(100000,2),index=date_range('20130101',periods=100000,freq='50U'))

In [7]: def f1(df):
   ...:     df = df.copy()
   ...:     df['seconds'] = [ e.replace(microsecond=0) for e in df.index ]
   ...:     df.groupby('seconds')[0].agg(np.max)
   ...:     

In [11]: def f2(df):
   ....:     df = df.copy()
   ....:     df.resample('1s',how=np.max)
   ....:     

In [8]: %timeit f1(df)
1 loops, best of 3: 692 ms per loop

In [12]: %timeit f2(df)
100 loops, best of 3: 2.36 ms per loop
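
The slow part of f1 is most likely the Python-level loop that calls `replace` once per timestamp. If you prefer to keep the groupby shape from the question, a vectorized truncation with `DatetimeIndex.floor` is one option. This is a sketch under my own assumptions (seeded random data, frequency passed as a `pd.Timedelta` rather than a string alias), not a benchmark.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
idx = pd.date_range(
    "2013-01-01", periods=100_000, freq=pd.Timedelta(microseconds=50)
)
df = pd.DataFrame(rng.standard_normal((100_000, 2)), index=idx)

# Truncate every timestamp to the whole second in one vectorized call,
# instead of calling .replace(microsecond=0) once per row in Python.
by_floor = df.groupby(df.index.floor("s"))[0].max()

# Same aggregation expressed as a resample, for comparison.
by_resample = df[0].resample("1s").max()

print(by_floor)
print(by_resample)
```

On this data the two results agree because every second in the range has at least one row; with gaps, the resample would additionally contain NaN rows for the empty seconds.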

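Here is another way to do this, which is faster for lower-frequency data. (high/low are equivalent to max/min; open is the first value and close is the last.)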

In [9]: df = DataFrame(np.random.randn(100000,2),index=date_range('20130101',periods=100000,freq='50L'))

In [10]: df.groupby(pd.TimeGrouper('1s'))[0].ohlc()
Out[10]: 
In [11]: %timeit df.groupby(pd.TimeGrouper('1s'))[0].ohlc()
1000 loops, best of 3: 1.2 ms per loop
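
To my knowledge `pd.TimeGrouper` was removed in later pandas releases, with `pd.Grouper(freq=...)` as the replacement. A minimal sketch of the same OHLC aggregation in that form, assuming seeded random data at 50 ms spacing (smaller than the frame above, purely to keep the example quick):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
idx = pd.date_range(
    "2013-01-01", periods=1_000, freq=pd.Timedelta(milliseconds=50)
)
df = pd.DataFrame(rng.standard_normal((1_000, 2)), index=idx)

# pd.Grouper(freq=...) bins the DatetimeIndex into fixed one-second buckets.
bars = df.groupby(pd.Grouper(freq="1s"))[0].ohlc()

print(bars.head())
```

The `high` column here is the per-second max, so for the spread question you would run this on the spread column and read off `high`.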