blo*_*nri 4 python numpy pandas
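I have more than 800,000 rows of data and want to compute an exponential moving average (EMA) of one of the columns. The timestamps are not evenly sampled, and I want to decay the EMA on every update (row). My code looks like this: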
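import numpy
import pandas

def ema(series, window=5):  # wrapper name and signature assumed; the original snippet omits them
    result = numpy.empty(len(series))
    result[0] = series['midpoint'].iloc[0]  # assumed seed; the original does not show it
    for i in range(1, len(series)):
        dt = series['datetime'][i] - series['datetime'][i - 1]
        decay = 1 - numpy.exp(-dt / window)
        result[i] = (1 - decay) * result[i - 1] + decay * series['midpoint'].iloc[i]
    return pandas.Series(result, index=series.index)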
The problem is that this is very slow for 800,000 rows. Is there a way to optimize it using other NumPy features? I can't vectorize it because result[i] depends on result[i - 1].
Sample data:
Timestamp Midpoint
1559655000001096130 2769.125
1559655000001162260 2769.127
1559655000001171688 2769.154
1559655000001408734 2769.138
1559655000001424200 2769.123
1559655000001433128 2769.110
1559655000001541560 2769.125
1559655000001640406 2769.125
1559655000001658436 2769.127
1559655000001755924 2769.129
1559655000001793266 2769.125
1559655000001878688 2769.143
1559655000002061024 2769.125
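How about something like the following? It takes me 0.34 seconds to run on an irregularly spaced series with 900k rows. I am assuming that a window of 5 means a 5-day span.
First, let's create some sample data.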
import numpy as np
import pandas as pd

# Create sample data for a price stream of 2.6m price observations sampled 1 second apart.
seconds_per_day = 60 * 60 * 24 # 60 seconds / minute * 60 minutes / hour * 24 hours / day
starting_value = 100
annualized_vol = .3
sampling_percentage = .35 # 35%
start_date = '2018-12-01'
end_date = '2018-12-31'
np.random.seed(0)
idx = pd.date_range(start=start_date, end=end_date, freq='s') # One second intervals.
periodic_vol = annualized_vol * (1/ 252 / seconds_per_day) ** 0.5
daily_returns = np.random.randn(len(idx)) * periodic_vol
cumulative_indexed_return = (1 + daily_returns).cumprod() * starting_value
index_level = pd.Series(cumulative_indexed_return, index=idx)
# Sample 35% of the simulated prices to create a time series of 907k rows with irregular time intervals.
s = index_level.sample(frac=sampling_percentage).sort_index()
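Now let's create a generator function that keeps track of the latest value of the exponentially weighted series. It can run roughly 4x faster if you install numba, import it, and add a single decorator line, @jit(nopython=True), above the function definition.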
from numba import jit  # Optional, see below.

@jit(nopython=True)  # Optional, see below.
def ewma_generator(vals, decay_vals):
    # Seed with the first observation, then apply the recurrence row by row.
    result = vals[0]
    yield result
    for val, decay in zip(vals[1:], decay_vals[1:]):
        result = result * (1 - decay) + val * decay
        yield result
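As an alternative to a generator, here is a minimal sketch of the same recurrence written as an ordinary function that fills a preallocated NumPy array. The name ewma_array is hypothetical and not part of the answer above. It runs in plain Python; whether numba's decorator gives it a similar speed-up has not been measured here.

```python
import numpy as np

def ewma_array(vals, decay_vals):
    """Hypothetical array-returning variant of the generator above."""
    vals = np.asarray(vals, dtype=np.float64)
    decay_vals = np.asarray(decay_vals, dtype=np.float64)
    out = np.empty_like(vals)
    if out.size == 0:
        return out
    # Seed with the first observation; decay_vals[0] is never read.
    out[0] = vals[0]
    for i in range(1, vals.size):
        out[i] = out[i - 1] * (1.0 - decay_vals[i]) + vals[i] * decay_vals[i]
    return out

# Tiny made-up example: a decay of 0.5 averages the previous EMA with the new value.
vals = np.array([10.0, 20.0, 30.0])
decay_vals = np.array([np.nan, 0.5, 0.5])
smoothed = ewma_array(vals, decay_vals)
```

Returning an array directly avoids the list comprehension over next(g) that a generator needs on the calling side.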
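Now let's run this generator on the irregularly spaced series s. For this sample with 900k rows, the following code takes me 1.2 seconds to run. By optionally using numba's just-in-time compiler, I can reduce the execution time further to 0.34 seconds. You first need to install the package, e.g. conda install numba. Note that I use a list comprehension to collect the EWMA values from the generator, then assign them alongside the original series after first converting it to a DataFrame.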
# Assumes time series data is now named `s`.
window = 5 # Span of 5 days?
dt = pd.Series(s.index).diff().dt.total_seconds().div(seconds_per_day) # Measured in days.
decay = (1 - (dt / -window).apply(np.exp))
g = ewma_generator(s.values, decay.values)
result = s.to_frame('midpoint').assign(
    ewma=pd.Series([next(g) for _ in range(len(s))], index=s.index))
>>> result.tail()
midpoint ewma
2018-12-30 23:59:45 103.894471 105.546004
2018-12-30 23:59:49 103.914077 105.545929
2018-12-30 23:59:50 103.901910 105.545910
2018-12-30 23:59:53 103.913476 105.545853
2018-12-31 00:00:00 103.910422 105.545720
>>> result.shape
(907200, 2)
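The decay step can be checked on a tiny hand-made series. This is a minimal sketch, assuming (as the answer does) that window is measured in days. It calls np.exp once on the whole array instead of going through .apply. The three timestamps and prices are made up for illustration.

```python
import numpy as np
import pandas as pd

seconds_per_day = 60 * 60 * 24
window = 5  # Assumed to be a span in days.

# Made-up observations that are 0.5 days and then 2 days apart.
idx = pd.to_datetime(['2018-12-01 00:00:00',
                      '2018-12-01 12:00:00',
                      '2018-12-03 12:00:00'])
s = pd.Series([100.0, 101.0, 99.0], index=idx)

# Elapsed time between consecutive rows, in days; the first entry is NaN.
dt_days = s.index.to_series().diff().dt.total_seconds().to_numpy() / seconds_per_day

# Per-row weight of the new observation: larger gaps give larger weights.
decay = 1.0 - np.exp(-dt_days / window)
```

The first decay value is NaN because the first row has no predecessor. The generator skips it by starting from index 1.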
To make sure the numbers match our intuition, let's visualize the result sampled hourly. This looks good to me.
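obs_per_day = 24  # 24 hourly observations per day.
step = int(seconds_per_day / obs_per_day)
# Note: s keeps about 35% of the one-second rows, so a step of 3600 rows
# spans roughly 2.9 hours on average rather than exactly one hour.
>>> result.iloc[::step, :].plot()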
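The slice above steps by row count. If sampling by clock time is preferred, one option is pandas' resample. The sketch below uses a small made-up frame in place of result and takes the last observation in each one-hour bin. The column values are illustrative only.

```python
import numpy as np
import pandas as pd

# Stand-in for `result`: six made-up rows spaced 25 minutes apart.
idx = pd.date_range('2018-12-01', periods=6, freq='25min')
result = pd.DataFrame({'midpoint': np.arange(6.0),
                       'ewma': np.arange(6.0) / 2},
                      index=idx)

# Keep the last row observed within each 60-minute bin.
hourly = result.resample('60min').last()

# hourly.plot() would then draw one point per hour (requires matplotlib).
```

Bins that contain no observations come back as NaN rows, which .dropna() can remove before plotting.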