Fast EMA calculation on a large dataset with irregular time intervals

blo*_*nri 4 python numpy pandas

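I have data with more than 800,000 rows. I want to take an exponential moving average (EMA) of one of the columns. The times are not evenly sampled, and I want to decay the EMA on each update (row). My code looks like this: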

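import numpy
import pandas

def slow_ema(series, window=5):
    # Assumes numeric timestamps in 'datetime'; seeding with the first value is an assumption.
    result = numpy.empty(len(series))
    result[0] = series['midpoint'].iloc[0]
    for i in range(1, len(series)):
        dt = series['datetime'].iloc[i] - series['datetime'].iloc[i - 1]
        decay = 1 - numpy.exp(-dt / window)
        result[i] = (1 - decay) * result[i - 1] + decay * series['midpoint'].iloc[i]
    return pandas.Series(result, index=series.index)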

The problem is that this is very slow for 800,000 rows. Is there any way to optimize it using other NumPy features? I cannot vectorize it because result[i] depends on result[i-1].

Sample data is here:

Timestamp             Midpoint
1559655000001096130    2769.125
1559655000001162260    2769.127
1559655000001171688    2769.154
1559655000001408734    2769.138
1559655000001424200    2769.123
1559655000001433128    2769.110
1559655000001541560    2769.125
1559655000001640406    2769.125
1559655000001658436    2769.127
1559655000001755924    2769.129
1559655000001793266    2769.125
1559655000001878688    2769.143
1559655000002061024    2769.125

Ale*_*der 5

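How about something like the following, which takes me 0.34 seconds to run on a series of irregularly spaced data with 900k rows? I am assuming that a window of 5 means a span of 5 days.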

First, let's create some sample data.

import numpy as np
import pandas as pd

# Create sample data for a price stream of 2.6m price observations sampled 1 second apart.
seconds_per_day = 60 * 60 * 24  # 60 seconds / minute * 60 minutes / hour * 24 hours / day
starting_value = 100
annualized_vol = .3
sampling_percentage = .35  # 35%
start_date = '2018-12-01'
end_date = '2018-12-31'

np.random.seed(0)
idx = pd.date_range(start=start_date, end=end_date, freq='s')  # One second intervals.
periodic_vol = annualized_vol * (1/ 252 / seconds_per_day) ** 0.5
daily_returns = np.random.randn(len(idx)) * periodic_vol
cumulative_indexed_return = (1 + daily_returns).cumprod() * starting_value
index_level = pd.Series(cumulative_indexed_return, index=idx)

# Sample 35% of the simulated prices to create a time series of 907k rows with irregular time intervals.
s = index_level.sample(frac=sampling_percentage).sort_index()

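Now let's create a generator function that keeps track of the latest value of the exponentially weighted time series. It can run roughly 4x faster if you install numba, import it, and add the single decorator line @jit(nopython=True) above the function definition.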

from numba import jit  # Optional, see below.

@jit(nopython=True)  # Optional, see below.
def ewma_generator(vals, decay_vals):
    result = vals[0]
    yield result
    for val, decay in zip(vals[1:], decay_vals[1:]):
        result = result * (1 - decay) + val * decay
        yield result
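As an alternative to the generator, the same recurrence can be written as a plain loop that fills a preallocated NumPy array, so the results do not have to be collected one next() call at a time. This is only a minimal sketch: the name ewma_array is hypothetical and not part of the answer above, and like the generator it never reads decay_vals[0].

```python
import numpy as np

def ewma_array(vals, decay_vals):
    """Hypothetical array-returning variant of the generator's recurrence."""
    out = np.empty(len(vals), dtype=np.float64)
    out[0] = vals[0]  # Seed with the first observation, as the generator does.
    for i in range(1, len(vals)):
        # decay_vals[0] is never used, so a leading NaN is harmless.
        out[i] = out[i - 1] * (1 - decay_vals[i]) + vals[i] * decay_vals[i]
    return out

# Tiny worked example: with a decay of 0.5, each step averages the
# previous result with the new value, giving 1.0, 1.5, 2.25.
demo = ewma_array(np.array([1.0, 2.0, 3.0]), np.array([np.nan, 0.5, 0.5]))
print(demo)
```

A loop of this shape over float arrays is the kind of code numba's nopython mode is designed for, so the same optional decorator should apply here as well, though I have not timed this variant.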

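Now let's run this generator on the irregularly spaced series s. For this sample with about 900k rows, the following code takes me 1.2 seconds to run. By optionally using numba's just-in-time compiler, I can further reduce the execution time to 0.34 seconds. You first need to install the package, e.g. conda install numba. Note that I use a list comprehension to collect the ewma values from the generator, and then assign those values alongside the original series after first converting it to a dataframe.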

# Assumes time series data is now named `s`.
window = 5  # Span of 5 days?
dt = pd.Series(s.index).diff().dt.total_seconds().div(seconds_per_day)  # Measured in days.
decay = (1 - (dt / -window).apply(np.exp))
g = ewma_generator(s.values, decay.values)
result = s.to_frame('midpoint').assign(
    ewma=pd.Series([next(g) for _ in range(len(s))], index=s.index))

>>> result.tail()
                       midpoint        ewma
2018-12-30 23:59:45  103.894471  105.546004
2018-12-30 23:59:49  103.914077  105.545929
2018-12-30 23:59:50  103.901910  105.545910
2018-12-30 23:59:53  103.913476  105.545853
2018-12-31 00:00:00  103.910422  105.545720

>>> result.shape
(907200, 2)
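On the question's point that the loop cannot be vectorized: because the decay is exponential in elapsed time, multiplying each result by exp(t / window) turns the recurrence into a running sum, which np.cumsum can compute. The sketch below is a swapped-in technique, not the method used above; ewma_vectorized is a hypothetical name, and it assumes sorted times in the same units as window and a total time range short enough, relative to window, that np.exp does not overflow.

```python
import numpy as np

def ewma_vectorized(times, vals, window):
    """Loop-free EMA with per-row decay 1 - exp(-dt / window).

    Assumption: (times[-1] - times[0]) / window stays well below ~700,
    otherwise np.exp(t) overflows float64 and the loop version is safer.
    """
    t = (np.asarray(times, dtype=np.float64) - times[0]) / window
    v = np.asarray(vals, dtype=np.float64)
    decay = np.empty_like(t)
    decay[0] = 1.0  # Makes the first output equal to vals[0].
    decay[1:] = 1.0 - np.exp(-np.diff(t))
    # With s_i = r_i * exp(t_i), the recurrence becomes
    # s_i = s_{i-1} + decay_i * v_i * exp(t_i), i.e. a cumulative sum.
    scaled = np.cumsum(decay * v * np.exp(t))
    return scaled * np.exp(-t)
```

For the series s above, times in days could be taken as (s.index - s.index[0]).total_seconds() / seconds_per_day; a 30-day range with a 5-day window keeps the exponent near 6, well inside the safe range. For nanosecond timestamps with a very short window, the exponent can grow far too large and the explicit loop is the safer choice.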

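To make sure the numbers match our intuition, let's visualize a downsampled view of the result. Note that taking every 3,600th row is only roughly "hourly": because just 35% of the seconds were kept, the rows plotted are closer to three hours apart on average. This looks good to me.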

obs_per_day = 24  # 24 hourly observations per day.
step = int(seconds_per_day / obs_per_day)
>>> result.iloc[::step, :].plot()
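If the goal is one point per clock hour regardless of how many rows fall in each hour, a time-based resample may be a better fit than a fixed row step. The sketch below uses a small hypothetical stand-in frame (demo) rather than the result frame above, so that it runs on its own; the column name and sizes are illustrative assumptions.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for `result`: 30,000 irregular timestamps within one day.
rng = np.random.default_rng(0)
idx = pd.date_range('2018-12-01', periods=86_400, freq=pd.Timedelta(seconds=1))
keep = np.sort(rng.choice(len(idx), size=30_000, replace=False))
demo = pd.DataFrame({'midpoint': rng.normal(100, 1, size=keep.size)}, index=idx[keep])

# Take the last observation inside each one-hour bucket.
hourly = demo.resample(pd.Timedelta(hours=1)).last()
```

Calling hourly.plot() would then draw one point per hour; the same resample call should work on result, since it also has a DatetimeIndex.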

[Figure: line plot of the midpoint and ewma columns]