如何在python中创建时间序列数据的线性回归预测

use*_*980 23 python

我需要能够基于线性回归模型创建一个用于预测的python函数,并在时间序列数据上使用置信带:

该函数需要一个参数来指定预测的距离.例如1天,7天,30天,90天等.根据参数,它将需要创建带有置信区段的Holt-Winters预测:

我的时间序列数据如下所示:

print series

[{"target": "average", "datapoints": [[null, 1435688679], [34.870499801635745, 1435688694], [null, 1435688709], [null, 1435688724], [null, 1435688739], [null, 1435688754], [null, 1435688769], [null, 1435688784], [null, 1435688799], [null, 1435688814], [null, 1435688829], [null, 1435688844], [null, 1435688859], [null, 1435688874], [null, 1435688889], [null, 1435688904], [null, 1435688919], [null, 1435688934], [null, 1435688949], [null, 1435688964], [null, 1435688979], [38.180000209808348, 1435688994], [null, 1435689009], [null, 1435689024], [null, 1435689039], [null, 1435689054], [null, 1435689069], [null, 1435689084], [null, 1435689099], [null, 1435689114], [null, 1435689129], [null, 1435689144], [null, 1435689159], [null, 1435689174], [null, 1435689189], [null, 1435689204], [null, 1435689219], [null, 1435689234], [null, 1435689249], [null, 1435689264], [null, 1435689279], [30.79849989414215, 1435689294], [null, 1435689309], [null, 1435689324], [null, 1435689339], [null, 1435689354], [null, 1435689369], [null, 1435689384], [null, 1435689399], [null, 1435689414], [null, 1435689429], [null, 1435689444], [null, 1435689459], [null, 1435689474], [null, 1435689489], [null, 1435689504], [null, 1435689519], [null, 1435689534], [null, 1435689549], [null, 1435689564]]}]
Run Code Online (Sandbox Code Playgroud)

该函数应将预测值附加到上述称为"系列"的时间序列数据并返回系列:

[{"target": "average", "datapoints": [[null, 1435688679], [34.870499801635745, 1435688694], [null, 1435688709], [null, 1435688724], [null, 1435688739], [null, 1435688754], [null, 1435688769], [null, 1435688784], [null, 1435688799], [null, 1435688814], [null, 1435688829], [null, 1435688844], [null, 1435688859], [null, 1435688874], [null, 1435688889], [null, 1435688904], [null, 1435688919], [null, 1435688934], [null, 1435688949], [null, 1435688964], [null, 1435688979], [38.180000209808348, 1435688994], [null, 1435689009], [null, 1435689024], [null, 1435689039], [null, 1435689054], [null, 1435689069], [null, 1435689084], [null, 1435689099], [null, 1435689114], [null, 1435689129], [null, 1435689144], [null, 1435689159], [null, 1435689174], [null, 1435689189], [null, 1435689204], [null, 1435689219], [null, 1435689234], [null, 1435689249], [null, 1435689264], [null, 1435689279], [30.79849989414215, 1435689294], [null, 1435689309], [null, 1435689324], [null, 1435689339], [null, 1435689354], [null, 1435689369], [null, 1435689384], [null, 1435689399], [null, 1435689414], [null, 1435689429], [null, 1435689444], [null, 1435689459], [null, 1435689474], [null, 1435689489], [null, 1435689504], [null, 1435689519], [null, 1435689534], [null, 1435689549], [null, 1435689564]]},{"target": "Forecast", "datapoints": [[186.77999925613403, 1435520801], [178.95000147819519, 1435521131]]},{"target": "Upper", "datapoints": [[186.77999925613403, 1435520801], [178.95000147819519, 1435521131]]},{"target": "Lower", "datapoints": [[186.77999925613403, 1435520801], [178.95000147819519, 1435521131]]}]
Run Code Online (Sandbox Code Playgroud)

有没有人在python中做过这样的事情?任何想法如何开始?

J R*_*ape 26

在您的问题文本中,您清楚地表明您希望回归输出的上限和下限以及输出预测.您还提到使用Holt-Winters算法进行预测.

其他sklearn回答者建议的软件包很有用,但你可能会注意到LinearRegression并没有给你提供"开箱即用"的错误界限,statsmodels 现在不提供Holt-Winters.

因此,我建议尝试使用Holt-Winters的这种实现.不幸的是它的许可证还不清楚,所以我不能在这里完整地复制它.现在,我不确定你是否真的想要Holt-Winters(季节性)预测,或Holt的线性指数平滑算法.我猜测后者给出了帖子的标题.因此,您可以使用linear()链接库的功能.这里对感兴趣的读者详细描述了该技术.

为了不提供仅链接的答案 - 我将在这里描述主要功能.定义了一个获取数据的函数,即

 def linear(x, fc, alpha = None, beta = None):
Run Code Online (Sandbox Code Playgroud)

x是要拟合的数据,fc是你想要预测的时间步数,alpha和beta采用他们通常的Holt-Winters含义:大致是一个参数来分别控制平滑到"水平"和"趋势"的数量.如果alpha或未beta指定,则使用它们来估计它们scipy.optimize.fmin_l_bfgs_b 以最小化RMSE.

该函数通过循环现有数据点简单地应用Holt算法,然后返回预测,如下所示:

 return Y[-fc:], alpha, beta, rmse
Run Code Online (Sandbox Code Playgroud)

其中Y[-fc:]是预测点,alpha并且beta是实际使用,并且值rmse是均方根误差.不幸的是,正如您所看到的,没有上限或下限置信区间.顺便说一下 - 我们应该将它们称为预测间隔.

预测区间数学

Holt的算法和Holt-Winters算法是指数平滑技术,找到由它们生成的预测的置信区间是一个棘手的主题.它们被称为"经验法则"方法,并且在Holt-Winters乘法算法的情况下,没有"基础统计模型".但是,本页最后一个脚注声称:

通过将它们视为ARIMA模型的特殊情况,可以计算由指数平滑模型产生的长期预测周围的置信区间.(注意:并非所有软件都能正确计算这些模型的置信区间.)置信区间的宽度取决于(i)模型的RMS误差,(ii)平滑类型(简单或线性); (iii)平滑常数的值(s); (iv)您预测的未来期数.通常,随着α在SES模型中变大,间隔扩散得更快,并且当使用线性而不是简单平滑时,它们扩散得更快.

我们在这里看到ARIMA(0,2,2)模型等效于具有加性误差的Holt线性模型

预测间隔代码(即如何继续)

您在评论中表明您"可以轻松地在R中执行此操作".我想你可能习惯holt()forecast包中提供的功能 R,因此期望这样的间隔.在这种情况下 - 您可以调整python库,以便在相同的基础上将它们提供给您.

查看R holt代码,我们可以看到它返回一个基于的对象forecast(ets(...).在引擎盖下 - 这最终调用了 这个函数class1,它返回一个均值mu和方差 var(以及cj我必须承认我不理解的).方差用于计算此处的上限和下限.

要在Python中做类似的事情 - 我们需要生成类似于class1R函数的东西来估计每个预测的方差.此函数获取模型拟合中发现的残差,并在每个时间步长乘以一个因子,以得到该时间步长的方差. 在线性Holt算法的特定情况下,因子是alpha + k*beta 其中k是时间步长预测的数量的累积和.在每个预测点处获得该方差后,将错误视为正态分布,并从正态分布中获取X%值.

这里有一个想法如何在Python中执行此操作(使用我链接的代码作为线性函数)

#Copy, import or reimplement the RMSE and linear function from
#https://gist.github.com/andrequeiroz/5888967

#factor in case there are not 1 timestep per day - in your case
#assuming the timesteps are UTC epoch - I think they're 5 min
# spaced i.e. 288 per day
timesteps_per_day = 288

# Note - big assumption here - your known data will be regular in time
# i.e. timesteps_per_day observations per day.  From the timestamps this seems valid.
# if you can't guarantee that - you'll need to interpolate the data
def holt_predict(data, timestamps, forecast_days, pred_error_level = 0.95):
    forecast_timesteps = forecast_days*timesteps_per_day
    middle_predictions, alpha, beta, rmse = linear(data,int(forecast_timesteps))
    cum_error = [beta+alpha]
    for k in range(1,forecast_timesteps):
        cum_error.append(cum_error[k-1] + k*beta + alpha)

    cum_error = np.array(cum_error)
    #Use some numpy multiplication to get the intervals
    var = cum_error * rmse**2
    # find the correct ppf on the normal distribution (two-sided)
    p = abs(scipy.stats.norm.ppf((1-pred_error_level)/2))
    interval = np.sqrt(var) * p
    upper = middle_predictions + interval
    lower = middle_predictions - interval
    fcast_timestamps = [timestamps[-1] + i * 86400 / timesteps_per_day for i in range(forecast_timesteps)]

    ret_value = []

    ret_value.append({'target':'Forecast','datapoints': zip(middle_predictions, fcast_timestamps)})
    ret_value.append({'target':'Upper','datapoints':zip(upper,fcast_timestamps)})
    ret_value.append({'target':'Lower','datapoints':zip(lower,fcast_timestamps)})
    return ret_value

if __name__=='__main__':
    import numpy as np
    import scipy.stats
    from math import sqrt

    null = None
    data_in = [{"target": "average", "datapoints": [[null, 1435688679],
    [34.870499801635745, 1435688694], [null, 1435688709], [null,
    1435688724], [null, 1435688739], [null, 1435688754], [null, 1435688769],
    [null, 1435688784], [null, 1435688799], [null, 1435688814], [null,
    1435688829], [null, 1435688844], [null, 1435688859], [null, 1435688874],
    [null, 1435688889], [null, 1435688904], [null, 1435688919], [null,
    1435688934], [null, 1435688949], [null, 1435688964], [null, 1435688979],
    [38.180000209808348, 1435688994], [null, 1435689009], [null,
    1435689024], [null, 1435689039], [null, 1435689054], [null, 1435689069],
    [null, 1435689084], [null, 1435689099], [null, 1435689114], [null,
    1435689129], [null, 1435689144], [null, 1435689159], [null, 1435689174],
    [null, 1435689189], [null, 1435689204], [null, 1435689219], [null,
    1435689234], [null, 1435689249], [null, 1435689264], [null, 1435689279],
    [30.79849989414215, 1435689294], [null, 1435689309], [null, 1435689324],
    [null, 1435689339], [null, 1435689354], [null, 1435689369], [null,
    1435689384], [null, 1435689399], [null, 1435689414], [null, 1435689429],
    [null, 1435689444], [null, 1435689459], [null, 1435689474], [null,
    1435689489], [null, 1435689504], [null, 1435689519], [null, 1435689534],
    [null, 1435689549], [null, 1435689564]]}]

    #translate the data.  There may be better ways if you're
    #prepared to use pandas / input data is proper json
    time_series = data_in[0]["datapoints"]

    epoch_in = []
    Y_observed = []

    for (y,x) in time_series:
        if y and x:
            epoch_in.append(x)
            Y_observed.append(y)

    #Pass in the number of days to forecast
    fcast_days = 30
    res = holt_predict(Y_observed,epoch_in,fcast_days)
    data_out = data_in + res
    #data_out now holds the data as you wanted it.

    #Optional plot of results
    import matplotlib.pyplot as plt
    plt.plot(epoch_in,Y_observed)
    m,tstamps = zip(*res[0]['datapoints'])
    u,tstamps = zip(*res[1]['datapoints'])
    l,tstamps = zip(*res[2]['datapoints'])
    plt.plot(tstamps,u, label='upper')
    plt.plot(tstamps,l, label='lower')
    plt.plot(tstamps,m, label='mean')
    plt.show()
Run Code Online (Sandbox Code Playgroud)

NB我给出的输出tuple在您的对象中添加了类型的点.如果你真的需要list,然后更换zip(upper,fcast_timestamps)map(list,zip(upper,fcast_timestamps))其中代码添加upper,lowerForecasthttp://stardict.sourceforge.net/Dictionaries.php下载到的结果.

此代码适用于Holt线性算法的特定情况 - 它不是计算正确预测间隔的通用方法.

重要的提示

您的样本输入数据似乎有很多null且只有3个真正的数据点.这不太可能成为进行时间序列预测的良好基础 - 特别是因为它们似乎都在15分钟内,并且您试图预测长达3个月!确实 - 如果你把这些数据输入R holt(),它会说:

You've got to be joking. I need more data!

我假设你有一个更大的数据集来测试.我在2015年的股市开盘价上尝试了上述代码,似乎给出了合理的结果(见下文).

代码用于2015年股市开盘价

您可能认为预测间隔看起来有点宽.来自R预测模块的作者的这篇博客暗示这是有意的,但是:

"均值的置信区间比预测区间窄得多"


par*_*par 5

Scikit是python的一个很好的模块

X和Y变量必须分成两个数组,如果你要绘制它们(X,Y),一个索引将与另一个数组匹配.

所以我想在你的时间序列数据中你会将X和Y值分开如下:

null = None
time_series = [{"target": "average", "datapoints": [[null, 1435688679], [34.870499801635745, 1435688694], [null, 1435688709], [null, 1435688724], [null, 1435688739], [null, 1435688754], [null, 1435688769], [null, 1435688784], [null, 1435688799], [null, 1435688814], [null, 1435688829], [null, 1435688844], [null, 1435688859], [null, 1435688874], [null, 1435688889], [null, 1435688904], [null, 1435688919], [null, 1435688934], [null, 1435688949], [null, 1435688964], [null, 1435688979], [38.180000209808348, 1435688994], [null, 1435689009], [null, 1435689024], [null, 1435689039], [null, 1435689054], [null, 1435689069], [null, 1435689084], [null, 1435689099], [null, 1435689114], [null, 1435689129], [null, 1435689144], [null, 1435689159], [null, 1435689174], [null, 1435689189], [null, 1435689204], [null, 1435689219], [null, 1435689234], [null, 1435689249], [null, 1435689264], [null, 1435689279], [30.79849989414215, 1435689294], [null, 1435689309], [null, 1435689324], [null, 1435689339], [null, 1435689354], [null, 1435689369], [null, 1435689384], [null, 1435689399], [null, 1435689414], [null, 1435689429], [null, 1435689444], [null, 1435689459], [null, 1435689474], [null, 1435689489], [null, 1435689504], [null, 1435689519], [null, 1435689534], [null, 1435689549], [null, 1435689564]]}]

 # assuming the time series is the X axis value

input_X_vars = []
input_Y_vars = []

for pair in time_series[0]["datapoints"]:
    if (pair[0] != None and pair[1] != None):
        input_X_vars.append(pair[1])
        input_Y_vars.append(pair[0])



import matplotlib.pyplot as plt
import numpy as np
from sklearn import datasets, linear_model

regr = linear_model.LinearRegression()
regr.fit(input_X_vars, input_Y_vars)

test_X_vars = [1435688681, 1435688685]

results = regr.predict(test_X_vars)

forecast_append = {"target": "Lower", "datapoints": results}

time_series.append(forecast_append)
Run Code Online (Sandbox Code Playgroud)

我将null设置为None,因为'null'关键字被解析为python中的一个变量.


dno*_*zay 2

注意:这不是详细的规范答案,而是对适用于您的领域(统计模型)的可用库的引用。

对于Python,你可以使用:

有一些不错的文章: