我需要能够基于线性回归模型创建一个用于预测的python函数,并在时间序列数据上使用置信带:
该函数需要一个参数来指定预测的距离.例如1天,7天,30天,90天等.根据参数,它将需要创建带有置信区段的Holt-Winters预测:
我的时间序列数据如下所示:
print series
[{"target": "average", "datapoints": [[null, 1435688679], [34.870499801635745, 1435688694], [null, 1435688709], [null, 1435688724], [null, 1435688739], [null, 1435688754], [null, 1435688769], [null, 1435688784], [null, 1435688799], [null, 1435688814], [null, 1435688829], [null, 1435688844], [null, 1435688859], [null, 1435688874], [null, 1435688889], [null, 1435688904], [null, 1435688919], [null, 1435688934], [null, 1435688949], [null, 1435688964], [null, 1435688979], [38.180000209808348, 1435688994], [null, 1435689009], [null, 1435689024], [null, 1435689039], [null, 1435689054], [null, 1435689069], [null, 1435689084], [null, 1435689099], [null, 1435689114], [null, 1435689129], [null, 1435689144], [null, 1435689159], [null, 1435689174], [null, 1435689189], [null, 1435689204], [null, 1435689219], [null, 1435689234], [null, 1435689249], [null, 1435689264], [null, 1435689279], [30.79849989414215, 1435689294], [null, 1435689309], [null, 1435689324], [null, 1435689339], [null, 1435689354], [null, 1435689369], [null, 1435689384], [null, 1435689399], [null, 1435689414], [null, 1435689429], [null, 1435689444], [null, 1435689459], [null, 1435689474], [null, 1435689489], [null, 1435689504], [null, 1435689519], [null, 1435689534], [null, 1435689549], [null, 1435689564]]}]
Run Code Online (Sandbox Code Playgroud)
该函数应将预测值附加到上述称为"系列"的时间序列数据并返回系列:
[{"target": "average", "datapoints": [[null, 1435688679], [34.870499801635745, 1435688694], [null, 1435688709], [null, 1435688724], [null, 1435688739], [null, 1435688754], [null, 1435688769], [null, 1435688784], [null, 1435688799], [null, 1435688814], [null, 1435688829], [null, 1435688844], [null, 1435688859], [null, 1435688874], [null, 1435688889], [null, 1435688904], [null, 1435688919], [null, 1435688934], [null, 1435688949], [null, 1435688964], [null, 1435688979], [38.180000209808348, 1435688994], [null, 1435689009], [null, 1435689024], [null, 1435689039], [null, 1435689054], [null, 1435689069], [null, 1435689084], [null, 1435689099], [null, 1435689114], [null, 1435689129], [null, 1435689144], [null, 1435689159], [null, 1435689174], [null, 1435689189], [null, 1435689204], [null, 1435689219], [null, 1435689234], [null, 1435689249], [null, 1435689264], [null, 1435689279], [30.79849989414215, 1435689294], [null, 1435689309], [null, 1435689324], [null, 1435689339], [null, 1435689354], [null, 1435689369], [null, 1435689384], [null, 1435689399], [null, 1435689414], [null, 1435689429], [null, 1435689444], [null, 1435689459], [null, 1435689474], [null, 1435689489], [null, 1435689504], [null, 1435689519], [null, 1435689534], [null, 1435689549], [null, 1435689564]]},{"target": "Forecast", "datapoints": [[186.77999925613403, 1435520801], [178.95000147819519, 1435521131]]},{"target": "Upper", "datapoints": [[186.77999925613403, 1435520801], [178.95000147819519, 1435521131]]},{"target": "Lower", "datapoints": [[186.77999925613403, 1435520801], [178.95000147819519, 1435521131]]}]
Run Code Online (Sandbox Code Playgroud)
有没有人在python中做过这样的事情?任何想法如何开始?
J R*_*ape 26
在您的问题文本中,您清楚地表明您希望回归输出的上限和下限以及输出预测.您还提到使用Holt-Winters算法进行预测.
其他sklearn回答者建议的软件包很有用,但你可能会注意到LinearRegression并没有给你提供"开箱即用"的错误界限,statsmodels 现在不提供Holt-Winters.
因此,我建议尝试使用Holt-Winters的这种实现.不幸的是它的许可证还不清楚,所以我不能在这里完整地复制它.现在,我不确定你是否真的想要Holt-Winters(季节性)预测,或Holt的线性指数平滑算法.我猜测后者给出了帖子的标题.因此,您可以使用linear()链接库的功能.这里对感兴趣的读者详细描述了该技术.
为了不提供仅链接的答案 - 我将在这里描述主要功能.定义了一个获取数据的函数,即
def linear(x, fc, alpha = None, beta = None):
Run Code Online (Sandbox Code Playgroud)
x是要拟合的数据,fc是你想要预测的时间步数,alpha和beta采用他们通常的Holt-Winters含义:大致是一个参数来分别控制平滑到"水平"和"趋势"的数量.如果alpha或未beta指定,则使用它们来估计它们scipy.optimize.fmin_l_bfgs_b
以最小化RMSE.
该函数通过循环现有数据点简单地应用Holt算法,然后返回预测,如下所示:
return Y[-fc:], alpha, beta, rmse
Run Code Online (Sandbox Code Playgroud)
其中Y[-fc:]是预测点,alpha并且beta是实际使用,并且值rmse是均方根误差.不幸的是,正如您所看到的,没有上限或下限置信区间.顺便说一下 - 我们应该将它们称为预测间隔.
Holt的算法和Holt-Winters算法是指数平滑技术,找到由它们生成的预测的置信区间是一个棘手的主题.它们被称为"经验法则"方法,并且在Holt-Winters乘法算法的情况下,没有"基础统计模型".但是,本页的 最后一个脚注声称:
通过将它们视为ARIMA模型的特殊情况,可以计算由指数平滑模型产生的长期预测周围的置信区间.(注意:并非所有软件都能正确计算这些模型的置信区间.)置信区间的宽度取决于(i)模型的RMS误差,(ii)平滑类型(简单或线性); (iii)平滑常数的值(s); (iv)您预测的未来期数.通常,随着α在SES模型中变大,间隔扩散得更快,并且当使用线性而不是简单平滑时,它们扩散得更快.
我们在这里看到ARIMA(0,2,2)模型等效于具有加性误差的Holt线性模型
您在评论中表明您"可以轻松地在R中执行此操作".我想你可能习惯holt()了forecast包中提供的功能
R,因此期望这样的间隔.在这种情况下 - 您可以调整python库,以便在相同的基础上将它们提供给您.
查看R holt代码,我们可以看到它返回一个基于的对象forecast(ets(...).在引擎盖下 - 这最终调用了
这个函数class1,它返回一个均值mu和方差
var(以及cj我必须承认我不理解的).方差用于计算此处的上限和下限.
要在Python中做类似的事情 - 我们需要生成类似于class1R函数的东西来估计每个预测的方差.此函数获取模型拟合中发现的残差,并在每个时间步长乘以一个因子,以得到该时间步长的方差. 在线性Holt算法的特定情况下,因子是alpha + k*beta
其中k是时间步长预测的数量的累积和.在每个预测点处获得该方差后,将错误视为正态分布,并从正态分布中获取X%值.
这里有一个想法如何在Python中执行此操作(使用我链接的代码作为线性函数)
#Copy, import or reimplement the RMSE and linear function from
#https://gist.github.com/andrequeiroz/5888967
#factor in case there are not 1 timestep per day - in your case
#assuming the timesteps are UTC epoch - I think they're 5 min
# spaced i.e. 288 per day
timesteps_per_day = 288
# Note - big assumption here - your known data will be regular in time
# i.e. timesteps_per_day observations per day. From the timestamps this seems valid.
# if you can't guarantee that - you'll need to interpolate the data
def holt_predict(data, timestamps, forecast_days, pred_error_level = 0.95):
forecast_timesteps = forecast_days*timesteps_per_day
middle_predictions, alpha, beta, rmse = linear(data,int(forecast_timesteps))
cum_error = [beta+alpha]
for k in range(1,forecast_timesteps):
cum_error.append(cum_error[k-1] + k*beta + alpha)
cum_error = np.array(cum_error)
#Use some numpy multiplication to get the intervals
var = cum_error * rmse**2
# find the correct ppf on the normal distribution (two-sided)
p = abs(scipy.stats.norm.ppf((1-pred_error_level)/2))
interval = np.sqrt(var) * p
upper = middle_predictions + interval
lower = middle_predictions - interval
fcast_timestamps = [timestamps[-1] + i * 86400 / timesteps_per_day for i in range(forecast_timesteps)]
ret_value = []
ret_value.append({'target':'Forecast','datapoints': zip(middle_predictions, fcast_timestamps)})
ret_value.append({'target':'Upper','datapoints':zip(upper,fcast_timestamps)})
ret_value.append({'target':'Lower','datapoints':zip(lower,fcast_timestamps)})
return ret_value
if __name__=='__main__':
import numpy as np
import scipy.stats
from math import sqrt
null = None
data_in = [{"target": "average", "datapoints": [[null, 1435688679],
[34.870499801635745, 1435688694], [null, 1435688709], [null,
1435688724], [null, 1435688739], [null, 1435688754], [null, 1435688769],
[null, 1435688784], [null, 1435688799], [null, 1435688814], [null,
1435688829], [null, 1435688844], [null, 1435688859], [null, 1435688874],
[null, 1435688889], [null, 1435688904], [null, 1435688919], [null,
1435688934], [null, 1435688949], [null, 1435688964], [null, 1435688979],
[38.180000209808348, 1435688994], [null, 1435689009], [null,
1435689024], [null, 1435689039], [null, 1435689054], [null, 1435689069],
[null, 1435689084], [null, 1435689099], [null, 1435689114], [null,
1435689129], [null, 1435689144], [null, 1435689159], [null, 1435689174],
[null, 1435689189], [null, 1435689204], [null, 1435689219], [null,
1435689234], [null, 1435689249], [null, 1435689264], [null, 1435689279],
[30.79849989414215, 1435689294], [null, 1435689309], [null, 1435689324],
[null, 1435689339], [null, 1435689354], [null, 1435689369], [null,
1435689384], [null, 1435689399], [null, 1435689414], [null, 1435689429],
[null, 1435689444], [null, 1435689459], [null, 1435689474], [null,
1435689489], [null, 1435689504], [null, 1435689519], [null, 1435689534],
[null, 1435689549], [null, 1435689564]]}]
#translate the data. There may be better ways if you're
#prepared to use pandas / input data is proper json
time_series = data_in[0]["datapoints"]
epoch_in = []
Y_observed = []
for (y,x) in time_series:
if y and x:
epoch_in.append(x)
Y_observed.append(y)
#Pass in the number of days to forecast
fcast_days = 30
res = holt_predict(Y_observed,epoch_in,fcast_days)
data_out = data_in + res
#data_out now holds the data as you wanted it.
#Optional plot of results
import matplotlib.pyplot as plt
plt.plot(epoch_in,Y_observed)
m,tstamps = zip(*res[0]['datapoints'])
u,tstamps = zip(*res[1]['datapoints'])
l,tstamps = zip(*res[2]['datapoints'])
plt.plot(tstamps,u, label='upper')
plt.plot(tstamps,l, label='lower')
plt.plot(tstamps,m, label='mean')
plt.show()
Run Code Online (Sandbox Code Playgroud)
NB我给出的输出tuple在您的对象中添加了类型的点.如果你真的需要list,然后更换zip(upper,fcast_timestamps)与map(list,zip(upper,fcast_timestamps))其中代码添加upper,lower并Forecasthttp://stardict.sourceforge.net/Dictionaries.php下载到的结果.
此代码适用于Holt线性算法的特定情况 - 它不是计算正确预测间隔的通用方法.
您的样本输入数据似乎有很多null且只有3个真正的数据点.这不太可能成为进行时间序列预测的良好基础 - 特别是因为它们似乎都在15分钟内,并且您试图预测长达3个月!确实 - 如果你把这些数据输入R
holt(),它会说:
You've got to be joking. I need more data!
我假设你有一个更大的数据集来测试.我在2015年的股市开盘价上尝试了上述代码,似乎给出了合理的结果(见下文).

您可能认为预测间隔看起来有点宽.来自R预测模块的作者的这篇博客暗示这是有意的,但是:
"均值的置信区间比预测区间窄得多"
Scikit是python的一个很好的模块
X和Y变量必须分成两个数组,如果你要绘制它们(X,Y),一个索引将与另一个数组匹配.
所以我想在你的时间序列数据中你会将X和Y值分开如下:
null = None
time_series = [{"target": "average", "datapoints": [[null, 1435688679], [34.870499801635745, 1435688694], [null, 1435688709], [null, 1435688724], [null, 1435688739], [null, 1435688754], [null, 1435688769], [null, 1435688784], [null, 1435688799], [null, 1435688814], [null, 1435688829], [null, 1435688844], [null, 1435688859], [null, 1435688874], [null, 1435688889], [null, 1435688904], [null, 1435688919], [null, 1435688934], [null, 1435688949], [null, 1435688964], [null, 1435688979], [38.180000209808348, 1435688994], [null, 1435689009], [null, 1435689024], [null, 1435689039], [null, 1435689054], [null, 1435689069], [null, 1435689084], [null, 1435689099], [null, 1435689114], [null, 1435689129], [null, 1435689144], [null, 1435689159], [null, 1435689174], [null, 1435689189], [null, 1435689204], [null, 1435689219], [null, 1435689234], [null, 1435689249], [null, 1435689264], [null, 1435689279], [30.79849989414215, 1435689294], [null, 1435689309], [null, 1435689324], [null, 1435689339], [null, 1435689354], [null, 1435689369], [null, 1435689384], [null, 1435689399], [null, 1435689414], [null, 1435689429], [null, 1435689444], [null, 1435689459], [null, 1435689474], [null, 1435689489], [null, 1435689504], [null, 1435689519], [null, 1435689534], [null, 1435689549], [null, 1435689564]]}]
# assuming the time series is the X axis value
input_X_vars = []
input_Y_vars = []
for pair in time_series[0]["datapoints"]:
if (pair[0] != None and pair[1] != None):
input_X_vars.append(pair[1])
input_Y_vars.append(pair[0])
import matplotlib.pyplot as plt
import numpy as np
from sklearn import datasets, linear_model
regr = linear_model.LinearRegression()
regr.fit(input_X_vars, input_Y_vars)
test_X_vars = [1435688681, 1435688685]
results = regr.predict(test_X_vars)
forecast_append = {"target": "Lower", "datapoints": results}
time_series.append(forecast_append)
Run Code Online (Sandbox Code Playgroud)
我将null设置为None,因为'null'关键字被解析为python中的一个变量.
注意:这不是详细的规范答案,而是对适用于您的领域(统计模型)的可用库的引用。
对于Python,你可以使用:
scipy.stats.linregress
scipy.stats.linregress)sklearn: 这是文档。有一些不错的文章:
statsmodels和sklearn。| 归档时间: |
|
| 查看次数: |
14199 次 |
| 最近记录: |