I'm using lm on a time series, which works quite well actually, and is super super fast.
Let's say my model is:
> formula <- y ~ x
I train this on a training set:
> train <- data.frame( x = seq(1,3), y = c(2,1,4) )
> model <- lm( formula, train )
... and I can make predictions for new data:
> test <- data.frame( x = seq(4,6) )
> test$y <- predict( model, newdata = test )
> test
x y
1 4 4.333333
2 5 5.333333
3 6 6.333333
This works great, and it's really fast.
I want to add lagged variables to the model. Now, I can do this by augmenting my original training set:
> train$y_1 <- c(0,train$y[1:nrow(train)-1])
> train
x y y_1
1 1 2 0
2 2 1 2
3 3 4 1
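As an aside, the indexing above relies on 1:nrow(train)-1 parsing as (1:nrow(train)) - 1, i.e. 0:2; it still gives the intended result because R drops the zero index. A small helper along these lines (make_lags is just a name made up for this sketch, not from any package) does the same thing more explicitly and generalizes to deeper lags:
# hypothetical helper: add columns col_1, ..., col_k holding the first k lags of col,
# padding the start with zeros as in the manual y_1 column above
make_lags <- function( df, col, k ) {
  for( j in 1:k ) {
    df[[ paste0(col, "_", j) ]] <- c( rep(0, j), head(df[[col]], -j) )
  }
  df
}
train <- make_lags( train, "y", 1 )   # gives the same y_1 column as above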
Update the formula:
formula <- y ~ x * y_1
... and training works fine:
> model <- lm( formula, train )
> # no errors here
However, the problem is that there is no way to use predict here, because there is no way to populate y_1 in the test set in a batch manner.
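The only batch-free workaround I can see with plain lm is a one-step-ahead loop that feeds each prediction back in as the next row's y_1, something like the sketch below, which is exactly the kind of row-by-row code I'd like to avoid:
# row-by-row workaround: fill y_1 from the previous prediction as we go
# (with only three training rows the toy fit above is rank-deficient, so
# predict will warn, but the loop illustrates the idea)
test <- data.frame( x = seq(4,6), y = NA, y_1 = NA )
prev_y <- train$y[ nrow(train) ]          # last observed y
for( i in 1:nrow(test) ) {
  test$y_1[i] <- prev_y
  test$y[i]   <- predict( model, newdata = test[i,] )
  prev_y      <- test$y[i]
}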
Now, for lots of other regression things, there are very convenient ways to express them in the formula, such as poly(x,2) and so on, and these work directly with the unmodified training and test data.
So, I'm wondering whether there is some way of expressing lagged variables in the formula, so that predict can be used? Ideally, something like:
formula <- y ~ x * lag(y,-1)
model <- lm( formula, train )
test$y <- predict( model, newdata = test )
... without having to augment (not sure if that's the right word) the training and test datasets, and just being able to use predict directly?
Answer (Dirk Eddelbuettel):
Have a look at e.g. the dynlm package, which gives you lag operators. More generally, the Task Views on Econometrics and Time Series will have lots more for you to look at.
Here is the beginning of its example, with one-month and twelve-month lags:
R> library(dynlm)
R> data("UKDriverDeaths", package = "datasets")
R> uk <- log10(UKDriverDeaths)
R> dfm <- dynlm(uk ~ L(uk, 1) + L(uk, 12))
R> dfm
Time series regression with "ts" data:
Start = 1970(1), End = 1984(12)

Call:
dynlm(formula = uk ~ L(uk, 1) + L(uk, 12))

Coefficients:
(Intercept)     L(uk, 1)    L(uk, 12)
      0.183        0.431        0.511

R>
Following Dirk's suggestion of dynlm, I couldn't figure out how to do prediction with it, but searching for that led me to the dyn package via https://stats.stackexchange.com/questions/6758/1-step-ahead-predictions-with-dynlm
Then, after a few hours of experimentation, I came up with the following function to handle prediction. There were a number of 'gotchas' along the way, e.g. it seems you can't rbind time series, and the result of predict is offset by start, and a bunch of things like that, so I feel this answer adds significantly compared to just naming a package, though I did upvote Dirk's answer.
So, a working solution is:
The dyn package, with a predictDyn method:
# pass in training data, test data,
# it will step through one by one
# need to give dependent var name, so that it can make this into a timeseries
predictDyn <- function( model, train, test, dependentvarname ) {
    Ntrain <- nrow(train)
    Ntest <- nrow(test)
    # can't rbind ts's apparently, so convert to numeric first
    train[,dependentvarname] <- as.numeric(train[,dependentvarname])
    test[,dependentvarname] <- as.numeric(test[,dependentvarname])
    testtraindata <- rbind( train, test )
    testtraindata[,dependentvarname] <- ts( as.numeric( testtraindata[,dependentvarname] ) )
    # one-step-ahead: fill in the dependent variable row by row, using what has
    # been filled in so far
    for( i in 1:Ntest ) {
        result <- predict( model, newdata = testtraindata, subset = 1:(Ntrain+i-1) )
        # the result is a ts whose index starts at start(result), so shift accordingly
        testtraindata[Ntrain+i,dependentvarname] <- result[Ntrain + i + 1 - start(result)][1]
    }
    return( testtraindata[(Ntrain+1):(Ntrain + Ntest),] )
}
Example usage:
library("dyn")
# size of training and test data
N <- 6
predictN <- 10
# create training data, which we can get exact fit on, so we can check the results easily
traindata <- c(1,2)
for( i in 3:N ) { traindata[i] <- 0.5 + 1.3 * traindata[i-2] + 1.7 * traindata[i-1] }
train <- data.frame( y = ts( traindata ), foo = 1)
# create testing data, bunch of NAs
test <- data.frame( y = ts( rep(NA,predictN) ), foo = 1)
# fit a model
model <- dyn$lm( y ~ lag(y,-1) + lag(y,-2), train )
# look at the model, it's a perfect fit. Nice!
print(model)
test <- predictDyn( model, train, test, "y" )
print(test)
# nice plot
plot(test$y, type='l')
Output:
> model

Call:
lm(formula = dyn(y ~ lag(y, -1) + lag(y, -2)), data = train)

Coefficients:
(Intercept)   lag(y, -1)   lag(y, -2)
        0.5          1.7          1.3
> test
y foo
7 143.2054 1
8 325.6810 1
9 740.3247 1
10 1682.4373 1
11 3823.0656 1
12 8686.8801 1
13 19738.1816 1
14 44848.3528 1
15 101902.3358 1
16 231537.3296 1
Edit: hmm, this is super slow. Even when I constrain the data to a constant small number of rows of the data set via subset, each prediction takes around 24 milliseconds, or, for my task, 0.024*7*24*8*20*10/60/60 = 1.792 hours :-O
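For what it's worth, one way around the slowness in this particular example (just a sketch, continuing from the variables defined in the example above, and not using anything from the dyn package) is to skip predict entirely: since the model is only y ~ lag(y,-1) + lag(y,-2), the forecast can be rolled forward directly from the fitted coefficients:
# roll the recursion forward from the fitted coefficients instead of calling predict each step
co <- coef(model)                  # (Intercept), lag(y, -1), lag(y, -2)
y  <- as.numeric(train$y)
for( i in 1:predictN ) {
  n <- length(y)
  y[n+1] <- co[1] + co[2] * y[n] + co[3] * y[n-1]
}
print( y[(N+1):(N+predictN)] )     # should reproduce the predictDyn forecasts above
Of course this only works because the lag structure here is simple enough to write out by hand; for an arbitrary formula you would still need something like predictDyn above.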