R regression with month as the independent variable (coefficient labels)

Jim*_*myT 5 r

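I'd like to know whether there is a cleaner way than dummy-coding the months by hand (e.g. isJan, isFeb, ...) to get more meaningful names for the independent variables (the rows below the intercept). My real data set is fairly large, so I have simulated a simple one here.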

#create simulated data set with sales, and date
sales <- rnorm(1000, mean = 1000, sd = 40)
dates <- seq(from = 14610, to = 15609)
data <- cbind(sales, dates)

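# regression with months
# (presumably run with a package such as chron loaded, whose months() accepts
#  numeric dates and returns an ordered factor; base months() expects a Date)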
model <- lm(sales ~ months(dates))
summary(model) 

I would like the coefficient labels to show the actual month they refer to. At the moment my output looks like this:

                 Estimate Std. Error t value Pr(>|t|)    
(Intercept)      999.1934     1.2673 788.432   <2e-16 ***
months(dates).L   -4.9537     4.5689  -1.084   0.2785    
months(dates).Q   -6.4931     4.4211  -1.469   0.1422    
months(dates).C   -5.5078     4.4180  -1.247   0.2128    
months(dates)^4    2.3713     4.4864   0.529   0.5972    
months(dates)^5   -1.7749     4.4605  -0.398   0.6908    
months(dates)^6    1.5774     4.4555   0.354   0.7234    
months(dates)^7  -10.9954     4.4511  -2.470   0.0137 *  
months(dates)^8   -0.9627     4.4032  -0.219   0.8270    
months(dates)^9    1.8847     4.2996   0.438   0.6612    
months(dates)^10  -8.5990     4.1776  -2.058   0.0398 *  
months(dates)^11   7.8436     4.1292   1.900   0.0578 . 

Thanks in advance, JT

Rei*_*son 6

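The problem you are running into is that R has created an ordered factor, and the contrasts produced for ordered factors are polynomial contrasts (.L is linear, .Q is quadratic, .C is cubic, and ^n is the n-th order polynomial). It is probably better to define month as a plain (unordered) factor, set its first level to January, and then fit the model.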
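To see where the .L, .Q and .C labels come from, here is a minimal sketch using base R only. The four-level factor `m_ord` is a hypothetical stand-in for a month variable, and the comments assume the default `options("contrasts")` setting.

```r
# Hypothetical four-level "month" factor, once ordered and once unordered.
m_ord <- factor(c("Jan", "Feb", "Mar", "Apr"),
                levels = c("Jan", "Feb", "Mar", "Apr"), ordered = TRUE)
m_unord <- factor(m_ord, ordered = FALSE)

# With the defaults, unordered factors use contr.treatment and
# ordered factors use contr.poly.
getOption("contrasts")

# Polynomial contrasts: columns are named .L, .Q, .C
colnames(contrasts(m_ord))

# Treatment contrasts: columns are named after the non-reference levels
colnames(contrasts(m_unord))
```

Because `lm()` builds coefficient names from these column names, an ordered factor yields polynomial terms while an unordered factor yields one term per non-reference level.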

If you are in an English locale, you can use the month.name or month.abb constants as follows

set.seed(42)
dat <- data.frame(sales = rnorm(1000, mean = 1000, sd = 40),
                  dates = as.Date(seq(from = 14610, to = 15609),
                                  origin = "1970-01-01"))
dat <- transform(dat, month = factor(format(dates, format = "%B"),
                                     levels = month.name))

This gives

> head(dat)
      sales      dates   month
1 1054.8383 2010-01-01 January
2  977.4121 2010-01-02 January
3 1014.5251 2010-01-03 January
4 1025.3145 2010-01-04 January
5 1016.1707 2010-01-05 January
6  995.7550 2010-01-06 January
> with(dat, levels(month))
 [1] "January"   "February"  "March"     "April"     "May"      
 [6] "June"      "July"      "August"    "September" "October"  
[11] "November"  "December"

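Note that the levels are in their logical (calendar) order rather than alphabetical order. If you are in a non-English locale, the output of "%B" will be the month names in your local language or convention. In that case you need to supply the correct levels, as a character vector, to the levels argument in the code above.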
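For the non-English case, two possible workarounds are sketched below. Neither is part of the original answer: the first builds the factor from the month number (`"%m"`), which does not depend on the locale, and the second generates the locale's own month names to use as levels. The vector `dates` and the names `month_num` and `local_names` are made up for illustration.

```r
# Hypothetical example dates
dates <- as.Date(c("2010-01-15", "2010-02-15", "2010-12-15"))

# Option 1: use the numeric month ("01".."12") and label it with the
# English constant month.name, regardless of the current locale.
month_num <- as.integer(format(dates, "%m"))
month <- factor(month_num, levels = 1:12, labels = month.name)
as.character(month)

# Option 2: keep localized names, but generate the 12 levels in calendar
# order from the locale itself instead of typing them by hand.
local_names <- format(ISOdate(2000, 1:12, 1), "%B")
month_local <- factor(format(dates, "%B"), levels = local_names)
levels(month_local)
```

Option 1 gives the same coefficient names on every machine, which may matter if the script is shared; Option 2 keeps the labels in the user's own language.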

This data set can then be used to fit the model, which now gives more meaningful coefficient names

> mod <- lm(sales ~ month, data = dat)
> summary(mod)

Call:
lm(formula = sales ~ month, data = dat)

Residuals:
     Min       1Q   Median       3Q      Max 
-140.333  -24.551    0.108   28.102  134.349 

Coefficients:
                Estimate Std. Error t value Pr(>|t|)    
(Intercept)    1001.7034     4.1567 240.983   <2e-16 ***
monthFebruary    -8.3618     6.0153  -1.390    0.165    
monthMarch       -0.5347     5.8785  -0.091    0.928    
monthApril       -7.5618     5.9273  -1.276    0.202    
monthMay         -2.2961     5.8785  -0.391    0.696    
monthJune         3.5091     5.9273   0.592    0.554    
monthJuly        -4.9975     5.8785  -0.850    0.395    
monthAugust      -0.3558     5.8785  -0.061    0.952    
monthSeptember    3.7597     5.9970   0.627    0.531    
monthOctober     -2.5948     6.5724  -0.395    0.693    
monthNovember   -10.5670     6.6378  -1.592    0.112    
monthDecember    -6.9064     6.5724  -1.051    0.294    
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 

Residual standard error: 40.09 on 988 degrees of freedom
Multiple R-squared: 0.01173,    Adjusted R-squared: 0.0007317 
F-statistic: 1.066 on 11 and 988 DF,  p-value: 0.3854

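In the above, note that January is the first level, so its mean is the (Intercept) estimate, and the other estimates are deviations from the January mean. An alternative parameterization of the model suppresses the intercept: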

> mod2 <- lm(sales ~ month - 1, data = dat)
> summary(mod2)

Call:
lm(formula = sales ~ month - 1, data = dat)

Residuals:
     Min       1Q   Median       3Q      Max 
-140.333  -24.551    0.108   28.102  134.349 

Coefficients:
               Estimate Std. Error t value Pr(>|t|)    
monthJanuary   1001.703      4.157   241.0   <2e-16 ***
monthFebruary   993.342      4.348   228.5   <2e-16 ***
monthMarch     1001.169      4.157   240.9   <2e-16 ***
monthApril      994.142      4.225   235.3   <2e-16 ***
monthMay        999.407      4.157   240.4   <2e-16 ***
monthJune      1005.213      4.225   237.9   <2e-16 ***
monthJuly       996.706      4.157   239.8   <2e-16 ***
monthAugust    1001.348      4.157   240.9   <2e-16 ***
monthSeptember 1005.463      4.323   232.6   <2e-16 ***
monthOctober    999.109      5.091   196.3   <2e-16 ***
monthNovember   991.136      5.175   191.5   <2e-16 ***
monthDecember   994.797      5.091   195.4   <2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 

Residual standard error: 40.09 on 988 degrees of freedom
Multiple R-squared: 0.9984, Adjusted R-squared: 0.9984 
F-statistic: 5.175e+04 on 12 and 988 DF,  p-value: < 2.2e-16

Now the estimates are the monthly means, and each t-test tests the hypothesis that the mean for that individual month is zero (0).
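The relationship between the two parameterizations can be checked directly. The sketch below rebuilds the simulated data so that it is self-contained; as an assumption made here for portability, the month factor is built from the month number rather than from "%B", so it does not depend on an English locale. `month_means` and `recon` are names introduced for illustration.

```r
set.seed(42)
dat <- data.frame(sales = rnorm(1000, mean = 1000, sd = 40),
                  dates = as.Date(seq(from = 14610, to = 15609),
                                  origin = "1970-01-01"))
# Locale-independent month factor with January as the first level
dat$month <- factor(as.integer(format(dat$dates, "%m")),
                    levels = 1:12, labels = month.name)

mod  <- lm(sales ~ month, data = dat)      # intercept + deviations
mod2 <- lm(sales ~ month - 1, data = dat)  # one mean per month

# Sample mean of sales within each month
month_means <- tapply(dat$sales, dat$month, mean)

# The no-intercept coefficients are the per-month sample means
all.equal(unname(coef(mod2)), as.vector(month_means))

# Intercept (January mean) plus each deviation gives the same means
recon <- c(coef(mod)[1], coef(mod)[1] + coef(mod)[-1])
all.equal(unname(recon), as.vector(month_means))
```

Both models give identical fitted values and residuals; only the meaning of the coefficients, and therefore of their t-tests, differs.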