Smoothing each group with `do`

Hug*_*ugh 3 r dplyr

I have some data, a sample of which is below. My goal is to fit a `gam` to each year and store the model's predicted values as a new column.

fertility <- structure(list(AGE = c(15L, 16L, 17L, 18L, 19L, 20L, 21L, 22L, 
23L, 24L, 25L, 26L, 27L, 28L, 29L, 30L, 31L, 32L, 33L, 34L, 35L, 
36L, 37L, 38L, 39L, 40L, 41L, 42L, 43L, 44L, 15L, 16L, 17L, 18L, 
19L, 20L, 21L, 22L, 23L, 24L, 25L, 26L, 27L, 28L, 29L, 30L, 31L, 
32L, 33L, 34L, 35L, 36L, 37L, 38L, 39L, 40L, 41L, 42L, 43L, 44L
), Year = c(1930, 1930, 1930, 1930, 1930, 1930, 1930, 1930, 1930, 
1930, 1930, 1930, 1930, 1930, 1930, 1930, 1930, 1930, 1930, 1930, 
1930, 1930, 1930, 1930, 1930, 1930, 1930, 1930, 1930, 1930, 1931, 
1931, 1931, 1931, 1931, 1931, 1931, 1931, 1931, 1931, 1931, 1931, 
1931, 1931, 1931, 1931, 1931, 1931, 1931, 1931, 1931, 1931, 1931, 
1931, 1931, 1931, 1931, 1931, 1931, 1931), fertility = c(5.170284269, 
14.18135114, 27.69795144, 44.61216712, 59.08896308, 89.66036496, 
105.4563852, 120.1754041, 137.4074262, 148.7159407, 161.5645606, 
157.200515, 143.6340251, 127.8855125, 117.7343628, 159.2909484, 
126.6158821, 109.0681613, 86.98223678, 70.64470361, 111.0070633, 
86.15051988, 68.9204159, 55.92722274, 42.93402958, 56.84376018, 
39.35337243, 26.72142573, 18.46207596, 9.231037978, 4.769704534, 
13.08261815, 25.55198857, 41.15573626, 54.51090896, 81.99522459, 
96.44082973, 109.9015072, 125.6603492, 136.0020892, 148.679958, 
144.6639404, 132.1793638, 117.6867783, 108.345172, 144.2820726, 
114.68575, 98.79142865, 78.7865069, 63.9883456, 100.217918, 77.77726461, 
62.22181169, 50.49147014, 38.76112859, 52.48807067, 36.33789508, 
24.67387938, 17.04740757, 8.523703784)), class = c("tbl_df", 
"tbl", "data.frame"), row.names = c(NA, -60L), .Names = c("AGE", 
"Year", "fertility"))

So the non-dplyr, "dumb" way to do this is:

library(dplyr)

pred <- numeric(nrow(fertility))  # preallocate one prediction per row
count <- 0
for (i in 1930:1931){
  count <- count + 1
  temp <- filter(fertility, Year == i)
  mod <- mgcv::gam(fertility ~ s(AGE), data=temp)
  # fill the 30 slots that belong to this year
  pred[length(15:44) * (count - 1) + 1:30] <- predict(mod, newdata = data.frame(AGE = 15:44))
}

fertility1 <- mutate(fertility, pred = pred)

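But I would like a way to do this with dplyr. My idea was to use `do` to create a model for each group and then use `predict` to get the values. I can manage the first step, but I am struggling to do the second part in dplyr: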

library(mgcv)
library(dplyr)

  fertility %>%
    #filter(!is.na(fertility)) %>%  # not sure if this is necessary
    group_by(Year) %>%
    dplyr::do(model = mgcv::gam(fertility ~ s(AGE), data = .)) %>%
    left_join(fertility, .) %>%
    mutate(smoothed = predict(model, newdata = AGE))

I get the error message:

Error in UseMethod("predict") : 
  no applicable method for 'predict' applied to an object of class "list"

This probably means that dplyr does not remember that `model` is a model, and treats it as just a list element.
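One way around the list-column problem, offered only as a sketch, is to fit and predict inside the same `do()` call so that `predict` always receives a single `gam` object. The `toy` data frame below is a synthetic stand-in invented for illustration (it is not the real fertility data), and `smoothed` is just a name chosen here. Note also that `do()` is marked as superseded in recent dplyr releases, so this may not be the preferred idiom going forward.

```r
library(dplyr)
library(mgcv)

## Synthetic stand-in for the fertility data (an assumption, not the real values)
set.seed(1)
toy <- expand.grid(AGE = 15:44, Year = c(1930, 1931))
toy$fertility <- 150 * exp(-((toy$AGE - 26) / 8)^2) + rnorm(nrow(toy), sd = 5)

## For each Year: fit a gam on that group's rows and return the rows
## together with the predictions from that group's own model
smoothed <- toy %>%
  group_by(Year) %>%
  do(data.frame(., smoothed = as.numeric(
    predict(gam(fertility ~ s(AGE), data = .), newdata = .)))) %>%
  ungroup()

head(smoothed)
```

Because `do()` returns a data frame per group here instead of a named model column, the pieces are row-bound back together and no `left_join` is needed.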

Rei*_*son 10

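The smart way to do this is to use the factor-smooth interactions that mgcv already provides, either via the `by` argument of `s()` or via the newer `bs = "fs"` basis type. Here is an example using your data: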

library("mgcv")
## Make Year a factor
fertility <- transform(fertility, Year = factor(Year))
## Fit model using by terms - include factor as fixed effect too!
mod <- gam(fertility ~ Year + s(AGE, by = Year), data = fertility)
## Plot to see what form this model takes
plot(mod, pages = 1)


## Some prediction data
ages <- with(fertility, seq(min(AGE), max(AGE)))
## Need to replicate this once per Year
pdat <- with(fertility,
             data.frame(AGE = rep(ages, nlevels(Year)),
                        Year = rep(levels(Year), each = length(ages))))
## Add the fitted values to the prediction data
pdat <- transform(pdat, fitted = predict(mod, newdata = pdat))
head(pdat)

> head(pdat)
  AGE Year     fitted
1  15 1930 -0.8496705
2  16 1930 15.9568574
3  17 1930 33.0754019
4  18 1930 50.7419122
5  19 1930 68.9116594
6  20 1930 87.1306489

However, if all you want is predictions at the observed values of `AGE`, you can simply ask for the model's fitted values:

fertility <- transform(fertility, fitted = predict(mod))
head(fertility)

> head(fertility)
  AGE Year fertility     fitted
1  15 1930  5.170284 -0.8496705
2  16 1930 14.181351 15.9568574
3  17 1930 27.697951 33.0754019
4  18 1930 44.612167 50.7419122
5  19 1930 59.088963 68.9116594
6  20 1930 89.660365 87.1306489

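You could also look at the dedicated factor-smooth basis type, `bs = "fs"`; see `?smooth.terms` and `?factor.smooth.interaction` for details. Basically, these are efficient if you have many levels but want the smoother for each level to have the same smoothing parameter value.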
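A minimal sketch of the `bs = "fs"` form follows. It uses a synthetic `toy` data set with four hypothetical years (invented for illustration, not the asker's data), and the choices `k = 8` and `method = "REML"` are assumptions made for the example, not recommendations from the answer above.

```r
library(mgcv)

## Synthetic data: four hypothetical years with similar age profiles
set.seed(2)
toy <- expand.grid(AGE = 15:44, Year = factor(1930:1933))
peak <- 140 + 5 * as.integer(toy$Year)
toy$fertility <- peak * exp(-((toy$AGE - 26) / 8)^2) + rnorm(nrow(toy), sd = 5)

## One smooth of AGE per level of Year; the levels share smoothing parameters
mod_fs <- gam(fertility ~ s(AGE, Year, bs = "fs", k = 8),
              data = toy, method = "REML")

## Fitted values at the observed ages, one per row of the data
toy$fitted <- as.numeric(fitted(mod_fs))
summary(mod_fs)
```

As far as I understand the mgcv documentation, the `fs` smooths are fully penalized and behave like random curves that carry their own level-specific intercepts, which is why this sketch does not add `Year` as a separate parametric term.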

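The main advantage here is that you use all the data and fit a single model, which you can then interrogate in ways that would not be readily open to you if you fitted M separate models, such as investigating the differences between the smoothers for each year.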
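As a rough illustration of that kind of query, the sketch below estimates the difference in fitted fertility between two years at each age, using the linear predictor matrix from `predict(..., type = "lpmatrix")`. The data are synthetic (invented for the example), the names `new1930`, `new1931`, `X_diff`, and `year_diff` are hypothetical, and the interval is an approximate plus or minus two standard errors, not a formal simultaneous band.

```r
library(mgcv)

## Synthetic two-year data set (an assumption, not the real fertility values)
set.seed(3)
toy <- expand.grid(AGE = 15:44, Year = factor(c(1930, 1931)))
scale <- ifelse(toy$Year == "1930", 150, 135)
toy$fertility <- scale * exp(-((toy$AGE - 26) / 8)^2) + rnorm(nrow(toy), sd = 5)

## Single model with one smooth of AGE per Year, as in the answer above
mod <- gam(fertility ~ Year + s(AGE, by = Year), data = toy, method = "REML")

## Same age grid for both years
ages <- 15:44
new1930 <- data.frame(AGE = ages, Year = factor("1930", levels = levels(toy$Year)))
new1931 <- data.frame(AGE = ages, Year = factor("1931", levels = levels(toy$Year)))

## Each lpmatrix row maps the coefficients to a fitted value, so the
## row-wise difference maps them to the 1930 minus 1931 difference
X_diff <- predict(mod, newdata = new1930, type = "lpmatrix") -
  predict(mod, newdata = new1931, type = "lpmatrix")

diff_est <- as.numeric(X_diff %*% coef(mod))
diff_se  <- sqrt(rowSums((X_diff %*% vcov(mod)) * X_diff))

year_diff <- data.frame(AGE = ages, diff = diff_est,
                        lower = diff_est - 2 * diff_se,
                        upper = diff_est + 2 * diff_se)
head(year_diff)
```

The difference computed here includes the parametric `Year` effect as well as the two smooths, so it compares whole fitted curves and not only their shapes. This would not be possible with separately fitted models, because the standard errors rely on the joint covariance matrix of a single fit.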