modelr: Fitting multiple models with resampled data

tch*_*rty 7 r dplyr tidyr purrr

In the tidy model of doing Data Science (TM) implemented in modelr, resampled data are organized using list-columns:

library(modelr)
library(tidyverse)

# create the k-folds
df_heights_resampled = heights %>% 
  crossv_kfold(k = 10, id = "Resample ID")

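A model can be fitted to each training dataset in the list-column train by using map, and performance metrics can be computed by mapping over the list-column test.

If this needs to be done with multiple models, it has to be repeated for each model.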

# create a list of formulas 
formulas_heights = formulas(
  .response = ~ income, 
  model1  = ~ height + weight + marital + sex,
  model2 = ~ height + weight + marital + sex + education
) 

# fit each of the models in the list of formulas
df_heights_resampled = df_heights_resampled %>% 
  mutate(
    model1 = map(train, function(train_data) {
      lm(formulas_heights[[1]], data = train_data)
    }),
    model2 = map(train, function(train_data) {
      lm(formulas_heights[[2]], data = train_data)
    })
  )

# score the models on the test sets
df_heights_resampled = df_heights_resampled %>% 
  mutate(
    rmse1 = map2_dbl(.x = model1, .y = test, .f = rmse),
    rmse2 = map2_dbl(.x = model2, .y = test, .f = rmse)
  )

This gives:

> df_heights_resampled
# A tibble: 10 × 7
            train           test `Resample ID`   model1   model2    rmse1    rmse2
           <list>         <list>         <chr>   <list>   <list>    <dbl>    <dbl>
1  <S3: resample> <S3: resample>            01 <S3: lm> <S3: lm> 58018.35 53903.99
2  <S3: resample> <S3: resample>            02 <S3: lm> <S3: lm> 55117.37 50279.38
3  <S3: resample> <S3: resample>            03 <S3: lm> <S3: lm> 49005.82 44613.93
4  <S3: resample> <S3: resample>            04 <S3: lm> <S3: lm> 55437.07 51068.90
5  <S3: resample> <S3: resample>            05 <S3: lm> <S3: lm> 48845.35 44673.88
6  <S3: resample> <S3: resample>            06 <S3: lm> <S3: lm> 58226.69 54010.50
7  <S3: resample> <S3: resample>            07 <S3: lm> <S3: lm> 56571.93 53322.41
8  <S3: resample> <S3: resample>            08 <S3: lm> <S3: lm> 46084.82 42294.50
9  <S3: resample> <S3: resample>            09 <S3: lm> <S3: lm> 59762.22 54814.55
10 <S3: resample> <S3: resample>            10 <S3: lm> <S3: lm> 45328.48 41882.79

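Question:

This can get cumbersome very quickly if the number of models to explore is large. modelr provides the fit_with function, which allows iterating over multiple models (as characterized by multiple formulas), but it does not seem to allow for a list-column like train in the code above. I assume one of the *map* family of functions will make this possible (invoke_map?), but I have not been able to figure out how.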
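For context, a minimal sketch of how fit_with is usually called on a single data frame might look like the following. It assumes the same formulas as above and fits them to the full heights data rather than to the resampled folds; the name fits_full is only illustrative.

```r
library(modelr)

# same formulas as in the question
formulas_heights <- formulas(
  .response = ~ income,
  model1 = ~ height + weight + marital + sex,
  model2 = ~ height + weight + marital + sex + education
)

# fit_with() takes one dataset, a model function, and a list of formulas,
# and returns a list with one fitted model per formula
fits_full <- fit_with(heights, lm, formulas_heights)
length(fits_full)
```

This covers the "many formulas, one dataset" case; the question is how to extend it to "many formulas, many datasets held in a list-column".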

Axe*_*man 3

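You could build the calls programmatically using map and lazyeval::interp. I'm curious whether there is a pure purrr solution, but the problem is that you want to create multiple columns, and for that you need multiple calls. Maybe a purrr solution would create another list-column that contains all the models.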

library(lazyeval)
model_calls <- map(formulas_heights,
                   ~interp(~map(train, ~lm(form, data = .x)), form = .x))
score_calls <- map(names(model_calls),
                   ~interp(~map2_dbl(.x = m, .y = test, .f = rmse), m = as.name(.x)))
names(score_calls) <- paste0("rmse", seq_along(score_calls))

df_heights_resampled %>% mutate_(.dots = c(model_calls, score_calls))

# A tibble: 10 × 7
            train           test `Resample ID`   model1   model2    rmse1    rmse2
           <list>         <list>         <chr>   <list>   <list>    <dbl>    <dbl>
1  <S3: resample> <S3: resample>            01 <S3: lm> <S3: lm> 44720.86 41452.07
2  <S3: resample> <S3: resample>            02 <S3: lm> <S3: lm> 54174.38 48823.03
3  <S3: resample> <S3: resample>            03 <S3: lm> <S3: lm> 56854.21 52725.62
4  <S3: resample> <S3: resample>            04 <S3: lm> <S3: lm> 53312.38 48797.48
5  <S3: resample> <S3: resample>            05 <S3: lm> <S3: lm> 61883.90 57469.17
6  <S3: resample> <S3: resample>            06 <S3: lm> <S3: lm> 55709.83 50867.26
7  <S3: resample> <S3: resample>            07 <S3: lm> <S3: lm> 53036.06 48698.07
8  <S3: resample> <S3: resample>            08 <S3: lm> <S3: lm> 55986.83 52717.94
9  <S3: resample> <S3: resample>            09 <S3: lm> <S3: lm> 51738.60 48006.74
10 <S3: resample> <S3: resample>            10 <S3: lm> <S3: lm> 45061.22 41480.35
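The idea of a single list-column holding all the models could be sketched with purrr alone, roughly as follows. This is only a sketch under assumptions: it reuses the heights data and the two formulas from the question, and the names folds, scored, models, rmse_by_model and rmse_matrix are illustrative rather than part of modelr.

```r
library(modelr)
library(dplyr)
library(purrr)

set.seed(1)  # folds are random; the seed value is arbitrary

formulas_heights <- formulas(
  .response = ~ income,
  model1 = ~ height + weight + marital + sex,
  model2 = ~ height + weight + marital + sex + education
)

folds <- crossv_kfold(heights, k = 10, id = "Resample ID")

scored <- folds %>%
  mutate(
    # for each fold: a named list holding one lm per formula
    models = map(train, function(train_data) {
      map(formulas_heights, function(f) lm(f, data = train_data))
    }),
    # for each fold: a named numeric vector with one RMSE per model
    rmse_by_model = map2(models, test, function(fits, test_data) {
      map_dbl(fits, function(fit) rmse(fit, test_data))
    })
  )

# stack the per-fold vectors into a folds x models matrix and average per model
rmse_matrix <- do.call(rbind, scored$rmse_by_model)
colMeans(rmse_matrix)
```

Adding a third formula to formulas_heights then requires no further changes, at the cost of having the scores in a nested column instead of one column per model.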