Why do "id variables" in tidymodels/recipes act as predictors?

ap5*_*p53 2 r r-recipes tidymodels

This is the same question as "Use step_naomit for prediction and retain ID with tidymodels", but even though that question has an accepted answer, the OP's last comment there points out the problem of the "id variable" being used as a predictor, as can be seen by looking at model$fit$variable.importance.

I have a dataset with an "id variable" that I want to keep. I thought I could accomplish this through the recipe() specification.

library(tidymodels)

# label is an identifier variable I want to keep even though it's not
# a predictor
df <- tibble(label = 1:50, 
             x = rnorm(50, 0, 5), 
             f = factor(sample(c('a', 'b', 'c'), 50, replace = TRUE)),
             y = factor(sample(c('Y', 'N'), 50, replace = TRUE)) )

df_split <- initial_split(df, prop = 0.70)

# Make up any recipe: just note I specify 'label' as "id variable"
rec <- recipe(training(df_split)) %>% 
  update_role(label, new_role = "id variable") %>% 
  update_role(y, new_role = "outcome") %>% 
  update_role(x, new_role = "predictor") %>% 
  update_role(f, new_role = "predictor") %>% 
  step_corr(all_numeric(), -all_outcomes()) %>%
  step_dummy(all_predictors(),-all_numeric()) %>% 
  step_meanimpute(all_numeric(), -all_outcomes())

train_juiced <- prep(rec, training(df_split)) %>% juice()

logit_fit <- logistic_reg(mode = "classification") %>%
  set_engine(engine = "glm") %>% 
  fit(y ~ ., data = train_juiced)

# Why is label a variable in the model ?
logit_fit[['fit']][['coefficients']]
#> (Intercept)       label           x         f_b         f_c 
#>  1.03664140 -0.01405316  0.22357266 -1.80701531 -1.66285399

Created on 2020-01-27 by the reprex package (v0.3.0)

But even though I did specify label as an id variable, it was used as a predictor. So maybe I could use a formula with just the terms I want, and separately add label as an id variable.

rec <- recipe(training(df_split), y ~ x + f) %>% 
  update_role(label, new_role = "id variable") %>% 
  step_corr(all_numeric(), -all_outcomes()) %>%
  step_dummy(all_predictors(),-all_numeric()) %>% 
  step_meanimpute(all_numeric(), -all_outcomes())
#> Error in .f(.x[[i]], ...): object 'label' not found

Created on 2020-01-27 by the reprex package (v0.3.0)

I could try not mentioning label at all:

rec <- recipe(training(df_split), y ~ x + f) %>% 
  step_corr(all_numeric(), -all_outcomes()) %>%
  step_dummy(all_predictors(),-all_numeric()) %>% 
  step_meanimpute(all_numeric(), -all_outcomes())


train_juiced <- prep(rec, training(df_split)) %>% juice()

logit_fit <- logistic_reg(mode = "classification") %>%
  set_engine(engine = "glm") %>% 
  fit(y ~ ., data = train_juiced)

# label is no longer a variable in the model...
logit_fit[['fit']][['coefficients']]
#> (Intercept)           x         f_b         f_c 
#> -0.98950228  0.03734093  0.98945339  1.27014824

train_juiced
#> # A tibble: 35 x 4
#>          x y       f_b   f_c
#>      <dbl> <fct> <dbl> <dbl>
#>  1 -0.928  Y         1     0
#>  2  4.54   N         0     0
#>  3 -1.14   N         1     0
#>  4 -5.19   N         1     0
#>  5 -4.79   N         0     0
#>  6 -6.00   N         0     0
#>  7  3.83   N         0     1
#>  8 -8.66   Y         1     0
#>  9 -0.0849 Y         1     0
#> 10 -3.57   Y         0     1
#> # ... with 25 more rows

Created on 2020-01-27 by the reprex package (v0.3.0)

OK, the model works, but I have lost label.
How can I keep it?

Jul*_*lge 10

The main conceptual issue you are running into is that once you juice() a recipe, the result is just data, i.e. literally a data frame. When you use it to fit a model, the model has no way of knowing that some variables had special roles.

library(tidymodels)

# label is an identifier variable to keep even though it's not a predictor
df <- tibble(label = 1:50, 
             x = rnorm(50, 0, 5), 
             f = factor(sample(c('a', 'b', 'c'), 50, replace = TRUE)),
             y = factor(sample(c('Y', 'N'), 50, replace = TRUE)) )

df_split <- initial_split(df, prop = 0.70)

rec <- recipe(y ~ ., training(df_split)) %>% 
  update_role(label, new_role = "id variable") %>% 
  step_corr(all_numeric(), -all_outcomes()) %>%
  step_dummy(all_predictors(),-all_numeric()) %>% 
  step_meanimpute(all_numeric(), -all_outcomes()) %>%
  prep()

train_juiced <- juice(rec)
train_juiced
#> # A tibble: 35 x 5
#>    label     x y       f_b   f_c
#>    <int> <dbl> <fct> <dbl> <dbl>
#>  1     1  1.80 N         1     0
#>  2     3  1.45 N         0     0
#>  3     5 -5.00 N         0     0
#>  4     6 -4.15 N         1     0
#>  5     7  1.37 Y         0     1
#>  6     8  1.62 Y         0     1
#>  7    10 -1.77 Y         1     0
#>  8    11 -3.15 N         0     1
#>  9    12 -2.02 Y         0     1
#> 10    13  2.65 Y         0     1
#> # … with 25 more rows

Notice that train_juiced really is just a regular tibble. If you train a model on this tibble with fit(), it knows nothing about the recipe that was used to transform the data.
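One consequence of this is that nothing stops you from excluding the id column by hand in the model formula. A minimal sketch of that workaround (using the train_juiced tibble from above; logit_fit_manual is a name introduced here for illustration):

```r
# Workaround sketch: because train_juiced is a plain tibble, the id column
# can be excluded manually in the formula. `label` stays in the data but is
# not passed to the model as a predictor.
logit_fit_manual <- logistic_reg(mode = "classification") %>%
  set_engine(engine = "glm") %>%
  fit(y ~ . - label, data = train_juiced)
```

This works, but it relies on remembering to subtract every id column in every formula; the workflow approach below uses the roles recorded in the recipe instead.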

The tidymodels framework does have a way to train a model using the role information from a recipe. Probably the easiest way is to use a workflow:

logit_spec <- logistic_reg(mode = "classification") %>%
  set_engine(engine = "glm") 

wf <- workflow() %>%
  add_model(logit_spec) %>%
  add_recipe(rec)

logit_fit <- fit(wf, training(df_split))

# No more label in the model
logit_fit
#> ══ Workflow [trained] ══════════════════════════════════════════════════════
#> Preprocessor: Recipe
#> Model: logistic_reg()
#> 
#> ── Preprocessor ────────────────────────────────────────────────────────────
#> 3 Recipe Steps
#> 
#> ● step_corr()
#> ● step_dummy()
#> ● step_meanimpute()
#> 
#> ── Model ───────────────────────────────────────────────────────────────────
#> 
#> Call:  stats::glm(formula = formula, family = stats::binomial, data = data)
#> 
#> Coefficients:
#> (Intercept)            x          f_b          f_c  
#>     0.42331     -0.04234     -0.04991      0.64728  
#> 
#> Degrees of Freedom: 34 Total (i.e. Null);  31 Residual
#> Null Deviance:       45 
#> Residual Deviance: 44.41     AIC: 52.41

Created on 2020-02-15 by the reprex package (v0.3.0)

No more label in the model!
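As a usage sketch (assuming the objects defined above): calling predict() on a fitted workflow applies the recipe to new data automatically, so the id column can be carried along by binding it back onto the predictions. The name test_pred is introduced here for illustration.

```r
# Predict with the fitted workflow on the test set. The recipe is applied
# to the new data automatically; `label` is ignored by the model because
# of its id role, so we can bind it back alongside the predictions.
test_pred <- predict(logit_fit, testing(df_split)) %>%
  bind_cols(testing(df_split) %>% select(label, y))
```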