Error when predicting on the test set with a tidymodels workflow: "Missing data in columns"

Kim*_*m.L 0 r machine-learning feature-engineering tidymodels

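I recently started learning tidymodels to build machine learning workflows. When I use a workflow to predict on the test set, it raises the error "Missing data in columns", even though I am sure that neither the training set nor the test set contains missing data. Here is my code and an example: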

    library(tidymodels)

    # Information about the data: Primary_type in the test set has several novel levels
    str(train_sample)
    #> tibble [500,000 x 3] (S3: tbl_df/tbl/data.frame)
    #>  $ ID          : num [1:500000] 6590508 2902772 6162081 7777470 7134849 ...
    #>  $ Primary_type: Factor w/ 29 levels "ARSON","ASSAULT",..: 16 8 3 3 28 7 3 4 25 15 ...
    #>  $ Arrest      : Factor w/ 2 levels "FALSE","TRUE": 2 1 1 1 1 2 1 1 1 1 ...

    str(test_sample)
    #> tibble [300,000 x 3] (S3: tbl_df/tbl/data.frame)
    #>  $ ID          : num [1:300000] 8876633 9868538 9210518 9279377 8707153 ...
    #>  $ Primary_type: Factor w/ 32 levels "ARSON","ASSAULT",..: 3 7 31 7 2 8 7 2 31 18 ...
    #>  $ Arrest      : Factor w/ 2 levels "FALSE","TRUE": 1 1 1 1 2 1 1 1 2 2 ...

    # set the recipe
    rec <- recipe(Arrest ~ ., data = train_sample) %>%
      update_role(ID, new_role = "ID") %>%
      step_novel(Primary_type)

    # set the model
    rf_model <- rand_forest(trees = 10) %>%
      set_engine("ranger", seed = 100, num.threads = 12, verbose = TRUE) %>%
      set_mode("classification")

    # set the workflow
    wf <- workflow() %>%
      add_recipe(rec) %>%
      add_model(rf_model)

    # fit the train data
    wf_fit <- wf %>% fit(train_sample)

    # predict the test data
    wf_pred <- wf_fit %>% predict(test_sample)

The prediction raises the following error:

    ERROR:Missing data in columns: Primary_type.
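Because the str() output shows 29 levels in training and 32 in testing, it can help to list exactly which levels are new and to confirm that the raw column holds no NA values. The following is a minimal sketch in base R; train_toy and test_toy are small hypothetical stand-ins for train_sample and test_sample, not the real data.

```r
# Hypothetical toy data standing in for train_sample / test_sample
train_toy <- data.frame(
  Primary_type = factor(c("ARSON", "ASSAULT", "THEFT"))
)
test_toy <- data.frame(
  Primary_type = factor(c("ASSAULT", "STALKING", "THEFT"))
)

# Levels that occur in the test set but were never seen in training
novel_levels <- setdiff(
  levels(test_toy$Primary_type),
  levels(train_toy$Primary_type)
)
novel_levels
#> [1] "STALKING"

# Count NA values per column in the raw test data (all zeros here)
colSums(is.na(test_toy))
```

If the NA count is zero in the raw data, any missing values reported at prediction time must be introduced somewhere between the raw data and the model engine.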

However, when I run the steps separately with prep() and bake() instead of using a workflow, the prediction does not raise an error:

    # run the preprocessing steps separately
    train_prep <- prep(rec, training = train_sample)
    train_bake <- bake(train_prep, new_data = NULL)
    test_bake <- bake(train_prep, new_data = test_sample)

    # fit the baked train data
    rf_model_fit <- rf_model %>% fit(Arrest ~ Primary_type, train_bake)

    # predict the baked test data
    rf_model_pred <- rf_model_fit %>% predict(test_bake) # No missing data error

I found that Primary_type has the same levels in both baked data sets, which means step_novel() works.

    # compare the levels between baked data sets
    identical(levels(train_bake$Primary_type), levels(test_bake$Primary_type))
    #> [1] TRUE
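The level comparison above can be complemented by checking where the unseen values actually end up after baking. This is a small sketch on hypothetical toy data (train_toy, test_toy, and the other *_toy names are illustrative only). It relies on the default new_level = "new" of step_novel().

```r
library(recipes)

# Hypothetical toy data; "STALKING" appears only in the test set
train_toy <- data.frame(
  Primary_type = factor(c("ARSON", "ASSAULT", "THEFT", "ASSAULT")),
  Arrest = factor(c("TRUE", "FALSE", "FALSE", "TRUE"))
)
test_toy <- data.frame(
  Primary_type = factor(c("ASSAULT", "STALKING")),
  Arrest = factor(c("FALSE", "TRUE"))
)

rec_toy <- recipe(Arrest ~ Primary_type, data = train_toy) %>%
  step_novel(Primary_type)

prep_toy <- prep(rec_toy, training = train_toy)
test_baked <- bake(prep_toy, new_data = test_toy)

# Training levels plus the extra "new" level added by step_novel()
levels(test_baked$Primary_type)

# Unseen values are recoded to "new" rather than becoming NA
table(test_baked$Primary_type, useNA = "ifany")
anyNA(test_baked$Primary_type)
```

In this sketch the "STALKING" row is recoded to "new", so the baked column contains no NA values when prep() and bake() are called directly.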

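So why does prediction fail inside the workflow but succeed when the steps are run separately? Where does the missing data come from? Thanks a lot.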


Jul*_*lge 6

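I'd recommend checking the advice on "Ordering of steps", especially the section on handling levels in categorical data. You should use step_novel() before other factor handling operations.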
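As an illustration of that ordering, here is a minimal sketch of a recipe in which step_novel() comes before another factor-handling step. The choice of step_dummy() as the later step and the toy data are assumptions for demonstration, not part of the original question.

```r
library(recipes)

# Hypothetical toy data; "STALKING" appears only in the test set
train_toy <- data.frame(
  Primary_type = factor(c("ARSON", "ASSAULT", "THEFT", "ASSAULT")),
  Arrest = factor(c("TRUE", "FALSE", "FALSE", "TRUE"))
)
test_toy <- data.frame(
  Primary_type = factor(c("ASSAULT", "STALKING")),
  Arrest = factor(c("FALSE", "TRUE"))
)

rec_toy <- recipe(Arrest ~ Primary_type, data = train_toy) %>%
  step_novel(Primary_type) %>%  # first: reserve a "new" level for unseen values
  step_dummy(Primary_type)      # then: other factor handling, here dummy encoding

prep_toy <- prep(rec_toy, training = train_toy)
baked_toy <- bake(prep_toy, new_data = test_toy)

# The unseen "STALKING" row is encoded in the Primary_type_new indicator column
names(baked_toy)
baked_toy$Primary_type_new
```

Because the "new" level already exists when step_dummy() is prepped, the dummy encoding includes an indicator column for it, and unseen test values get a regular encoding instead of missing values.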