The dataset can be found here: https://www.kaggle.com/mlg-ulb/creditcardfraud
I am trying to use tidymodels to run ranger with 5 fold cross validation on this dataset.
I have two code blocks. The first is the original code run on the full data. The second is almost identical, except that I subset a portion of the data so the code runs faster; it exists only to make sure my code works before I run it on the full dataset.
Here is the first code block with the full data:
#load packages
library(tidyverse)
library(tidymodels)
library(tune)
library(workflows)
#load data
df <- read.csv("~/creditcard.csv")
#check for NAs and convert Class to factor
anyNA(df)
df$Class <- as.factor(df$Class)
#set seed and split data into training and testing
set.seed(123)
df_split <- initial_split(df)
df_train <- training(df_split)
df_test <- testing(df_split)
#in the training and testing datasets, how many are fraudulent transactions?
df_train %>% count(Class)
df_test %>% count(Class)
#ranger model with 5-fold cross validation
rf_spec <-
  rand_forest() %>%
  set_engine("ranger", importance = "impurity") %>%
  set_mode("classification")

all_wf <-
  workflow() %>%
  add_formula(Class ~ .) %>%
  add_model(rf_spec)

cv_folds <- vfold_cv(df_train, v = 5)
cv_folds

rf_results <-
  all_wf %>%
  fit_resamples(resamples = cv_folds)

rf_results %>%
  collect_metrics()
Here is the second code block with 1,000 rows:
#load packages
library(tidyverse)
library(tidymodels)
library(tune)
library(workflows)
#load data
df <- read.csv("~/creditcard.csv")
###################################################################################
#Testing area#
df <- df %>% arrange(-Class) %>% head(1000)
###################################################################################
#check for NAs and convert Class to factor
anyNA(df)
df$Class <- as.factor(df$Class)
#set seed and split data into training and testing
set.seed(123)
df_split <- initial_split(df)
df_train <- training(df_split)
df_test <- testing(df_split)
#in the training and testing datasets, how many are fraudulent transactions?
df_train %>% count(Class)
df_test %>% count(Class)
#ranger model with 5-fold cross validation
rf_spec <-
  rand_forest() %>%
  set_engine("ranger", importance = "impurity") %>%
  set_mode("classification")

all_wf <-
  workflow() %>%
  add_formula(Class ~ .) %>%
  add_model(rf_spec)

cv_folds <- vfold_cv(df_train, v = 5)
cv_folds

rf_results <-
  all_wf %>%
  fit_resamples(resamples = cv_folds)

rf_results %>%
  collect_metrics()
1) With the first code block, I can assign and print cv_folds in the console. The Global Environment pane says cv_folds has 5 obs. of 2 variables. When I View(cv_folds), I see columns labeled splits and id, but there are no rows and no data. When I use str(cv_folds), R shows the blank "thinking" line, but there is no red STOP icon I can press; the only thing I can do is force quit RStudio. Maybe I just need to wait longer? I am not sure. When I do the same thing with the smaller second code block, str() works fine.
2) My overall goal for this project is to split the dataset into training and testing sets. Then partition the training data with 5 fold cross validation and train a ranger model on it. Next, I want to examine the metrics of my model on the training data. Then I want to test my model on the testing set and view the metrics. Eventually, I want to swap out ranger for something like xgboost. Please give me advice on what parts of my code I can add/modify to improve. I am still missing the portion of testing my model on the testing set.
I think the Predictions portion of this article might be what I'm aiming for.
https://rviews.rstudio.com/2019/06/19/a-gentle-intro-to-tidymodels/
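For the missing test-set step, tune provides last_fit(), which fits a workflow on the training portion of a split and evaluates it once on the held-out test portion. A minimal sketch, assuming the all_wf workflow and df_split object defined in the first code block above:

```r
# Fit on the training set and evaluate on the test set in one call
final_res <- last_fit(all_wf, split = df_split)

# Test-set metrics (accuracy and ROC AUC by default)
collect_metrics(final_res)

# Row-level predictions on the test set
collect_predictions(final_res)
```

This keeps the test set untouched until the very end, which is the pattern the linked article's "Predictions" section is aiming for.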
3) When I use rf_results %>% collect_metrics(), it only shows accuracy and roc_auc. How do I get sensitivity, specificity, precision, and recall?
4) How do I plot importance? Would I use something like this?
rf_fit <- get_tree_fit(all_wf)
vip::vip(rf_fit, geom = "point")
5) How can I drastically reduce the amount of time for the model to train? Last time I ran ranger with 5 fold cross validation using caret on this dataset, it took 8+ hours (6 core, 4.0 ghz, 16gb RAM, SSD, gtx 1060). I am open to anything (ie. restructure code, AWS computing, parallelization, etc.)
Edit: This is another way I have tried to set this up
#ranger model with 5-fold cross validation
rf_recipe <- recipe(Class ~ ., data = df_train)

rf_engine <-
  rand_forest(mtry = tune(), trees = tune(), min_n = tune()) %>%
  set_engine("ranger", importance = "impurity") %>%
  set_mode("classification")

rf_grid <- grid_random(
  mtry() %>% range_set(c(1, 20)),
  trees() %>% range_set(c(500, 1000)),
  min_n() %>% range_set(c(2, 10)),
  size = 30)

all_wf <-
  workflow() %>%
  add_recipe(rf_recipe) %>%
  add_model(rf_engine)

cv_folds <- vfold_cv(df_train, v = 5)
cv_folds

#####
rf_fit <- tune_grid(
  all_wf,
  resamples = cv_folds,
  grid = rf_grid,
  metrics = metric_set(roc_auc),
  control = control_grid(save_pred = TRUE)
)

collect_metrics(rf_fit)
rf_fit_best <- select_best(rf_fit)
(wf_rf_best <- finalize_workflow(all_wf, rf_fit_best))
I started from your last block of code and made a few edits to get a functional workflow. I answer your questions alongside the code. I also took the liberty of giving you some advice and reformatting your code.
## Packages, seed and data
library(tidyverse)
library(tidymodels)
set.seed(123)
df <- read_csv("creditcard.csv")
df <-
  df %>%
  arrange(-Class) %>%
  head(1000) %>%
  mutate(Class = as_factor(Class))
## Modelling
# Initial split
df_split <- initial_split(df)
df_train <- training(df_split)
df_test <- testing(df_split)
You can see that df_split returns <750/250/1000> (more on this notation below).
2) To tune an xgboost model instead, very little needs to change.
# Models
model_rf <-
  rand_forest(mtry = tune(), trees = tune(), min_n = tune()) %>%
  set_engine("ranger", importance = "impurity") %>%
  set_mode("classification")

model_xgboost <-
  boost_tree(mtry = tune(), trees = tune(), min_n = tune()) %>%
  set_engine("xgboost") %>%
  set_mode("classification")
Here you choose your grid of hyperparameters. I suggest a non-random grid so the hyperparameter space is covered as evenly as possible.
# Grid of hyperparameters
grid_rf <-
  grid_max_entropy(
    mtry(range = c(1, 20)),
    trees(range = c(500, 1000)),
    min_n(range = c(2, 10)),
    size = 30)
As you can see, these are your workflows; almost nothing changes between the two models.
# Workflow
wkfl_rf <-
  workflow() %>%
  add_formula(Class ~ .) %>%
  add_model(model_rf)

wkfl_xgboost <-
  workflow() %>%
  add_formula(Class ~ .) %>%
  add_model(model_xgboost)
1) <600/150/750> means you have 600 observations in the analysis set, 150 in the assessment set, and 750 in the original dataset. Note that here 600 + 150 = 750, but that is not always the case (for example with bootstrap resampling).
# Cross validation method
cv_folds <- vfold_cv(df_train, v = 5)
cv_folds
3) Here you choose which metrics to collect during tuning, using the yardstick package.
# Choose metrics
my_metrics <- metric_set(roc_auc, accuracy, sens, spec, precision, recall)
Then you can fit the different models across the grid. For the control argument, I would not save the predictions, and I would print progress (imho).
# Tuning
rf_fit <- tune_grid(
  wkfl_rf,
  resamples = cv_folds,
  grid = grid_rf,
  metrics = my_metrics,
  control = control_grid(verbose = TRUE) # don't save predictions (imho)
)
These are some useful functions for working with the rf_fit object.
# Inspect tuning
rf_fit
collect_metrics(rf_fit)
autoplot(rf_fit, metric = "accuracy")
show_best(rf_fit, metric = "accuracy")
select_best(rf_fit, metric = "accuracy")
Finally, you can fit the model with the best hyperparameters.
# Fit best model
tuned_model <-
  wkfl_rf %>%
  finalize_workflow(select_best(rf_fit, metric = "accuracy")) %>%
  fit(data = df_train)

predict(tuned_model, df_train)
predict(tuned_model, df_test)
4) Unfortunately, methods written for randomForest objects usually cannot be applied to parsnip output.
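That said, because the ranger engine was fit with importance = "impurity", you can usually pull the underlying parsnip fit out of the fitted workflow and hand it to vip. A sketch, assuming tuned_model is the fitted workflow from above (in older versions of workflows the extractor is pull_workflow_fit() rather than extract_fit_parsnip()):

```r
library(vip)

# Extract the parsnip model fit from the fitted workflow,
# then plot the impurity-based variable importance
tuned_model %>%
  extract_fit_parsnip() %>%
  vip(geom = "point")
```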
5) You can look at the tune package's vignette on parallel processing.
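In short: tune_grid() and fit_resamples() automatically use a registered foreach backend, so registering one before tuning is often the single biggest speed-up. A sketch using doParallel, assuming a multi-core machine and the wkfl_rf, cv_folds, grid_rf and my_metrics objects from above:

```r
library(doParallel)

# Register one worker per physical core, leaving one free for the OS
cl <- makePSOCKcluster(parallel::detectCores(logical = FALSE) - 1)
registerDoParallel(cl)

# tune_grid() picks up the registered backend automatically
rf_fit <- tune_grid(
  wkfl_rf,
  resamples = cv_folds,
  grid = grid_rf,
  metrics = my_metrics
)

stopCluster(cl)
```

Beyond parallelism, a smaller tuning grid and fewer trees (ranger's default of 500 is usually plenty) will cut the runtime further; note that each worker needs its own copy of the data, so watch your 16 GB of RAM with a dataset this size.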