RNB*_*RNB 1 bootstrapping r broom tidyverse modelr
我有一个长数据集,它由多个插补(假设10个插补)产生的几个数据集组成。它们具有标识插补的id变量。我想在每个这些估算数据集上引导10个数据集。引导程序完成后,我要在每个模型上运行模型(100个插补引导程序组合)。
在此示例中,我不确定是使用该broom::bootstrap()功能还是该modelr::bootstrap()功能。此外,分组似乎在我的管道中丢失了。
这是使用mtcars数据集的可重现示例:
library(tidyverse)
library(broom)
cars <- mtcars %>%
mutate(am = as.factor(am)) %>% # This is standing in for my imputation id variable
group_by(am)
Source: local data frame [32 x 11]
Groups: am [2]
# A tibble: 32 x 11
mpg cyl disp hp drat wt qsec vs am gear carb
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <fctr> <dbl> <dbl>
1 21.0 6 160.0 110 3.90 2.620 16.46 0 1 4 4
2 21.0 6 160.0 110 3.90 2.875 17.02 0 1 4 4
3 22.8 4 108.0 93 3.85 2.320 18.61 1 1 4 1
4 21.4 6 258.0 110 3.08 3.215 19.44 1 0 3 1
5 18.7 8 360.0 175 3.15 3.440 17.02 0 0 3 2
Run Code Online (Sandbox Code Playgroud)
如您所见,输出当前显示出应该有两个组。在我的数据集中,每个估算数据集将显示10个。现在:
cars2 <- cars %>%
broom::bootstrap(10, by_group = TRUE)
cars2
Source: local data frame [32 x 11]
Groups: replicate [10]
# A tibble: 32 x 11
mpg cyl disp hp drat wt qsec vs am gear carb
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <fctr> <dbl> <dbl>
1 21.0 6 160.0 110 3.90 2.620 16.46 0 1 4 4
2 21.0 6 160.0 110 3.90 2.875 17.02 0 1 4 4
3 22.8 4 108.0 93 3.85 2.320 18.61 1 1 4 1
4 21.4 6 258.0 110 3.08 3.215 19.44 1 0 3 1
Run Code Online (Sandbox Code Playgroud)
现在看起来好像只有10个组代表每个副本。它似乎没有保留先前的分组。在这一点上,我希望共有20个群组(2 x 10)。
如果我现在这样做:
cars3 <- cars2 %>%
group_by(am)
cars3
Source: local data frame [32 x 11]
Groups: am [2]
# A tibble: 32 x 11
mpg cyl disp hp drat wt qsec vs am gear carb
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <fctr> <dbl> <dbl>
1 21.0 6 160.0 110 3.90 2.620 16.46 0 1 4 4
2 21.0 6 160.0 110 3.90 2.875 17.02 0 1 4 4
3 22.8 4 108.0 93 3.85 2.320 18.61 1 1 4 1
4 21.4 6 258.0 110 3.08 3.215 19.44 1 0 3 1
Run Code Online (Sandbox Code Playgroud)
现在看来,没有重复项仅针对am。
在对原始数据集进行分组之后,是否仍然需要进行引导。同样,理想情况下,在我引导之后,应该有一个ID指示我正在查看哪个引导数据集。
在理想的世界中,我的代码应该能够执行以下操作:
cars <- mtcars %>%
mutate(am = as.factor(am)) %>%
group_by(am) %>%
bootstrap(10, by_group = TRUE) %>%
nest() %>% # create a condensed tidy dataset that has one row per imputation, bootstrap combo
mutate(model = map(data, ~lm(mpg~, data = .)) # Create a model for each row
Run Code Online (Sandbox Code Playgroud)
我正努力学习两者modelr,purrr而它们确实让我头疼。我想我终于想通了。
library(modelr)
library(dplyr)
library(tidyr)
library(broom)
Run Code Online (Sandbox Code Playgroud)
mtcars %>% group_by(am) %>%
do(rs = modelr::bootstrap(., 10))
Run Code Online (Sandbox Code Playgroud)
Run Code Online (Sandbox Code Playgroud)Source: local data frame [2 x 2] Groups: <by row> # A tibble: 2 x 2 am rs * <dbl> <list> 1 0 <tibble [10 x 2]> 2 1 <tibble [10 x 2]>
mtcars %>% group_by(am) %>%
do(rs = modelr::bootstrap(., 10)) %>%
group_by(am) %>%
unnest
Run Code Online (Sandbox Code Playgroud)
Run Code Online (Sandbox Code Playgroud)# A tibble: 20 x 3 # Groups: am [2] am strap .id <dbl> <list> <chr> 1 0 <S3: resample> 01 2 0 <S3: resample> 02 3 0 <S3: resample> 03 4 0 <S3: resample> 04 5 0 <S3: resample> 05 6 0 <S3: resample> 06 7 0 <S3: resample> 07 8 0 <S3: resample> 08 9 0 <S3: resample> 09 10 0 <S3: resample> 10 11 1 <S3: resample> 01 12 1 <S3: resample> 02 13 1 <S3: resample> 03 14 1 <S3: resample> 04 15 1 <S3: resample> 05 16 1 <S3: resample> 06 17 1 <S3: resample> 07 18 1 <S3: resample> 08 19 1 <S3: resample> 09 20 1 <S3: resample> 10
您必须as.data.frame在“ 带子”列上使用才能将其重新扩展为可用数据。请参阅?resample。这使我永远想不通。它应该只是工作一样tidyr::unnest。
mtcars %>% group_by(am) %>%
do(rs = modelr::bootstrap(., 10)) %>%
group_by(am) %>%
unnest %>%
group_by(am, .id) %>%
do(model = lm(mpg~wt, data = as.data.frame(.$strap)))
Run Code Online (Sandbox Code Playgroud)
Run Code Online (Sandbox Code Playgroud)Source: local data frame [20 x 3] Groups: <by row> # A tibble: 20 x 3 am .id model * <dbl> <chr> <list> 1 0 01 <S3: lm> 2 0 02 <S3: lm> 3 0 03 <S3: lm> 4 0 04 <S3: lm> 5 0 05 <S3: lm> 6 0 06 <S3: lm> 7 0 07 <S3: lm> 8 0 08 <S3: lm> 9 0 09 <S3: lm> 10 0 10 <S3: lm> 11 1 01 <S3: lm> 12 1 02 <S3: lm> 13 1 03 <S3: lm> 14 1 04 <S3: lm> 15 1 05 <S3: lm> 16 1 06 <S3: lm> 17 1 07 <S3: lm> 18 1 08 <S3: lm> 19 1 09 <S3: lm> 20 1 10 <S3: lm>
mtcars %>% group_by(am) %>%
do(rs = modelr::bootstrap(., 10)) %>%
group_by(am) %>%
unnest %>%
group_by(am, .id) %>%
do(model = lm(mpg~wt, data = as.data.frame(.$strap))) %>%
tidy(model)
Run Code Online (Sandbox Code Playgroud)
Run Code Online (Sandbox Code Playgroud)# A tibble: 40 x 7 # Groups: am, .id [20] am .id term estimate std.error statistic p.value <dbl> <chr> <chr> <dbl> <dbl> <dbl> <dbl> 1 0 01 (Intercept) 25.800592 2.1055145 12.253818 7.300379e-10 2 0 01 wt -2.608827 0.5377694 -4.851201 1.497729e-04 3 0 02 (Intercept) 37.012664 4.7369213 7.813654 5.023424e-07 4 0 02 wt -5.272094 1.2884870 -4.091693 7.602571e-04 5 0 03 (Intercept) 26.145563 2.2114832 11.822637 1.263234e-09 6 0 03 wt -2.428845 0.5541412 -4.383080 4.056524e-04 7 0 04 (Intercept) 31.502481 4.0753463 7.730013 5.806324e-07 8 0 04 wt -3.584863 1.1510368 -3.114464 6.305972e-03 9 0 05 (Intercept) 31.739921 2.2216473 14.286661 6.690920e-11 10 0 05 wt -3.716515 0.5627808 -6.603841 4.471168e-06 # ... with 30 more rows
请注意,我将引导程序的数量增加到1000,大约需要10秒。
mtcars %>% group_by(am) %>%
do(rs = modelr::bootstrap(., 1000)) %>%
group_by(am) %>%
unnest %>%
group_by(am, .id) %>%
do(model = lm(mpg~wt, data = as.data.frame(.$strap))) %>%
tidy(model) %>%
ggplot(aes(estimate)) + geom_histogram() +
facet_grid(am ~ term, scales = "free_x", labeller = label_both)
Run Code Online (Sandbox Code Playgroud)