我想使用dplyr为每小时(因子变量)拟合一个模型,我得到一个错误,我不太确定什么是错的.
df.h <- data.frame(
hour = factor(rep(1:24, each = 21)),
price = runif(504, min = -10, max = 125),
wind = runif(504, min = 0, max = 2500),
temp = runif(504, min = - 10, max = 25)
)
df.h <- tbl_df(df.h)
df.h <- group_by(df.h, hour)
group_size(df.h) # checks out, 21 obs. for each factor variable
# different attempts:
reg.models <- do(df.h, formula = price ~ wind + temp)
reg.models <- do(df.h, .f = lm(price ~ wind + temp, data = df.h))
Run Code Online (Sandbox Code Playgroud)
我尝试了各种各样的变化,但我无法让它发挥作用.
tch*_*rty 85
最简单的方法是在2015年5月左右使用broom.broom包含三个函数,用于处理来自组的统计操作的复杂返回对象:( tidy处理来自组的统计操作的系数向量),glance(处理来自组的统计操作的汇总统计),以及augment(处理来自按组进行的统计操作).
以下是用于将各组的线性回归的各种结果提取到整理中data_frame的示例.
tidy:
library(dplyr)
library(broom)
df.h = data.frame(
hour = factor(rep(1:24, each = 21)),
price = runif(504, min = -10, max = 125),
wind = runif(504, min = 0, max = 2500),
temp = runif(504, min = - 10, max = 25)
)
dfHour = df.h %>% group_by(hour) %>%
do(fitHour = lm(price ~ wind + temp, data = .))
# get the coefficients by group in a tidy data_frame
dfHourCoef = tidy(dfHour, fitHour)
dfHourCoef
Run Code Online (Sandbox Code Playgroud)
这使,
Source: local data frame [72 x 6]
Groups: hour
hour term estimate std.error statistic p.value
1 1 (Intercept) 53.336069324 21.33190104 2.5002961 0.022294293
2 1 wind -0.008475175 0.01338668 -0.6331053 0.534626575
3 1 temp 1.180019541 0.79178607 1.4903262 0.153453756
4 2 (Intercept) 77.737788772 23.52048754 3.3051096 0.003936651
5 2 wind -0.008437212 0.01432521 -0.5889765 0.563196358
6 2 temp -0.731265113 1.00109489 -0.7304653 0.474506855
7 3 (Intercept) 38.292039924 17.55361626 2.1814331 0.042655670
8 3 wind 0.005422492 0.01407478 0.3852630 0.704557388
9 3 temp 0.426765270 0.83672863 0.5100402 0.616220435
10 4 (Intercept) 30.603119492 21.05059583 1.4537888 0.163219027
.. ... ... ... ... ... ...
Run Code Online (Sandbox Code Playgroud)augment:
# get the predictions by group in a tidy data_frame
dfHourPred = augment(dfHour, fitHour)
dfHourPred
Run Code Online (Sandbox Code Playgroud)
这使,
Source: local data frame [504 x 11]
Groups: hour
hour price wind temp .fitted .se.fit .resid .hat .sigma .cooksd .std.resid
1 1 83.8414055 67.3780 -6.199231 45.44982 22.42649 38.391590 0.27955950 42.24400 0.1470891067 1.0663820
2 1 0.3061628 2073.7540 15.134085 53.61916 14.10041 -53.312993 0.11051343 41.43590 0.0735584714 -1.3327207
3 1 80.3790032 520.5949 24.711938 78.08451 20.03558 2.294497 0.22312869 43.64059 0.0003606305 0.0613746
4 1 121.9023855 1618.0864 12.382588 54.23420 10.31293 67.668187 0.05911743 40.23212 0.0566557575 1.6447224
5 1 -0.4039594 1542.8150 -5.544927 33.71732 14.53349 -34.121278 0.11740628 42.74697 0.0325125137 -0.8562896
6 1 29.8269832 396.6951 6.134694 57.21307 16.04995 -27.386085 0.14318542 43.05124 0.0271028701 -0.6975290
7 1 -7.1865483 2009.9552 -5.657871 29.62495 16.93769 -36.811497 0.15946292 42.54487 0.0566686969 -0.9466312
8 1 -7.8548693 2447.7092 22.043029 58.60251 19.94686 -66.457379 0.22115706 39.63999 0.2983443034 -1.7753911
9 1 94.8736726 1525.3144 24.484066 69.30044 15.93352 25.573234 0.14111563 43.12898 0.0231796755 0.6505701
10 1 54.4643001 2473.2234 -7.656520 23.34022 21.83043 31.124076 0.26489650 42.74790 0.0879837510 0.8558507
.. ... ... ... ... ... ... ... ... ... ... ...
Run Code Online (Sandbox Code Playgroud)glance:
# get the summary statistics by group in a tidy data_frame
dfHourSumm = glance(dfHour, fitHour)
dfHourSumm
Run Code Online (Sandbox Code Playgroud)
这使,
Source: local data frame [24 x 12]
Groups: hour
hour r.squared adj.r.squared sigma statistic p.value df logLik AIC BIC deviance df.residual
1 1 0.12364561 0.02627290 42.41546 1.2698179 0.30487225 3 -106.8769 221.7538 225.9319 32383.29 18
2 2 0.03506944 -0.07214506 36.79189 0.3270961 0.72521125 3 -103.8900 215.7799 219.9580 24365.58 18
3 3 0.02805424 -0.07993974 39.33621 0.2597760 0.77406651 3 -105.2942 218.5884 222.7665 27852.07 18
4 4 0.17640603 0.08489559 41.37115 1.9277147 0.17434859 3 -106.3534 220.7068 224.8849 30808.30 18
5 5 0.12575453 0.02861615 42.27865 1.2945915 0.29833246 3 -106.8091 221.6181 225.7962 32174.72 18
6 6 0.08114417 -0.02095092 35.80062 0.7947901 0.46690268 3 -103.3164 214.6328 218.8109 23070.31 18
7 7 0.21339168 0.12599076 32.77309 2.4415266 0.11529934 3 -101.4609 210.9218 215.0999 19333.36 18
8 8 0.21655629 0.12950699 40.92788 2.4877430 0.11119114 3 -106.1272 220.2543 224.4324 30151.65 18
9 9 0.23388711 0.14876346 35.48431 2.7476160 0.09091487 3 -103.1300 214.2601 218.4381 22664.45 18
10 10 0.18326177 0.09251307 40.77241 2.0194425 0.16171339 3 -106.0472 220.0945 224.2726 29923.01 18
.. ... ... ... ... ... ... .. ... ... ... ... ...
Run Code Online (Sandbox Code Playgroud)had*_*ley 26
在dplyr 0.4中,你可以这样做:
df.h %>% do(model = lm(price ~ wind + temp, data = .))
Run Code Online (Sandbox Code Playgroud)
lok*_*oki 14
到 2020 年年中,tchakravarty 的回答将失败。为了规避broom和dpylr似乎交互的新方法broom::tidy,可以使用broom::augment和的以下组合broom::glance。我们只需要在内部使用它们,do()然后再unnest()使用 tibble。
library(dplyr)
library(broom)
df.h = data.frame(
hour = factor(rep(1:24, each = 21)),
price = runif(504, min = -10, max = 125),
wind = runif(504, min = 0, max = 2500),
temp = runif(504, min = - 10, max = 25)
)
df.h %>% group_by(hour) %>%
do(fitHour = tidy(lm(price ~ wind + temp, data = .))) %>%
unnest(fitHour)
# # A tibble: 72 x 6
# hour term estimate std.error statistic p.value
# <fct> <chr> <dbl> <dbl> <dbl> <dbl>
# 1 1 (Intercept) 82.4 18.1 4.55 0.000248
# 2 1 wind -0.0212 0.0108 -1.96 0.0655
# 3 1 temp -1.01 0.792 -1.28 0.218
# 4 2 (Intercept) 25.9 19.7 1.31 0.206
# 5 2 wind 0.0204 0.0131 1.57 0.135
# 6 2 temp 0.680 1.01 0.670 0.511
# 7 3 (Intercept) 88.3 15.5 5.69 0.0000214
# 8 3 wind -0.0188 0.00998 -1.89 0.0754
# 9 3 temp -0.669 0.653 -1.02 0.319
# 10 4 (Intercept) 73.4 14.2 5.17 0.0000639
df.h %>% group_by(hour) %>%
do(fitHour = augment(lm(price ~ wind + temp, data = .))) %>%
unnest(fitHour)
# # A tibble: 24 x 13
# hour r.squared adj.r.squared sigma statistic p.value df logLik AIC BIC deviance
# <fct> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
# 1 1 0.246 0.162 39.0 2.93 0.0790 2 -105. 218. 222. 27334.
# 2 2 0.161 0.0674 43.5 1.72 0.207 2 -107. 223. 227. 34029.
# 3 3 0.192 0.102 33.9 2.14 0.147 2 -102. 212. 217. 20739.
# 4 4 0.0960 -0.00445 34.3 0.956 0.403 2 -102. 213. 217. 21169.
# 5 5 0.230 0.144 31.7 2.68 0.0955 2 -101. 210. 214. 18088.
# 6 6 0.0190 -0.0900 39.8 0.174 0.842 2 -106. 219. 223. 28507.
# 7 7 0.0129 -0.0967 37.1 0.118 0.889 2 -104. 216. 220. 24801.
# 8 8 0.197 0.108 35.3 2.21 0.139 2 -103. 214. 218. 22438.
# 9 9 0.0429 -0.0634 39.4 0.403 0.674 2 -105. 219. 223. 27918.
# 10 10 0.0943 -0.00633 35.6 0.937 0.410 2 -103. 214. 219. 22854.
# # … with 14 more rows, and 2 more variables: df.residual <int>, nobs <int>
df.h %>% group_by(hour) %>%
do(fitHour = glance(lm(price ~ wind + temp, data = .))) %>%
unnest(fitHour)
# # A tibble: 504 x 10
# hour price wind temp .fitted .resid .std.resid .hat .sigma .cooksd
# <fct> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
# 1 1 94.2 883. -6.64 70.4 23.7 0.652 0.129 39.6 0.0209
# 2 1 19.3 2107. 2.40 35.4 -16.0 -0.431 0.0864 39.9 0.00584
# 3 1 60.5 2161. 18.3 18.1 42.5 1.18 0.146 38.5 0.0795
# 4 1 116. 1244. 12.0 44.0 71.9 1.91 0.0690 35.8 0.0902
# 5 1 117. 1624. -8.05 56.1 60.6 1.67 0.128 36.9 0.136
# 6 1 75.0 220. -0.838 78.6 -3.58 -0.101 0.175 40.1 0.000724
# 7 1 106. 765. 6.15 60.0 45.7 1.22 0.0845 38.4 0.0461
# 8 1 -9.89 2055. 12.3 26.5 -36.4 -0.979 0.0909 39.0 0.0319
# 9 1 96.1 215. -8.36 86.3 9.82 0.287 0.232 40.0 0.00830
# 10 1 27.2 323. 22.4 52.9 -25.7 -0.777 0.278 39.4 0.0774
# # … with 494 more rows
Run Code Online (Sandbox Code Playgroud)
感谢Bob Muenchen 的博客对此的启发。
来自以下文件do:
.f:适用于每件作品的功能.提供给.f的第一个未命名参数将是一个数据框.
所以:
reg.models <- do(df.h,
.f=function(data){
lm(price ~ wind + temp, data=data)
})
Run Code Online (Sandbox Code Playgroud)
可能还可以节省模型适合的时间:
reg.models <- do(df.h,
.f=function(data){
m <- lm(price ~ wind + temp, data=data)
m$hour <- unique(data$hour)
m
})
Run Code Online (Sandbox Code Playgroud)
小智 9
我相信有一个比洛基的答案更紧凑的答案,它放弃了“sincereplaced/ superseded ” do():
library(dplyr)\nlibrary(broom)\nlibrary(tidyr)\n\nh.lm <- df.h %>%\n nest_by(hour) %>%\n mutate(fitHour = list(lm(price ~ wind + temp, data = data))) %>%\n summarise(tidy_out = list(tidy(fitHour)),\n glance_out = list(glance(fitHour)),\n augment_out = list(augment(fitHour))) %>%\n ungroup()\n\nh.lm\n# # A tibble: 24 x 4\n# hour tidy_out glance_out augment_out\n# <fct> <list> <list> <list>\n# 1 1 <tibble [3 \xc3\x97 5]> <tibble [1 \xc3\x97 12]> <tibble [21 \xc3\x97 9]>\n# 2 2 <tibble [3 \xc3\x97 5]> <tibble [1 \xc3\x97 12]> <tibble [21 \xc3\x97 9]>\n# 3 3 <tibble [3 \xc3\x97 5]> <tibble [1 \xc3\x97 12]> <tibble [21 \xc3\x97 9]>\n# 4 4 <tibble [3 \xc3\x97 5]> <tibble [1 \xc3\x97 12]> <tibble [21 \xc3\x97 9]>\n# 5 5 <tibble [3 \xc3\x97 5]> <tibble [1 \xc3\x97 12]> <tibble [21 \xc3\x97 9]>\n# 6 6 <tibble [3 \xc3\x97 5]> <tibble [1 \xc3\x97 12]> <tibble [21 \xc3\x97 9]>\n# 7 7 <tibble [3 \xc3\x97 5]> <tibble [1 \xc3\x97 12]> <tibble [21 \xc3\x97 9]>\n# 8 8 <tibble [3 \xc3\x97 5]> <tibble [1 \xc3\x97 12]> <tibble [21 \xc3\x97 9]>\n# 9 9 <tibble [3 \xc3\x97 5]> <tibble [1 \xc3\x97 12]> <tibble [21 \xc3\x97 9]>\n# 10 10 <tibble [3 \xc3\x97 5]> <tibble [1 \xc3\x97 12]> <tibble [21 \xc3\x97 9]>\n# # \xe2\x80\xa6 with 14 more rows\nRun Code Online (Sandbox Code Playgroud)\n与他们的答案类似,为了访问,只需取消嵌套所需的任何组件即可:
\nunnest(select(h.lm, hour, tidy_out))\n# # A tibble: 72 x 6\n# hour term estimate std.error statistic p.value\n# <fct> <chr> <dbl> <dbl> <dbl> <dbl>\n# 1 1 (Intercept) 63.2 20.9 3.02 0.00728\n# 2 1 wind -0.00237 0.0139 -0.171 0.866\n# 3 1 temp -0.266 0.950 -0.280 0.783\n# 4 2 (Intercept) 65.1 23.0 2.83 0.0111\n# 5 2 wind 0.00691 0.0129 0.535 0.599\n# 6 2 temp -0.448 0.877 -0.510 0.616\n# 7 3 (Intercept) 65.2 17.8 3.67 0.00175\n# 8 3 wind 0.00515 0.0112 0.458 0.652\n# 9 3 temp -1.87 0.695 -2.69 0.0148\n# 10 4 (Intercept) 49.7 17.6 2.83 0.0111\n# # \xe2\x80\xa6 with 62 more rows\nRun Code Online (Sandbox Code Playgroud)\n
我认为您可以dplyr以更合适的方式使用,而不需要在@fabians anwser中定义函数.
results<-df.h %.%
group_by(hour) %.%
do(failwith(NULL, lm), formula = price ~ wind + temp)
Run Code Online (Sandbox Code Playgroud)
要么
results<-do(group_by(tbl_df(df.h), hour),
failwith(NULL, lm), formula = price ~ wind + temp)
Run Code Online (Sandbox Code Playgroud)
编辑:
当然它也没有failwith
results<-df.h %.%
group_by(hour) %.%
do(lm, formula = price ~ wind + temp)
results<-do(group_by(tbl_df(df.h), hour),
lm, formula = price ~ wind + temp)
Run Code Online (Sandbox Code Playgroud)
tidyverse 的一些修订后来do()被取代,我们可以用更少的一行代码来适应每组一个模型。
library("broom")\nlibrary("tidyverse")\n\ndf.h <- data.frame(\n hour = factor(rep(1:24, each = 21)),\n price = runif(504, min = -10, max = 125),\n wind = runif(504, min = 0, max = 2500),\n temp = runif(504, min = -10, max = 25)\n)\n\ndf.h %>%\n group_by(hour) %>%\n group_modify(\n # Use `tidy`, `glance` or `augment` to extract different information from the fitted models.\n ~ tidy(lm(price ~ wind + temp, data = .))\n )\n#> # A tibble: 72 \xc3\x97 6\n#> # Groups: hour [24]\n#> hour term estimate std.error statistic p.value\n#> <fct> <chr> <dbl> <dbl> <dbl> <dbl>\n#> 1 1 (Intercept) 73.9 16.3 4.52 0.000266\n#> 2 1 wind -0.0256 0.0119 -2.15 0.0456 \n#> 3 1 temp 1.72 0.861 2.00 0.0604 \n#> 4 2 (Intercept) 81.5 18.4 4.42 0.000331\n#> 5 2 wind -0.0111 0.00973 -1.14 0.270 \n#> 6 2 temp -1.60 0.763 -2.09 0.0506 \n#> 7 3 (Intercept) 59.9 16.1 3.73 0.00154 \n#> 8 3 wind 0.00358 0.0102 0.349 0.731 \n#> 9 3 temp -1.82 0.664 -2.74 0.0134 \n#> 10 4 (Intercept) 49.6 18.5 2.69 0.0151 \n#> # \xe2\x80\xa6 with 62 more rows\nRun Code Online (Sandbox Code Playgroud)\n由reprex 包(v2.0.1)于 2022 年 4 月 20 日创建
\n从 dplyr 1.0.0 开始,group_split提供了此操作的便捷快捷方式:
library(dplyr)\n#> Warning: package \'dplyr\' was built under R version 4.2.3\n#> \n#> Attaching package: \'dplyr\'\n#> The following objects are masked from \'package:stats\':\n#> \n#> filter, lag\n#> The following objects are masked from \'package:base\':\n#> \n#> intersect, setdiff, setequal, union\nlibrary(broom)\n#> Warning: package \'broom\' was built under R version 4.2.2\nlibrary(purrr)\n#> Warning: package \'purrr\' was built under R version 4.2.2\ndf.h <- data.frame( \n hour = factor(rep(1:24, each = 21)),\n price = runif(504, min = -10, max = 125),\n wind = runif(504, min = 0, max = 2500),\n temp = runif(504, min = - 10, max = 25) \n)\n\ndf.g <- group_split(df.h, hour)\nmap_dfr(df.g, function(x) {\n tidy(lm(price ~ wind + temp, data=x)) |> \n mutate(hour = x$hour[[1]])\n })\n#> # A tibble: 72 \xc3\x97 6\n#> term estimate std.error statistic p.value hour \n#> <chr> <dbl> <dbl> <dbl> <dbl> <fct>\n#> 1 (Intercept) 115. 25.4 4.53 0.000260 1 \n#> 2 wind -0.00627 0.0129 -0.487 0.632 1 \n#> 3 temp -2.57 1.26 -2.04 0.0568 1 \n#> 4 (Intercept) 71.0 16.6 4.28 0.000455 2 \n#> 5 wind 0.00262 0.0112 0.233 0.818 2 \n#> 6 temp -0.824 0.834 -0.989 0.336 2 \n#> 7 (Intercept) 39.3 22.5 1.74 0.0984 3 \n#> 8 wind 0.000342 0.0137 0.0250 0.980 3 \n#> 9 temp -0.248 0.964 -0.257 0.800 3 \n#> 10 (Intercept) 56.1 21.6 2.59 0.0184 4 \n#> # \xe2\x84\xb9 62 more rows\nRun Code Online (Sandbox Code Playgroud)\n创建于 2023-05-15,使用reprex v2.0.2
\n