Retrieve tidy results from regressions by group with broom

C. *_*Rea 8 r dplyr broom tidymodels

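The answer to this question clearly explains how to retrieve tidy regression results by group when running regressions through a dplyr pipe, but the solution is no longer reproducible.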

How can dplyr and broom be combined to run regressions by group and retrieve tidy results using R 4.02, dplyr 1.0.0, and broom 0.7.0?

Specifically, the example answer from the question linked above,

library(dplyr)
library(broom)

df.h = data.frame( 
  hour     = factor(rep(1:24, each = 21)),
  price    = runif(504, min = -10, max = 125),
  wind     = runif(504, min = 0, max = 2500),
  temp     = runif(504, min = - 10, max = 25)  
)

dfHour = df.h %>% group_by(hour) %>%
  do(fitHour = lm(price ~ wind + temp, data = .))

# get the coefficients by group in a tidy data_frame
dfHourCoef = tidy(dfHour, fitHour)

returns the following error (and three warnings) when I run it on my system:

Error in var(if (is.vector(x) || is.factor(x)) x else as.double(x), na.rm = na.rm) : 
  Calling var(x) on a factor x is defunct.
  Use something like 'all(duplicated(x)[-1L])' to test for a constant vector.
In addition: Warning messages:
1: Data frame tidiers are deprecated and will be removed in an upcoming release of broom. 
2: In mean.default(X[[i]], ...) :
  argument is not numeric or logical: returning NA
3: In mean.default(X[[i]], ...) :
  argument is not numeric or logical: returning NA

If I convert df.h$hour to character rather than factor,

df.h <- df.h %>%
  mutate(
    hour = as.character(hour)
  )

re-run the regressions by group, and again try to retrieve the results with broom::tidy,

dfHour = df.h %>% group_by(hour) %>%
  do(fitHour = lm(price ~ wind + temp, data = .))

# get the coefficients by group in a tidy data_frame
dfHourCoef = tidy(dfHour, fitHour)

I get this error:

Error in var(if (is.vector(x) || is.factor(x)) x else as.double(x), na.rm = na.rm) : 
  is.atomic(x) is not TRUE

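I think the issue is related to the group-level regression results being stored as a list column, dfHour$fitHour, in the originally posted code/answer.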
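A quick way to check that suspicion is to inspect the column directly. The sketch below assumes the same simulated data as above (the `set.seed()` call is an arbitrary addition for repeatability) and only looks at classes. Tidying a single element of the list should work even when tidying the whole data frame fails.

```r
library(dplyr)
library(broom)

set.seed(42)  # arbitrary seed, added only so the sketch is repeatable
df.h <- data.frame(
  hour  = factor(rep(1:24, each = 21)),
  price = runif(504, min = -10, max = 125),
  wind  = runif(504, min = 0, max = 2500),
  temp  = runif(504, min = -10, max = 25)
)

dfHour <- df.h %>%
  group_by(hour) %>%
  do(fitHour = lm(price ~ wind + temp, data = .))

class(dfHour)               # classes carried by the result of do()
is.list(dfHour$fitHour)     # is the model column a list?
class(dfHour$fitHour[[1]])  # class of one stored model

# Tidy one model at a time instead of the whole data frame
firstFit <- tidy(dfHour$fitHour[[1]])
firstFit
```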

gle*_*ton 6

****** Updated with more concise code taken from the dplyr 1.0.0 release notes ******

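Thank you. I was struggling with a similar problem caused by the dplyr 1.0.0 update while using the example in the link provided. This is a helpful question and answer.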

As an FYI, do() is superseded as of dplyr 1.0.0, so consider using the newer syntax (with my update it now works very efficiently):

dfHour = df.h %>% 
  # replace group_by() with nest_by() 
  # to nest each group's rows into a list-column named `data`
  nest_by(hour) %>%
  # change do() to mutate(), then add list() before your model
  # make sure to change data = .  to data = data
  mutate(fitHour = list(lm(price ~ wind + temp, data = data))) %>%
  # tidy the fitted model stored in fitHour, one set of rows per hour
  summarise(tidy(fitHour))

Done!

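This gives a very efficient data frame of selected output statistics. The last line replaces the following code (from my original response), which does the same thing but less simply: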

ungroup() %>%
  # then leverage the feedback from @akrun
  transmute(hour, HourCoef = map(fitHour, tidy)) %>%
  unnest(HourCoef)

dfHour

This gives the output:

# A tibble: 72 x 6
   hour  term         estimate std.error statistic  p.value
   <fct> <chr>           <dbl>     <dbl>     <dbl>    <dbl>
 1 1     (Intercept) 68.6        21.0       3.27   0.00428 
 2 1     wind         0.000558    0.0124    0.0450 0.965   
 3 1     temp        -0.866       0.907    -0.954  0.353   
 4 2     (Intercept) 31.9        17.4       1.83   0.0832  
 5 2     wind         0.00950     0.0113    0.838  0.413   
 6 2     temp         1.69        0.802     2.11   0.0490  
 7 3     (Intercept) 85.5        22.3       3.83   0.00122 
 8 3     wind        -0.0210      0.0165   -1.27   0.220   
 9 3     temp         0.276       1.14      0.243  0.811   
10 4     (Intercept) 73.3        15.1       4.86   0.000126
# ... with 62 more rows

Thanks for your patience, I'm working through this myself!
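For readers on newer dplyr releases, here is a hedged variant of the same pipeline. It assumes dplyr 1.1.0 or later, where returning several rows per group from summarise() is deprecated and reframe() is the suggested replacement. The seed and the name `dfHourCoef` are illustrative choices, not part of the answer above.

```r
library(dplyr)
library(broom)

set.seed(42)  # arbitrary seed, added only so the sketch is repeatable
df.h <- data.frame(
  hour  = factor(rep(1:24, each = 21)),
  price = runif(504, min = -10, max = 125),
  wind  = runif(504, min = 0, max = 2500),
  temp  = runif(504, min = -10, max = 25)
)

dfHourCoef <- df.h %>%
  # one row per hour, with that hour's rows nested in `data`
  nest_by(hour) %>%
  # fit one model per row and store it in a list-column
  mutate(fitHour = list(lm(price ~ wind + temp, data = data))) %>%
  # assumption: dplyr >= 1.1.0; reframe() allows several rows per group
  reframe(tidy(fitHour))

dfHourCoef
```

On dplyr 1.0.x, where reframe() is not available, the summarise() form shown above should be used instead.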


akr*_*run 5

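The issue is that there is a rowwise grouping attribute after the do call, and the 'fitHour' column is a list. We can ungroup, loop over the list with map, and tidy each element into a list column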

library(dplyr)
library(purrr)
library(broom)
df.h %>% 
  group_by(hour) %>%
  do(fitHour = lm(price ~ wind + temp, data = .)) %>% 
  ungroup %>% 
  mutate(HourCoef = map(fitHour, tidy))

Or use unnest after the mutate

library(tidyr)  # for unnest()
df.h %>% 
  group_by(hour) %>%
  do(fitHour = lm(price ~ wind + temp, data = .)) %>% 
  ungroup %>% 
  transmute(hour, HourCoef = map(fitHour, tidy)) %>% 
  unnest(HourCoef)
# A tibble: 72 x 6
#   hour  term        estimate std.error statistic  p.value
#   <fct> <chr>          <dbl>     <dbl>     <dbl>    <dbl>
# 1 1     (Intercept) 89.8       20.2       4.45   0.000308
# 2 1     wind         0.00493    0.0151    0.326  0.748   
# 3 1     temp        -1.84       1.08     -1.71   0.105   
# 4 2     (Intercept) 75.6       23.7       3.20   0.00500 
# 5 2     wind        -0.00910    0.0146   -0.622  0.542   
# 6 2     temp         0.192      0.853     0.225  0.824   
# 7 3     (Intercept) 44.0       23.9       1.84   0.0822  
# 8 3     wind        -0.00158    0.0166   -0.0953 0.925   
# 9 3     temp         0.622      1.19      0.520  0.609   
#10 4     (Intercept) 57.8       18.9       3.06   0.00676 
# … with 62 more rows

If we want a single dataset, pull the 'fitHour' list, loop over it with map, and condense it to a single dataset by row binding (the _dfr suffix)

df.h %>%
  group_by(hour) %>%  
  do(fitHour = lm(price ~ wind + temp, data = .)) %>% 
  ungroup %>% 
  pull(fitHour) %>% 
  map_dfr(tidy, .id = 'grp')
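One detail worth noting about the `.id` argument: when the list passed to map_dfr has no names, the id column holds the position of each element rather than the group label. A minimal sketch of one way to carry the hour labels through, assuming the same simulated data (the seed and the names `fits` and `coefs` are illustrative):

```r
library(dplyr)
library(purrr)
library(broom)

set.seed(42)  # arbitrary seed, added only so the sketch is repeatable
df.h <- data.frame(
  hour  = factor(rep(1:24, each = 21)),
  price = runif(504, min = -10, max = 125),
  wind  = runif(504, min = 0, max = 2500),
  temp  = runif(504, min = -10, max = 25)
)

fits <- df.h %>%
  group_by(hour) %>%
  do(fitHour = lm(price ~ wind + temp, data = .)) %>%
  ungroup()

coefs <- fits$fitHour %>%
  # name each model after its hour so .id records the label, not the index
  setNames(as.character(fits$hour)) %>%
  map_dfr(tidy, .id = "hour")

coefs
```

Here the labels happen to match the positions because the hours are 1 to 24 in order, but naming the list keeps the result correct for groups whose labels are not consecutive integers.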

Note: the OP's error message can be reproduced with R 4.02, dplyr 1.0.0, and broom 0.7.0

tidy(dfHour, fitHour)

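Error in var(if (is.vector(x) || is.factor(x)) x else as.double(x), na.rm = na.rm) : 
  Calling var(x) on a factor x is defunct.
  Use something like 'all(duplicated(x)[-1L])' to test for a constant vector.
In addition: Warning messages:
1: Data frame tidiers are deprecated and will be removed in an upcoming release of broom. 
2: In mean.default(X[[i]], ...) :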
