Function taking the variable name for group_by in dplyr - how do I vectorise this variable in the function?

Question

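I have created a function in R that takes a fixed data frame and uses dplyr to give me summary statistics (e.g., the mean of a particular variable) grouped by a chosen argument variable. Here is some code showing a toy data frame and my function: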

#Create data frame for analysis
DF <- data.frame(Type1  = c(0,0,1,1,0,1,1,0,1,0,1,1,1,0),
                 Type2  = c(1,1,1,1,1,1,2,2,2,2,3,3,3,3),
                 Output = c(4,2,7,5,1,1,7,8,3,2,5,4,3,6));

#Inspect the data-frame
DF;

   Type1 Type2 Output
1      0     1      4
2      0     1      2
3      1     1      7
4      1     1      5
5      0     1      1
6      1     1      1
7      1     2      7
8      0     2      8
9      1     2      3
10     0     2      2
11     1     3      5
12     1     3      4
13     1     3      3
14     0     3      6

#Load dplyr (pipe, group_by, summarise) and rlang (sym)
library(dplyr)
library(rlang)

#Create a function that summarises the mean output grouped by input variable
MEAN_OUT <- function(VAR) { DF %>% group_by(!! sym(VAR)) %>% 
                                   summarise(Mean = mean(Output)) %>% 
                                   as.data.frame(); }

#Call the function grouping by variable 'Type1'
MEAN_OUT('Type1')

  Type1     Mean
1     0 3.833333
2     1 4.375000

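At present I can call MEAN_OUT('Type1') or MEAN_OUT('Type2'), and each gives me the summary grouped by that argument variable. However, I would also like to be able to call MEAN_OUT(c('Type1','Type2')) to get a summary grouped by both variables. dplyr::group_by itself can group by several variables, but I can't figure out how to do this when the call is wrapped inside my function. If I try to group by both variables with my current function (shown above), I get the following error: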

MEAN_OUT(c('Type1','Type2'))
Error: Only strings can be converted to symbols

Answer 1

akr*_*run 5

It would be better to use syms if the intention is to pass multiple grouping variables as a character vector:

library(dplyr)
library(rlang)
MEAN_OUT <- function(VARS) { 
                 DF %>% 
                    group_by(!!! syms(VARS)) %>% 
                    summarise(Mean = mean(Output)) %>% 
                    as.data.frame() 
         }
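
The difference between sym and syms can be checked in isolation. The following is a minimal sketch, not part of the original answer; the variable names vars, sym_result and sym_list are made up for illustration, and the exact wording of the error message depends on the installed rlang version.

```r
library(rlang)

vars <- c("Type1", "Type2")

# sym() accepts exactly one string, so a length-2 character vector is rejected.
# The error is caught here so that the script keeps running.
sym_result <- tryCatch(sym(vars), error = function(e) "error")

# syms() converts each element separately and returns a list of symbols;
# !!! then splices that list into group_by() as separate arguments.
sym_list <- syms(vars)

length(sym_list)
is.symbol(sym_list[[1]])
```

This is why the single-variable version with !! sym(VAR) fails as soon as VAR has more than one element, while !!! syms(VARS) handles one or many.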

However, we can make use of group_by_at, which accepts strings as input, and so avoid both syms and the unquote-splice operator (!!!):

MEAN_OUT2 <- function(VARS) {
                DF %>% 
                     group_by_at(VARS) %>% 
                     summarise(Mean = mean(Output)) %>% 
                     as.data.frame()
    }
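
In dplyr 1.0.0 and later, the scoped verbs such as group_by_at are marked as superseded in favour of across(). The sketch below shows what the same string-based approach might look like there. It assumes dplyr >= 1.0.0 is installed; mean_out_across is a hypothetical name that does not appear in the original answer.

```r
library(dplyr)

# Same toy data frame as in the question
DF <- data.frame(Type1  = c(0,0,1,1,0,1,1,0,1,0,1,1,1,0),
                 Type2  = c(1,1,1,1,1,1,2,2,2,2,3,3,3,3),
                 Output = c(4,2,7,5,1,1,7,8,3,2,5,4,3,6))

# Hypothetical helper: group by the columns named in a character vector.
# all_of() makes it explicit that `vars` holds column names, not a column.
mean_out_across <- function(vars) {
  DF %>%
    group_by(across(all_of(vars))) %>%
    summarise(Mean = mean(Output), .groups = "drop") %>%
    as.data.frame()
}

mean_out_across("Type1")
mean_out_across(c("Type1", "Type2"))
```

The .groups = "drop" argument (also dplyr >= 1.0.0) only removes the leftover grouping and silences the regrouping message; the summary values are unaffected.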

Testing:

identical(MEAN_OUT('Type1'), MEAN_OUT2('Type1'))
#[1] TRUE

identical(MEAN_OUT(c('Type1', 'Type2')), MEAN_OUT2(c('Type1', 'Type2')))
#[1] TRUE

Besides passing the variables as quoted strings, another option is to pass them as quosures:

MEAN_OUT3 <- function(VARS) {
                DF %>% 
                     group_by(!!! VARS) %>% 
                     summarise(Mean = mean(Output)) %>% 
                     as.data.frame()
    }

identical(MEAN_OUT('Type1'), MEAN_OUT3(quos(Type1)))
#[1] TRUE
identical(MEAN_OUT(c('Type1', 'Type2')), MEAN_OUT3(quos(Type1, Type2)))
#[1] TRUE

Or pass the arguments through ... and call quos inside the function:

MEAN_OUT4 <- function(...) {
                DF %>% 
                     group_by(!!! quos(...)) %>% 
                     summarise(Mean = mean(Output)) %>% 
                     as.data.frame()
    }

identical(MEAN_OUT('Type1'), MEAN_OUT4(Type1))
#[1] TRUE

identical(MEAN_OUT(c('Type1', 'Type2')), MEAN_OUT4(Type1, Type2))
#[1] TRUE
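
When the grouping variables arrive through ..., the dots can also be forwarded to group_by unchanged, because group_by captures its own arguments; wrapping them in quos and splicing with !!! is then optional. A minimal sketch of that variant follows (MEAN_OUT5 is a hypothetical name, not part of the original answer):

```r
library(dplyr)

# Same toy data frame as in the question
DF <- data.frame(Type1  = c(0,0,1,1,0,1,1,0,1,0,1,1,1,0),
                 Type2  = c(1,1,1,1,1,1,2,2,2,2,3,3,3,3),
                 Output = c(4,2,7,5,1,1,7,8,3,2,5,4,3,6))

# Hypothetical variant: forward the unquoted column names straight to group_by()
MEAN_OUT5 <- function(...) {
  DF %>%
    group_by(...) %>%
    summarise(Mean = mean(Output)) %>%
    as.data.frame()
}

MEAN_OUT5(Type1)
MEAN_OUT5(Type1, Type2)
```

Recent dplyr versions may print an informational message about the remaining grouping when summarising by more than one variable; it does not change the result.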

Answer 2

小智 1

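@akrun's answer provides a workable solution, but I think this is an ideal case for wrapping the function argument in vars(). The variables to group by are then passed as quoted expressions that dplyr can interpret without any explicit tidyeval code in the body of the function.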

library(tidyverse)
#> -- Attaching packages ------------------------------------ tidyverse 1.2.1 --
#> v ggplot2 3.0.0     v purrr   0.2.5
#> v tibble  1.4.2     v dplyr   0.7.6
#> v tidyr   0.8.0     v stringr 1.3.1
#> v readr   1.1.1     v forcats 0.3.0
#> -- Conflicts --------------------------------------- tidyverse_conflicts() --
#> x dplyr::filter() masks stats::filter()
#> x dplyr::lag()    masks stats::lag()
# Create data frame for analysis
dat <- data.frame(
  Type1  = c(0, 0, 1, 1, 0, 1, 1, 0, 1, 0, 1, 1, 1, 0),
  Type2  = c(1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 3),
  Output = c(4, 2, 7, 5, 1, 1, 7, 8, 3, 2, 5, 4, 3, 6)
)
# using the dplyr::vars() quoting function has 3 main advantages: 
# 1. It makes functions neater
mean_out <- function(.vars) {

  dat %>% 

    # group_by will continue to work for basic selections
    # group_by_at allows for full tidyselect functionality 
    group_by_at(.vars) %>% 
    summarise(mean = mean(Output)) 
}
# 2. It lets us harness the power of tidyselect
mean_out(vars(Type1))
#> # A tibble: 2 x 2
#>   Type1  mean
#>   <dbl> <dbl>
#> 1     0  3.83
#> 2     1  4.38
mean_out(vars(Type1, Type2))
#> # A tibble: 6 x 3
#> # Groups:   Type1 [?]
#>   Type1 Type2  mean
#>   <dbl> <dbl> <dbl>
#> 1     0     1  2.33
#> 2     0     2  5   
#> 3     0     3  6   
#> 4     1     1  4.33
#> 5     1     2  5   
#> 6     1     3  4
mean_out(vars(-Output))
#> # A tibble: 6 x 3
#> # Groups:   Type1 [?]
#>   Type1 Type2  mean
#>   <dbl> <dbl> <dbl>
#> 1     0     1  2.33
#> 2     0     2  5   
#> 3     0     3  6   
#> 4     1     1  4.33
#> 5     1     2  5   
#> 6     1     3  4
mean_out(vars(matches("Type")))
#> # A tibble: 6 x 3
#> # Groups:   Type1 [?]
#>   Type1 Type2  mean
#>   <dbl> <dbl> <dbl>
#> 1     0     1  2.33
#> 2     0     2  5   
#> 3     0     3  6   
#> 4     1     1  4.33
#> 5     1     2  5   
#> 6     1     3  4
# 3. It doesn't demand that we load rlang, since it's built into dplyr
