Summarizing multiple columns with dplyr?

Dan*_*iel 141 aggregate r dplyr

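I am struggling a bit with the dplyr syntax. I have a data frame with several variables and one grouping variable. Now I want to calculate the mean of each column within each group, using dplyr in R.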

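library(dplyr)

n <- 10  # sample size; left undefined in the original question, so this value is an assumption
df <- data.frame(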
    a = sample(1:5, n, replace = TRUE), 
    b = sample(1:5, n, replace = TRUE), 
    c = sample(1:5, n, replace = TRUE), 
    d = sample(1:5, n, replace = TRUE), 
    grp = sample(1:3, n, replace = TRUE)
)
df %>% group_by(grp) %>% summarise(mean(a))

This gives me the mean of column "a" for each group indicated by "grp".

My question is: is it possible to get the means of every column within each group at once? Or do I have to repeat df %>% group_by(grp) %>% summarise(mean(a)) for each column?

What I would like to have is something like

df %>% group_by(grp) %>% summarise(mean(a:d)) # "mean(a:d)" does not work

Art*_*sov 247

The dplyr package provides summarise_all for this purpose:

df %>% group_by(grp) %>% summarise_all(funs(mean))
#> Source: local data frame [3 x 5]
#> 
#>     grp        a        b        c        d
#>   (int)    (dbl)    (dbl)    (dbl)    (dbl)
#> 1     1 3.000000 2.666667 2.666667 3.333333
#> 2     2 2.666667 2.666667 2.500000 2.833333
#> 3     3 4.000000 1.000000 4.000000 3.000000

If you want to summarise only certain columns, use the summarise_at or summarise_if functions.
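A minimal sketch of the two scoped variants just mentioned. The data frame `toy` and its values are made up for illustration and are not from the original post; `label` is a non-numeric column added to show what each variant keeps.

```r
library(dplyr)

# Hypothetical toy data (not from the original answer)
toy <- data.frame(
  a = c(1, 2, 3, 4),
  b = c(10, 20, 30, 40),
  label = c("w", "x", "y", "z"),
  grp = c(1, 1, 2, 2),
  stringsAsFactors = FALSE
)

# summarise_at(): only the columns listed in vars()
means_at <- toy %>% group_by(grp) %>% summarise_at(vars(a, b), mean)

# summarise_if(): only the columns for which the predicate returns TRUE,
# so the character column `label` is dropped
means_if <- toy %>% group_by(grp) %>% summarise_if(is.numeric, mean)

print(means_at)
print(means_if)
```

Both calls should return one row per group with the means of `a` and `b`; the grouping column is kept automatically.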

Alternatively, the purrrlyr package provides the same functionality:

df %>% slice_rows("grp") %>% dmap(mean)
#> Source: local data frame [3 x 5]
#> 
#>     grp        a        b        c        d
#>   (int)    (dbl)    (dbl)    (dbl)    (dbl)
#> 1     1 3.000000 2.666667 2.666667 3.333333
#> 2     2 2.666667 2.666667 2.500000 2.833333
#> 3     3 4.000000 1.000000 4.000000 3.000000

Also, don't forget about data.table:

setDT(df)[, lapply(.SD, mean), by = grp]
#>    grp        a        b        c        d
#> 1:   3 3.714286 3.714286 2.428571 2.428571
#> 2:   1 1.000000 4.000000 5.000000 2.000000
#> 3:   2 4.000000 4.500000 3.000000 3.000000

Let's try to compare performance.

library(dplyr)
library(purrrlyr)
library(data.table)
library(benchr)
n <- 10000
df <- data.frame(
    a = sample(1:5, n, replace = TRUE), 
    b = sample(1:5, n, replace = TRUE), 
    c = sample(1:5, n, replace = TRUE), 
    d = sample(1:5, n, replace = TRUE), 
    grp = sample(1:3, n, replace = TRUE)
)
dt <- setDT(df)
benchmark(
    dplyr = df %>% group_by(grp) %>% summarise_all(funs(mean)),
    purrrlyr = df %>% slice_rows("grp") %>% dmap(mean),
    data.table = dt[, lapply(.SD, mean), by = grp]
)
#> Benchmark summary:
#> Time units : microseconds 
#>        expr n.eval  min lw.qu median mean up.qu   max  total relative
#>       dplyr    100 3490  3550   3710 3890  3780 15100 389000     6.98
#>    purrrlyr    100 2540  2590   2680 2920  2860 12000 292000     5.04
#>  data.table    100  459   500    531  563   571  1380  56300     1.00

  • @piotr: use `funs(mean(., na.rm = TRUE))` instead of `funs(mean)`. (8 upvotes)
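A small sketch of the same idea without `funs()`, which has been deprecated since dplyr 0.8.0. The data frame `toy` is hypothetical. Extra arguments given after the function are passed on to it, and the formula form (supported from dplyr 0.8.0) is an equivalent spelling.

```r
library(dplyr)

# Hypothetical toy data with missing values (not from the original answer)
toy <- data.frame(
  a = c(1, NA, 3, 5),
  b = c(2, 4, NA, 8),
  grp = c(1, 1, 2, 2)
)

# Arguments after the function (here na.rm) are forwarded to mean()
res_dots <- toy %>% group_by(grp) %>% summarise_all(mean, na.rm = TRUE)

# Equivalent formula notation
res_formula <- toy %>% group_by(grp) %>% summarise_all(~ mean(., na.rm = TRUE))

print(res_dots)
```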

Kei*_*iku 50

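We can summarize with summarize_at, summarize_all and summarize_if in dplyr 0.7.4. Multiple columns and functions can be set through the vars and funs arguments, as in the code below. The left-hand side of each funs entry is used as the suffix of the summarized variable's name. In dplyr 0.7.4, summarise_each (and mutate_each) is already deprecated, so we cannot use those functions.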

options(scipen = 100, dplyr.width = Inf, dplyr.print_max = Inf)

library(dplyr)
packageVersion("dplyr")
# [1] ‘0.7.4’

set.seed(123)
df <- data_frame(
  a = sample(1:5, 10, replace=T), 
  b = sample(1:5, 10, replace=T), 
  c = sample(1:5, 10, replace=T), 
  d = sample(1:5, 10, replace=T), 
  grp = as.character(sample(1:3, 10, replace=T)) # For convenience, specify character type
)

df %>% group_by(grp) %>% 
  summarise_each(.vars = letters[1:4],
                 .funs = c(mean="mean"))
# `summarise_each()` is deprecated.
# Use `summarise_all()`, `summarise_at()` or `summarise_if()` instead.
# To map `funs` over a selection of variables, use `summarise_at()`
# Error: Strings must match column names. Unknown columns: mean

You should change to the following code. All of the calls below give the same result.

# summarise_at
df %>% group_by(grp) %>% 
  summarise_at(.vars = letters[1:4],
               .funs = c(mean="mean"))

df %>% group_by(grp) %>% 
  summarise_at(.vars = names(.)[1:4],
               .funs = c(mean="mean"))

df %>% group_by(grp) %>% 
  summarise_at(.vars = vars(a,b,c,d),
               .funs = c(mean="mean"))

# summarise_all
df %>% group_by(grp) %>% 
  summarise_all(.funs = c(mean="mean"))

# summarise_if
df %>% group_by(grp) %>% 
  summarise_if(.predicate = function(x) is.numeric(x),
               .funs = funs(mean="mean"))
# A tibble: 3 x 5
# grp a_mean b_mean c_mean d_mean
# <chr>  <dbl>  <dbl>  <dbl>  <dbl>
# 1     1   2.80   3.00    3.6   3.00
# 2     2   4.25   2.75    4.0   3.75
# 3     3   3.00   5.00    1.0   2.00

You can also apply multiple functions.

df %>% group_by(grp) %>% 
  summarise_at(.vars = letters[1:2],
               .funs = c(Mean="mean", Sd="sd"))
# A tibble: 3 x 5
# grp a_Mean b_Mean      a_Sd     b_Sd
# <chr>  <dbl>  <dbl>     <dbl>    <dbl>
# 1     1   2.80   3.00 1.4832397 1.870829
# 2     2   4.25   2.75 0.9574271 1.258306
# 3     3   3.00   5.00        NA       NA

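  • But what if I want the mean for columns 1-13, the sd for columns 14-30 and the sum for columns 31-100, without listing them all? (2 upvotes)
  • I upvoted your comment because I posted this question yesterday: [R summarise_at generated by condition :mean for some columns, sum for other](/sf/ask/4221480021/). (2 upvotes)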
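One possible way to handle the case raised in these comments, sketched under the assumption that dplyr 1.0.0 or later is available (`across()` does not exist in the 0.7.4 release used in the answer above). The data frame `toy`, its values, and the column ranges are illustrative only; each `across()` call takes a column range and its own function.

```r
library(dplyr)  # across() assumes dplyr >= 1.0.0

# Hypothetical toy data (not from the original answer)
toy <- data.frame(
  a = c(1, 2, 3, 4),
  b = c(2, 4, 6, 8),
  c = c(1, 3, 5, 9),
  d = c(1, 1, 2, 2),
  grp = c("x", "x", "y", "y")
)

res <- toy %>%
  group_by(grp) %>%
  summarise(
    across(a:b, mean, .names = "{.col}_mean"),  # mean for the first range
    across(c:d, sum,  .names = "{.col}_sum")    # sum for the second range
  )

print(res)
```

With real data the ranges would be wider (for example the first 13 columns), but the pattern stays the same: one `across()` per function.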

Pau*_*tra 34

You can simply pass more arguments to summarise:

df %>% group_by(grp) %>% summarise(mean(a), mean(b), mean(c), mean(d))

Source: local data frame [3 x 5]

  grp  mean(a)  mean(b)  mean(c) mean(d)
1   1 2.500000 3.500000 2.000000     3.0
2   2 3.800000 3.200000 3.200000     2.8
3   3 3.666667 3.333333 2.333333     3.0

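  • `dplyr` now has `summarise_each`, which operates on each column (13 upvotes)
  • That's a TODO in `dplyr`, I believe (like `plyr`'s `colwise`); see here for a rather awkward current solution: http://stackoverflow.com/a/21296364/1527403 (4 upvotes)
  • Great! Is it possible to do this if the column names and count are unknown? For example, with 3 or 6 columns instead of 4 fixed ones? (2 upvotes)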
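For the last comment, a rough sketch of a helper that does not need to know the column names or how many there are. The function name `group_means` and the data are hypothetical, and the code assumes dplyr 1.0.0 or later for `across()` and the `.groups` argument.

```r
library(dplyr)  # across() and .groups assume dplyr >= 1.0.0

# Hypothetical helper: mean of every non-grouping column, however many there are
group_means <- function(data, group_col) {
  data %>%
    group_by(across(all_of(group_col))) %>%
    summarise(across(everything(), mean), .groups = "drop")
}

# Works the same for three value columns ...
three <- data.frame(x = 1:4, y = 5:8, z = 9:12, grp = c(1, 1, 2, 2))
res3 <- group_means(three, "grp")

# ... or for two
two <- data.frame(p = c(2, 4, 6, 8), q = c(1, 1, 3, 3), grp = c(1, 1, 2, 2))
res2 <- group_means(two, "grp")

print(res3)
print(res2)
```

Inside `summarise()`, `everything()` does not include the grouping column, so only the value columns are averaged.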

小智 6

For completeness: with dplyr v0.2, ddply with colwise (both from the plyr package) will also do this:

> ddply(df, .(grp), colwise(mean))
  grp        a    b        c        d
1   1 4.333333 4.00 1.000000 2.000000
2   2 2.000000 2.75 2.750000 2.750000
3   3 3.000000 4.00 4.333333 3.666667

but it is slower, at least in this case:

> microbenchmark(ddply(df, .(grp), colwise(mean)), 
                  df %>% group_by(grp) %>% summarise_each(funs(mean)))
Unit: milliseconds
                                            expr      min       lq     mean
                ddply(df, .(grp), colwise(mean))     3.278002 3.331744 3.533835
 df %>% group_by(grp) %>% summarise_each(funs(mean)) 1.001789 1.031528 1.109337

   median       uq      max neval
 3.353633 3.378089 7.592209   100
 1.121954 1.133428 2.292216   100


Mat*_*cho 5

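All the examples are great, but I figured I'd add one more to show how working in a "tidy" format simplifies things. Right now the data frame is in "wide" format, meaning the variables "a" through "d" are represented in columns. To get to a "tidy" (or long) format, you can use gather() from the tidyr package, which moves the variables in columns "a" through "d" into rows. Then you use the group_by() and summarize() functions to get the mean of each group. If you want to present the data in wide format, just tack on an additional call to the spread() function.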


library(tidyverse)

# Create reproducible df
set.seed(101)
df <- tibble(a   = sample(1:5, 10, replace=T), 
             b   = sample(1:5, 10, replace=T), 
             c   = sample(1:5, 10, replace=T), 
             d   = sample(1:5, 10, replace=T), 
             grp = sample(1:3, 10, replace=T))

# Convert to tidy format using gather
df %>%
    gather(key = variable, value = value, a:d) %>%
    group_by(grp, variable) %>%
    summarize(mean = mean(value)) %>%
    spread(variable, mean)
#> Source: local data frame [3 x 5]
#> Groups: grp [3]
#> 
#>     grp        a     b        c        d
#> * <int>    <dbl> <dbl>    <dbl>    <dbl>
#> 1     1 3.000000   3.5 3.250000 3.250000
#> 2     2 1.666667   4.0 4.666667 2.666667
#> 3     3 3.333333   3.0 2.333333 2.333333
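In tidyr 1.0.0 and later, gather() and spread() are superseded by pivot_longer() and pivot_wider(). A minimal sketch of the same reshape-summarise-reshape idea with the newer functions follows. The tibble `toy` and its values are made up for illustration, and the `.groups` argument assumes dplyr 1.0.0 or later.

```r
library(dplyr)
library(tidyr)  # pivot_longer()/pivot_wider() assume tidyr >= 1.0.0

# Hypothetical toy data (not from the original answer)
toy <- tibble(
  a = c(1, 2, 3, 4),
  b = c(2, 4, 6, 8),
  grp = c(1, 1, 2, 2)
)

res <- toy %>%
  # wide -> long: one row per (grp, variable) observation
  pivot_longer(cols = a:b, names_to = "variable", values_to = "value") %>%
  group_by(grp, variable) %>%
  summarise(mean = mean(value), .groups = "drop") %>%
  # long -> wide again: one column per original variable
  pivot_wider(names_from = variable, values_from = mean)

print(res)
```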