How to sum a variable by group?

use*_*421 320 sorting r r-faq

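Suppose I have two columns of data. The first contains categories such as "First", "Second", "Third", etc. The second contains numbers representing how many times I saw the corresponding category.

For example: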

Category     Frequency
First        10
First        15
First        5
Second       2
Third        14
Third        20
Second       3

I want to sort the data by Category and sum the Frequency within each category:

Category     Frequency
First        30
Second       5
Third        34

How would I do this in R?

rcs*_*rcs 355

Using aggregate:

aggregate(x$Frequency, by=list(Category=x$Category), FUN=sum)
  Category  x
1    First 30
2   Second  5
3    Third 34

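In the example above, multiple grouping dimensions can be specified in the list. Multiple aggregated metrics of the same data type can be combined via cbind: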

aggregate(cbind(x$Frequency, x$Metric2, x$Metric3) ...
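The elided call above can be illustrated with a minimal sketch. The data frame `d` and its `Region` and `Metric2` columns are hypothetical, invented only for this example; the point is that `by` takes one list entry per grouping dimension and `cbind` supplies one column per summed measure.

```r
# Hypothetical data: Region and Metric2 are made up for illustration
d <- data.frame(
  Category  = c("First", "First", "First", "Second", "Third", "Third", "Second"),
  Region    = c("East", "West", "East", "East", "West", "West", "East"),
  Frequency = c(10, 15, 5, 2, 14, 20, 3),
  Metric2   = c(1, 2, 3, 4, 5, 6, 7)
)

# Two grouping dimensions in `by`, two summed measures via cbind();
# naming the cbind() arguments gives the result columns readable names
res <- aggregate(
  cbind(Frequency = d$Frequency, Metric2 = d$Metric2),
  by  = list(Category = d$Category, Region = d$Region),
  FUN = sum
)
res
```

Each combination of Category and Region that occurs in the data gets one row in `res`, with both measures summed by the same function.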

(Incorporating @thelatemail's comment) aggregate also has a formula interface:

aggregate(Frequency ~ Category, x, sum)

Or, if you want to aggregate multiple columns, you can use the . notation (which also works for a single column):

aggregate(. ~ Category, x, sum)

Or tapply:

tapply(x$Frequency, x$Category, FUN=sum)
 First Second  Third 
    30      5     34 

Using this data:

x <- data.frame(Category=factor(c("First", "First", "First", "Second",
                                      "Third", "Third", "Second")), 
                    Frequency=c(10,15,5,2,14,20,3))

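  • @AndrewMcKinlay, R uses the tilde to define symbolic formulas, for statistics and other functions. It can be read as *"model Frequency by Category"* or *"Frequency depending on Category"*. Not all languages use a special operator to define a symbolic function the way R does here. Perhaps with that "natural-language interpretation" of the tilde operator it becomes more meaningful (and even intuitive). I personally find this symbolic formula representation better than some of the more verbose alternatives. (4 upvotes)
  • As someone new to R (and asking the same kind of questions as the OP), I would benefit from more detail on the syntax behind each alternative. For instance, if I have a larger source table and only want to sub-select two dimensions plus the summed metric, can I adapt any of these methods? Hard to tell. (2 upvotes)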
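As one possible answer to the last comment, here is a minimal sketch of summing a single metric over two dimensions taken from a wider table. The table `big` and its `Year`, `Store` and `Other` columns are hypothetical. With the formula interface, only the columns named in the formula take part; the `by =` form gives the same result by indexing the columns explicitly.

```r
# Hypothetical wider table; Store and Other are deliberately not used below
big <- data.frame(
  Category  = c("First", "First", "Second", "Second", "First"),
  Year      = c(2020, 2021, 2020, 2020, 2020),
  Store     = c("A", "B", "A", "B", "B"),
  Frequency = c(10, 15, 2, 3, 5),
  Other     = c(1.5, 2.5, 3.5, 4.5, 5.5)
)

# Formula interface: metric on the left, grouping dimensions on the right
out <- aggregate(Frequency ~ Category + Year, data = big, FUN = sum)

# Equivalent by= form: select the metric and the grouping columns by name
out2 <- aggregate(big["Frequency"], by = big[c("Category", "Year")], FUN = sum)

out
```

Columns that appear in neither the formula nor the selection are simply dropped from the result.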

tal*_*lat 200

More recently, you can also use the dplyr package for this purpose:

library(dplyr)
x %>% 
  group_by(Category) %>% 
  summarise(Frequency = sum(Frequency))

#Source: local data frame [3 x 2]
#
#  Category Frequency
#1    First        30
#2   Second         5
#3    Third        34

Or, for multiple summary columns (this also works with a single column):

x %>% 
  group_by(Category) %>% 
  summarise_all(funs(sum))

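Update for dplyr >= 0.5: summarise_each has been replaced by the summarise_all, summarise_at and summarise_if family of functions in dplyr.

Or, if you have multiple columns to group by, you can list all of them in group_by, separated by commas (the examples below use the built-in mtcars dataset):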

# several summary columns with arbitrary names
mtcars %>% 
  group_by(cyl, gear) %>%                            # multiple group columns
  summarise(max_hp = max(hp), mean_mpg = mean(mpg))  # multiple summary columns

# summarise all columns except grouping columns using "sum" 
mtcars %>% 
  group_by(cyl) %>% 
  summarise_all(sum)

# summarise all columns except grouping columns using "sum" and "mean"
mtcars %>% 
  group_by(cyl) %>% 
  summarise_all(funs(sum, mean))

# multiple grouping columns
mtcars %>% 
  group_by(cyl, gear) %>% 
  summarise_all(funs(sum, mean))

# summarise specific variables, not all
mtcars %>% 
  group_by(cyl, gear) %>% 
  summarise_at(vars(qsec, mpg, wt), funs(sum, mean))

# summarise specific variables (numeric columns except grouping columns)
mtcars %>% 
  group_by(gear) %>% 
  summarise_if(is.numeric, funs(mean))

For more information, including on the %>% operator, see the introduction to dplyr.

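  • @asieira, which is fastest and how big the difference is (or whether it is noticeable at all) will always depend on your data size. Typically, for large data sets, for example several GB, data.table will most likely be the fastest. On smaller data sizes, data.table and dplyr are often close, also depending on the number of groups. Both data.table and dplyr will be quite a lot faster than base functions, however (possibly 100-1000 times faster for some operations). Also see [here](http://stackoverflow.com/questions/21435339/data-table-vs-dplyr-can-one-do-something-well-the-other-cant-or-does-poorly) (4 upvotes)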

asi*_*ira 66

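The answer provided by rcs works and is simple. However, if you are handling larger datasets and need a performance boost, there is a faster alternative: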

library(data.table)
data = data.table(Category=c("First","First","First","Second","Third", "Third", "Second"), 
                  Frequency=c(10,15,5,2,14,20,3))
data[, sum(Frequency), by = Category]
#    Category V1
# 1:    First 30
# 2:   Second  5
# 3:    Third 34
system.time(data[, sum(Frequency), by = Category] )
# user    system   elapsed 
# 0.008     0.001     0.009 

Let's compare that with the same operation using a data.frame and the aggregate approach above:

data = data.frame(Category=c("First","First","First","Second","Third", "Third", "Second"),
                  Frequency=c(10,15,5,2,14,20,3))
system.time(aggregate(data$Frequency, by=list(Category=data$Category), FUN=sum))
# user    system   elapsed 
# 0.008     0.000     0.015 

And if you want to keep the column name, this is the syntax:

data[,list(Frequency=sum(Frequency)),by=Category]
#    Category Frequency
# 1:    First        30
# 2:   Second         5
# 3:    Third        34

The difference becomes more noticeable with larger datasets, as the code below demonstrates:

data = data.table(Category=rep(c("First", "Second", "Third"), 100000),
                  Frequency=rnorm(100000))
system.time( data[,sum(Frequency),by=Category] )
# user    system   elapsed 
# 0.055     0.004     0.059 
data = data.frame(Category=rep(c("First", "Second", "Third"), 100000), 
                  Frequency=rnorm(100000))
system.time( aggregate(data$Frequency, by=list(Category=data$Category), FUN=sum) )
# user    system   elapsed 
# 0.287     0.010     0.296 

For multiple aggregations, you can combine lapply and .SD as follows:

data[, lapply(.SD, sum), by = Category]
#    Category Frequency
# 1:    First        30
# 2:   Second         5
# 3:    Third        34

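  • +1, but 0.296 vs 0.059 isn't particularly impressive. The data size needs to be much bigger than 300k rows, with more than 3 groups, for data.table to shine. We'll try to support more than 2 billion rows soon, for example, since some data.table users have 250GB of RAM and GNU R now supports length > 2^31. (11 upvotes)
  • Using .N would equal sum(Frequency) only if all the values in the Frequency column were equal to 1, because .N counts the number of rows in each aggregated set (.SD). That is not the case here. (3 upvotes)
  • True. It turns out I didn't have all that RAM and was only trying to provide some evidence of data.table's superior performance. I'm sure the difference would be even larger with more data. (2 upvotes)
  • There is an even shorter way to write this: `data[, sum(Frequency), by = Category]`. You can use `.N` in place of the `sum()` function: `data[, .N, by = Category]`. Here is a useful cheat sheet: https://s3.amazonaws.com/assets.datacamp.com/img/blog/data+table+cheat+sheet.pdf (2 upvotes)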
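The disagreement in the comments about `.N` can be checked directly. This is a minimal sketch on the question's data (the object name `dt` and the column names `N` and `Total` are chosen here for illustration): it computes the row count and the sum side by side so the two can be compared per group.

```r
library(data.table)

dt <- data.table(
  Category  = c("First", "First", "First", "Second", "Third", "Third", "Second"),
  Frequency = c(10, 15, 5, 2, 14, 20, 3)
)

# .N is the number of rows in each group; sum(Frequency) adds up the values
counts_and_sums <- dt[, .(N = .N, Total = sum(Frequency)), by = Category]
counts_and_sums
```

On this data the two columns differ in every group, which supports the comment that `.N` is a substitute for `sum()` only when every Frequency value is 1.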

Sha*_*ane 36

This is somewhat related to this question.

You can also just use the by() function:

x2 <- by(x$Frequency, x$Category, sum)
do.call(rbind,as.list(x2))

Those other packages (plyr, reshape) have the benefit of returning a data.frame, but it is worth being familiar with by() since it is a base function.


lea*_*rnr 25

library(plyr)
ddply(tbl, .(Category), summarise, sum = sum(Frequency))


Dav*_*urg 24

Several years later, just to add another simple base R solution that for some reason isn't present here: xtabs

xtabs(Frequency ~ Category, df)
# Category
# First Second  Third 
#    30      5     34 

Or, if you want a data.frame back:

as.data.frame(xtabs(Frequency ~ Category, df))
#   Category Freq
# 1    First   30
# 2   Second    5
# 3    Third   34
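The same formula extends to more than one grouping variable, in which case xtabs returns a cross-tabulation of sums. The following is only a sketch; the data frame `DF` with columns `Categ1`, `Categ2` and `Freq` is an assumption made for this example rather than part of the answer above.

```r
# Illustrative data with two grouping columns (assumed, not from the answer above)
DF <- data.frame(
  Categ1 = c("A", "A", "B", "B", "A", "B", "A"),
  Categ2 = c("X", "Y", "X", "X", "X", "Y", "Y"),
  Freq   = c(10, 30, 45, 55, 80, 65, 50)
)

# Sum of Freq for every Categ1 x Categ2 combination, as a 2-d table
tab <- xtabs(Freq ~ Categ1 + Categ2, data = DF)

# Long format: one row per combination, with the sums in a column named Freq
long <- as.data.frame(tab)
```

The table form is convenient for reading off a single cell such as `tab["A", "X"]`, while the long form is easier to merge with other data frames.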


Rob*_*man 19

If x is a data frame containing your data, then the following will do what you want:

require(reshape)
recast(x, Category ~ ., fun.aggregate=sum)


joe*_*nko 17

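Although I have recently become a convert to dplyr for most of these types of operations, the sqldf package is still really nice (and IMHO more readable) for some things.

Here is an example of how this question can be answered with sqldf: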

library(sqldf)

x <- data.frame(Category=factor(c("First", "First", "First", "Second",
                                  "Third", "Third", "Second")), 
                Frequency=c(10,15,5,2,14,20,3))

sqldf("select 
          Category
          ,sum(Frequency) as Frequency 
       from x 
       group by 
          Category")

##   Category Frequency
## 1    First        30
## 2   Second         5
## 3    Third        34


dal*_*ogm 16

Just to add a third option:

require(doBy)
summaryBy(Frequency~Category, data=yourdataframe, FUN=sum)

EDIT: this is a very old answer. Now I would recommend using group_by and summarise from dplyr, as in @docendo's answer.


dig*_*All 8

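I find ave very helpful (and efficient) when you need to apply different aggregation functions to different columns (and you must/want to stick with base R):

For example:

Given this input: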

DF <-                
data.frame(Categ1=factor(c('A','A','B','B','A','B','A')),
           Categ2=factor(c('X','Y','X','X','X','Y','Y')),
           Samples=c(1,2,4,3,5,6,7),
           Freq=c(10,30,45,55,80,65,50))

> DF
  Categ1 Categ2 Samples Freq
1      A      X       1   10
2      A      Y       2   30
3      B      X       4   45
4      B      X       3   55
5      A      X       5   80
6      B      Y       6   65
7      A      Y       7   50

We want to group by Categ1 and Categ2 and compute the sum of Samples and the mean of Freq.
Here is a possible solution using ave:

# create a copy of DF (only the grouping columns)
DF2 <- DF[,c('Categ1','Categ2')]

# add sum of Samples by Categ1,Categ2 to DF2 
# (ave repeats the sum of the group for each row in the same group)
DF2$GroupTotSamples <- ave(DF$Samples,DF2,FUN=sum)

# add mean of Freq by Categ1,Categ2 to DF2 
# (ave repeats the mean of the group for each row in the same group)
DF2$GroupAvgFreq <- ave(DF$Freq,DF2,FUN=mean)

# remove the duplicates (keep only one row for each group)
DF2 <- DF2[!duplicated(DF2),]

Result:

> DF2
  Categ1 Categ2 GroupTotSamples GroupAvgFreq
1      A      X               6           45
2      A      Y               9           40
3      B      X               7           50
6      B      Y               6           65
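For comparison, the same result can be reached in base R without ave. The sketch below is an alternative technique, not the method shown above: it runs aggregate once per summary function and joins the two results with merge on the grouping columns. The intermediate names `tot`, `avg` and `res` are chosen here for illustration.

```r
DF <- data.frame(
  Categ1  = c("A", "A", "B", "B", "A", "B", "A"),
  Categ2  = c("X", "Y", "X", "X", "X", "Y", "Y"),
  Samples = c(1, 2, 4, 3, 5, 6, 7),
  Freq    = c(10, 30, 45, 55, 80, 65, 50)
)

# One aggregate() call per function, each grouped by the same two columns
tot <- aggregate(Samples ~ Categ1 + Categ2, data = DF, FUN = sum)
avg <- aggregate(Freq ~ Categ1 + Categ2, data = DF, FUN = mean)

# Join on the grouping columns; merge() sorts the result by those keys
res <- merge(tot, avg, by = c("Categ1", "Categ2"))
res
```

This variant produces one row per group directly, so no de-duplication step is needed, at the cost of scanning the data once per summary function.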


Kar*_*ius 8

Another solution, which returns the sums by group in a matrix or a data frame and is short and fast:

rowsum(x$Frequency, x$Category)

  • Nice, and indeed fast. (2 upvotes)
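A short sketch of what rowsum returns and how it handles several columns at once. The `Frequency2` column is an assumption added for illustration. Passing a data frame of numeric columns yields a data frame of group sums whose row names are the group labels.

```r
x <- data.frame(
  Category   = c("First", "First", "First", "Second", "Third", "Third", "Second"),
  Frequency  = c(10, 15, 5, 2, 14, 20, 3),
  Frequency2 = c(10, 30, 15, 8, 70, 120, 21)  # hypothetical second measure
)

# Sum every selected column within each Category; groups become row names
sums <- rowsum(x[c("Frequency", "Frequency2")], group = x$Category)

# Optional: move the group labels from the row names into a regular column
result <- data.frame(Category = rownames(sums), sums, row.names = NULL)
result
```

With a plain numeric vector as the first argument, as in the one-liner above, the result is a one-column matrix instead of a data frame.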

tmf*_*mnk 7

Since dplyr 1.0.0, the across() function can be used:

df %>%
 group_by(Category) %>%
 summarise(across(Frequency, sum))

  Category Frequency
  <chr>        <int>
1 First           30
2 Second           5
3 Third           34

If you are interested in multiple variables:

df %>%
 group_by(Category) %>%
 summarise(across(c(Frequency, Frequency2), sum))

  Category Frequency Frequency2
  <chr>        <int>      <int>
1 First           30         55
2 Second           5         29
3 Third           34        190

And selecting variables with select helpers:

df %>%
 group_by(Category) %>%
 summarise(across(starts_with("Freq"), sum))

  Category Frequency Frequency2 Frequency3
  <chr>        <int>      <int>      <dbl>
1 First           30         55        110
2 Second           5         29         58
3 Third           34        190        380

Sample data:

df <- read.table(text = "Category Frequency Frequency2 Frequency3
                 1    First        10         10         20
                 2    First        15         30         60
                 3    First         5         15         30
                 4   Second         2          8         16
                 5    Third        14         70        140
                 6    Third        20        120        240
                 7   Second         3         21         42",
                 header = TRUE,
                 stringsAsFactors = FALSE)


Man*_*kis 6

You can use the function group.sum from the Rfast package.

Category <- Rfast::as_integer(Category,result.sort=FALSE) # convert character to numeric. R's as.numeric produces NAs.
result <- Rfast::group.sum(Frequency,Category)
names(result) <- Rfast::Sort(unique(Category))
# 30 5 34

Rfast has many group functions, and group.sum is one of them.


Gra*_*non 5

Using cast instead of recast (note that 'Frequency' is now 'value'):

df  <- data.frame(Category = c("First","First","First","Second","Third","Third","Second")
                  , value = c(10,15,5,2,14,20,3))

install.packages("reshape")
library(reshape)

result<-cast(df, Category ~ . ,fun.aggregate=sum)

Which gives:

Category (all)
First     30
Second    5
Third     34