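Suppose I have two columns of data. The first contains categories such as "First", "Second", "Third", etc. The second contains numbers representing how many times I saw the corresponding category.
For example: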
Category Frequency
First 10
First 15
First 5
Second 2
Third 14
Third 20
Second 3
I want to sort the data by Category and sum the Frequencies:
Category Frequency
First 30
Second 5
Third 34
How would I do this in R?
rcs*_*rcs 355
Using aggregate:
aggregate(x$Frequency, by=list(Category=x$Category), FUN=sum)
Category x
1 First 30
2 Second 5
3 Third 34
In the example above, multiple dimensions can be specified in the list. Multiple aggregated metrics of the same data type can be combined via cbind:
aggregate(cbind(x$Frequency, x$Metric2, x$Metric3) ...
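As a minimal sketch of the cbind form above, assuming a second numeric column exists: Metric2 and its values are invented here purely for illustration and are not part of the question's data. Naming the arguments to cbind keeps readable column names in the result.

```r
# Hypothetical data: Metric2 is an invented column for illustration only
x <- data.frame(
  Category  = c("First", "First", "First", "Second", "Third", "Third", "Second"),
  Frequency = c(10, 15, 5, 2, 14, 20, 3),
  Metric2   = c(1, 2, 3, 4, 5, 6, 7)
)

# cbind() builds a two-column matrix; aggregate() sums each column per Category
res <- aggregate(cbind(Frequency = x$Frequency, Metric2 = x$Metric2),
                 by = list(Category = x$Category), FUN = sum)
print(res)
```

The result should have one row per category and one column per metric.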
(embedding @thelatemail's comment), aggregate also has a formula interface
aggregate(Frequency ~ Category, x, sum)
Or, if you want to aggregate multiple columns, you can use the . notation (it works for one column too)
aggregate(. ~ Category, x, sum)
Or tapply:
tapply(x$Frequency, x$Category, FUN=sum)
First Second Third
30 5 34
Using this data:
x <- data.frame(Category=factor(c("First", "First", "First", "Second",
"Third", "Third", "Second")),
Frequency=c(10,15,5,2,14,20,3))
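Note that tapply returns a named array rather than a data frame. If a data frame is wanted, one possible way to convert it (a sketch, not the only option) is to pull the names and values apart:

```r
x <- data.frame(
  Category  = factor(c("First", "First", "First", "Second",
                       "Third", "Third", "Second")),
  Frequency = c(10, 15, 5, 2, 14, 20, 3)
)

# tapply() gives a one-dimensional array named by the factor levels
totals <- tapply(x$Frequency, x$Category, FUN = sum)

# Split names and values into two ordinary columns
res <- data.frame(Category = names(totals), Frequency = as.vector(totals))
print(res)
```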
tal*_*lat 200
More recently, you can also use the dplyr package for this purpose:
library(dplyr)
x %>%
group_by(Category) %>%
summarise(Frequency = sum(Frequency))
#Source: local data frame [3 x 2]
#
# Category Frequency
#1 First 30
#2 Second 5
#3 Third 34
Or, for multiple summary columns (it works with one column too):
x %>%
group_by(Category) %>%
summarise_all(sum)
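Update for dplyr >= 0.5: summarise_each has been replaced by the summarise_all, summarise_at and summarise_if family of functions in dplyr.
Or, if you have multiple columns to group by, you can list all of them in group_by, separated by commas: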
# several summary columns with arbitrary names
mtcars %>%
group_by(cyl, gear) %>% # multiple group columns
summarise(max_hp = max(hp), mean_mpg = mean(mpg)) # multiple summary columns
# summarise all columns except grouping columns using "sum"
mtcars %>%
group_by(cyl) %>%
summarise_all(sum)
# summarise all columns except grouping columns using "sum" and "mean"
mtcars %>%
group_by(cyl) %>%
summarise_all(list(sum = sum, mean = mean))
# multiple grouping columns
mtcars %>%
group_by(cyl, gear) %>%
summarise_all(list(sum = sum, mean = mean))
# summarise specific variables, not all
mtcars %>%
group_by(cyl, gear) %>%
summarise_at(vars(qsec, mpg, wt), list(sum = sum, mean = mean))
# summarise specific variables (numeric columns except grouping columns)
mtcars %>%
group_by(gear) %>%
summarise_if(is.numeric, mean)
For more information, including the %>% operator, see the introduction to dplyr.
asi*_*ira 66
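The answer provided by rcs works and is simple. However, if you are handling larger datasets and need a performance boost, there is a faster alternative: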
library(data.table)
data = data.table(Category=c("First","First","First","Second","Third", "Third", "Second"),
Frequency=c(10,15,5,2,14,20,3))
data[, sum(Frequency), by = Category]
# Category V1
# 1: First 30
# 2: Second 5
# 3: Third 34
system.time(data[, sum(Frequency), by = Category] )
# user system elapsed
# 0.008 0.001 0.009
Let's compare that with the same operation using a data.frame and the aggregate call above:
data = data.frame(Category=c("First","First","First","Second","Third", "Third", "Second"),
Frequency=c(10,15,5,2,14,20,3))
system.time(aggregate(data$Frequency, by=list(Category=data$Category), FUN=sum))
# user system elapsed
# 0.008 0.000 0.015
And if you want to keep the column name, this is the syntax:
data[,list(Frequency=sum(Frequency)),by=Category]
# Category Frequency
# 1: First 30
# 2: Second 5
# 3: Third 34
The difference becomes more noticeable with larger datasets, as the code below demonstrates:
data = data.table(Category=rep(c("First", "Second", "Third"), 100000),
Frequency=rnorm(300000))
system.time( data[,sum(Frequency),by=Category] )
# user system elapsed
# 0.055 0.004 0.059
data = data.frame(Category=rep(c("First", "Second", "Third"), 100000),
Frequency=rnorm(300000))
system.time( aggregate(data$Frequency, by=list(Category=data$Category), FUN=sum) )
# user system elapsed
# 0.287 0.010 0.296
For multiple aggregations, you can combine lapply and .SD as follows:
data[, lapply(.SD, sum), by = Category]
# Category Frequency
# 1: First 30
# 2: Second 5
# 3: Third 34
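When the table has more than one numeric column, .SDcols can restrict which columns .SD contains. A minimal sketch, assuming a hypothetical second column Metric2 that is not in the original data:

```r
library(data.table)

# Metric2 is a hypothetical column added only for illustration
dt <- data.table(
  Category  = c("First", "First", "First", "Second", "Third", "Third", "Second"),
  Frequency = c(10, 15, 5, 2, 14, 20, 3),
  Metric2   = c(1, 2, 3, 4, 5, 6, 7)
)

# Sum only the listed columns within each Category
res <- dt[, lapply(.SD, sum), by = Category, .SDcols = c("Frequency", "Metric2")]
print(res)
```

Groups come back in order of first appearance, which here happens to match the alphabetical order.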
Sha*_*ane 36
You can also use the by() function:
x2 <- by(x$Frequency, x$Category, sum)
do.call(rbind,as.list(x2))
Those other packages (plyr, reshape) have the benefit of returning a data.frame, but it's worth being familiar with by() since it's a base function.
lea*_*rnr 25
library(plyr)
ddply(tbl, .(Category), summarise, sum = sum(Frequency))
Dav*_*urg 24
Several years later, just to add another simple base R solution that for some reason isn't present here: xtabs
xtabs(Frequency ~ Category, df)
# Category
# First Second Third
# 30 5 34
Or, if you want a data.frame back:
as.data.frame(xtabs(Frequency ~ Category, df))
# Category Freq
# 1 First 30
# 2 Second 5
# 3 Third 34
Rob*_*man 19
If x is a data frame containing your data, then the following will do what you want:
require(reshape)
recast(x, Category ~ ., fun.aggregate=sum)
joe*_*nko 17
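While I have recently become a convert to dplyr for most of these types of operations, the sqldf package is still really nice (and IMHO more readable) for some things.
Here is an example of how this question can be answered with sqldf:
library(sqldf)
x <- data.frame(Category=factor(c("First", "First", "First", "Second",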
"Third", "Third", "Second")),
Frequency=c(10,15,5,2,14,20,3))
sqldf("select
Category
,sum(Frequency) as Frequency
from x
group by
Category")
## Category Frequency
## 1 First 30
## 2 Second 5
## 3 Third 34
dal*_*ogm 16
Just to add a third option:
require(doBy)
summaryBy(Frequency~Category, data=yourdataframe, FUN=sum)
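EDIT: this is a very old answer. Now I would recommend using group_by and summarise from dplyr, as in @docendo's answer.
I find ave very helpful (and efficient) when you need to apply different aggregation functions to different columns (and you must/want to stick to base R):
For example, given this input: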
DF <-
data.frame(Categ1=factor(c('A','A','B','B','A','B','A')),
Categ2=factor(c('X','Y','X','X','X','Y','Y')),
Samples=c(1,2,4,3,5,6,7),
Freq=c(10,30,45,55,80,65,50))
> DF
Categ1 Categ2 Samples Freq
1 A X 1 10
2 A Y 2 30
3 B X 4 45
4 B X 3 55
5 A X 5 80
6 B Y 6 65
7 A Y 7 50
We want to group by Categ1 and Categ2 and compute the sum of Samples and the mean of Freq.
Here is a possible solution using ave:
# create a copy of DF (only the grouping columns)
DF2 <- DF[,c('Categ1','Categ2')]
# add sum of Samples by Categ1,Categ2 to DF2
# (ave repeats the sum of the group for each row in the same group)
DF2$GroupTotSamples <- ave(DF$Samples,DF2,FUN=sum)
# add mean of Freq by Categ1,Categ2 to DF2
# (ave repeats the mean of the group for each row in the same group)
DF2$GroupAvgFreq <- ave(DF$Freq,DF2,FUN=mean)
# remove the duplicates (keep only one row for each group)
DF2 <- DF2[!duplicated(DF2),]
Result:
> DF2
Categ1 Categ2 GroupTotSamples GroupAvgFreq
1 A X 6 45
2 A Y 9 40
3 B X 7 50
6 B Y 6 65
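For comparison, the same table can probably also be built in base R without ave, by running one aggregate per function and merging on the grouping columns. This is only a sketch of an alternative, not the approach described above:

```r
DF <- data.frame(
  Categ1  = factor(c("A", "A", "B", "B", "A", "B", "A")),
  Categ2  = factor(c("X", "Y", "X", "X", "X", "Y", "Y")),
  Samples = c(1, 2, 4, 3, 5, 6, 7),
  Freq    = c(10, 30, 45, 55, 80, 65, 50)
)

# One aggregate per summary function, using the formula interface
sums  <- aggregate(Samples ~ Categ1 + Categ2, DF, sum)
means <- aggregate(Freq ~ Categ1 + Categ2, DF, mean)

# Join the two results on the grouping columns
res <- merge(sums, means, by = c("Categ1", "Categ2"))
print(res)
```

The trade-off is one pass over the data per summary function, which may matter for large inputs.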
Another solution, which returns the sums by group as a matrix or data frame and is short and fast:
rowsum(x$Frequency, x$Category)
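For a vector input, rowsum returns a one-column matrix whose row names are the group labels. A small sketch of turning that into a two-column data frame, using the question's data:

```r
x <- data.frame(
  Category  = c("First", "First", "First", "Second", "Third", "Third", "Second"),
  Frequency = c(10, 15, 5, 2, 14, 20, 3)
)

# One-column matrix; row names carry the (sorted) group labels
m <- rowsum(x$Frequency, x$Category)

# Move the row names into a regular column
res <- data.frame(Category = rownames(m), Frequency = m[, 1], row.names = NULL)
print(res)
```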
Since dplyr 1.0.0, the across() function can be used:
df %>%
group_by(Category) %>%
summarise(across(Frequency, sum))
Category Frequency
<chr> <int>
1 First 30
2 Second 5
3 Third 34
If you are interested in multiple variables:
df %>%
group_by(Category) %>%
summarise(across(c(Frequency, Frequency2), sum))
Category Frequency Frequency2
<chr> <int> <int>
1 First 30 55
2 Second 5 29
3 Third 34 190
And selecting variables with select helpers:
df %>%
group_by(Category) %>%
summarise(across(starts_with("Freq"), sum))
Category Frequency Frequency2 Frequency3
<chr> <int> <int> <dbl>
1 First 30 55 110
2 Second 5 29 58
3 Third 34 190 380
Sample data:
df <- read.table(text = "Category Frequency Frequency2 Frequency3
1 First 10 10 20
2 First 15 30 60
3 First 5 15 30
4 Second 2 8 16
5 Third 14 70 140
6 Third 20 120 240
7 Second 3 21 42",
header = TRUE,
stringsAsFactors = FALSE)
You can use the function group.sum from the package Rfast.
Category <- Rfast::as_integer(Category,result.sort=FALSE) # convert character to numeric. R's as.numeric produce NAs.
result <- Rfast::group.sum(Frequency,Category)
names(result) <- Rfast::Sort(unique(Category))
# 30 5 34
Rfast has many group functions, and group.sum is one of them.
Using cast instead of recast (note that 'Frequency' is now 'value'):
df <- data.frame(Category = c("First","First","First","Second","Third","Third","Second")
, value = c(10,15,5,2,14,20,3))
install.packages("reshape")
library(reshape)
result<-cast(df, Category ~ . ,fun.aggregate=sum)
To get:
Category (all)
First 30
Second 5
Third 34