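One design pattern I use over and over is performing a "group by" or "split, apply, combine (SAC)" on a data frame and then joining the aggregated data back to the original data. This is useful, for example, when calculating each county's deviation from the state mean in a data frame with many states and counties. My aggregate calculation is rarely just a simple mean, but the mean makes a good example. I often solve this problem the following way: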
require(plyr)
set.seed(1)
## set up some data
group1 <- rep(1:3, 4)
group2 <- sample(c("A","B","C"), 12, replace=TRUE)
values <- rnorm(12)
df <- data.frame(group1, group2, values)
## got some data, so let's aggregate
group1Mean <- ddply( df, "group1", function(x)
data.frame( meanValue = mean(x$values) ) )
df <- merge( df, group1Mean )
df
This produces nicely aggregated data like the following:
> df
group1 group2 values meanValue
1 1 A 0.48743 -0.121033
2 1 A -0.04493 -0.121033
3 1 C -0.62124 -0.121033
4 1 C -0.30539 -0.121033
5 2 A 1.51178 0.004804
6 2 B 0.73832 0.004804
7 2 A -0.01619 0.004804
8 2 B -2.21470 0.004804
9 3 B 1.12493 0.758598
10 3 C 0.38984 0.758598
11 3 B 0.57578 0.758598
12 3 A 0.94384 0.758598
This works, but are there alternative ways of doing this that improve readability, performance, etc.?
And*_*rie 18
One line of code does the trick:
new <- ddply( df, "group1", transform, meanValue = mean(values))
new
group1 group2 values meanValue
1 1 A 0.48742905 -0.121033381
2 1 A -0.04493361 -0.121033381
3 1 C -0.62124058 -0.121033381
4 1 C -0.30538839 -0.121033381
5 2 A 1.51178117 0.004803931
6 2 B 0.73832471 0.004803931
7 2 A -0.01619026 0.004803931
8 2 B -2.21469989 0.004803931
9 3 B 1.12493092 0.758597929
10 3 C 0.38984324 0.758597929
11 3 B 0.57578135 0.758597929
12 3 A 0.94383621 0.758597929
identical(df, new)
[1] TRUE
Rei*_*son 13
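I think ave() is more useful here than the plyr call you show (I'm not familiar enough with plyr to know whether you can do what you want with plyr directly, but I would be surprised if you couldn't!) or the other base R alternatives (aggregate(), tapply()):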
> with(df, ave(values, group1, FUN = mean))
[1] -0.121033381 0.004803931 0.758597929 -0.121033381 0.004803931
[6] 0.758597929 -0.121033381 0.004803931 0.758597929 -0.121033381
[11] 0.004803931 0.758597929
You can use within() or transform() to embed this result directly into df:
> df2 <- within(df, meanValue <- ave(values, group1, FUN = mean))
> head(df2)
group1 group2 values meanValue
1 1 A 0.4874291 -0.121033381
2 2 B 0.7383247 0.004803931
3 3 B 0.5757814 0.758597929
4 1 C -0.3053884 -0.121033381
5 2 A 1.5117812 0.004803931
6 3 C 0.3898432 0.758597929
> df3 <- transform(df, meanValue = ave(values, group1, FUN = mean))
> all.equal(df2,df3)
[1] TRUE
If the ordering is important:
> head(df2[order(df2$group1, df2$group2), ])
group1 group2 values meanValue
1 1 A 0.48742905 -0.121033381
10 1 A -0.04493361 -0.121033381
4 1 C -0.30538839 -0.121033381
7 1 C -0.62124058 -0.121033381
5 2 A 1.51178117 0.004803931
11 2 A -0.01619026 0.004803931
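Since the question notes that the aggregate is rarely a plain mean, here is a minimal sketch of the same ave() pattern with other summaries, plus the tapply() route for comparison. The data frame `toy` and the column names `groupMean`, `deviation` and `groupMedian` are made up for illustration and are not part of the original example.

```r
# Small hypothetical data set (not the df from the question)
toy <- data.frame(
  group1 = c(1, 2, 1, 2, 3, 3),
  values = c(2, 10, 4, 20, 7, 9)
)

# ave() returns a vector as long as its input, in the original row order,
# so the result can be assigned straight into a new column.
toy$groupMean   <- ave(toy$values, toy$group1, FUN = mean)
toy$deviation   <- toy$values - toy$groupMean
toy$groupMedian <- ave(toy$values, toy$group1, FUN = median)

# The same per-row means via tapply(): one value per group,
# then a lookup by group label to expand back to one value per row.
groupMeans <- tapply(toy$values, toy$group1, mean)
lookup     <- as.vector(groupMeans[as.character(toy$group1)])

toy
```

The tapply() version needs the extra lookup step because it collapses to one value per group, which is the step ave() saves you.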
fra*_*nkc 13
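In terms of performance, you can do this same kind of operation with the data.table package, which has built-in aggregation and is very fast thanks to indices and a C-based implementation. For instance, given that df already exists from your example: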
library("data.table")
dt<-as.data.table(df)
setkey(dt,group1)
dt<-dt[,list(group2,values,meanValue=mean(values)),by=group1]
dt
group1 group2 values meanValue
[1,] 1 A 0.82122120 0.18810771
[2,] 1 C 0.78213630 0.18810771
[3,] 1 C 0.61982575 0.18810771
[4,] 1 A -1.47075238 0.18810771
[5,] 2 B 0.59390132 0.03354688
[6,] 2 A 0.07456498 0.03354688
[7,] 2 B -0.05612874 0.03354688
[8,] 2 A -0.47815006 0.03354688
[9,] 3 B 0.91897737 -0.20205707
[10,] 3 C -1.98935170 -0.20205707
[11,] 3 B -0.15579551 -0.20205707
[12,] 3 A 0.41794156 -0.20205707
I have not benchmarked it, but in my experience it is a lot faster.
If you decide to go down the data.table road, which I think is worth exploring if you work with large data sets, you really need to read the docs, because there are some differences from data frames that can bite you if you are unaware of them. Notably, though, data.table generally works with any function expecting a data frame, because a data.table also reports its class as data.frame (data.table inherits from data.frame).
[ Feb 2011 ]
[ Aug 2012 ] Update from Matthew :
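New in v1.8.2, released to CRAN in July 2012, is := by group. This is very similar to the answer above, but it adds the new column by reference to dt, so there is no copy, and no need for a merge step or for relisting the existing columns to return them alongside the aggregate. There is no need to setkey first, and it copes with non-contiguous groups (i.e. groups that are not stored together).
This is significantly faster for large datasets, and the syntax is short and simple: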
dt <- as.data.table(df)
dt[, meanValue := mean(values), by = group1]
Can't you just add x to the function you pass to ddply?
df <- ddply( df, "group1", function(x)
data.frame( x, meanValue = mean(x$values) ) )
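To make the mechanics concrete, here is a rough base R sketch of what this call does conceptually: split by group, return each piece with the new column attached, then stack the pieces. It is an illustration of the idea only, not plyr's actual implementation, and the names `toy`, `pieces`, `withMean` and `result` are hypothetical.

```r
# Hypothetical toy data, standing in for the df from the question
toy <- data.frame(
  group1 = c(1, 2, 1, 2),
  values = c(1, 10, 3, 30)
)

# Split: one data frame per level of group1
pieces <- split(toy, toy$group1)

# Apply: return the whole piece (x) plus the new column,
# just as the function passed to ddply does
withMean <- lapply(pieces, function(x) {
  data.frame(x, meanValue = mean(x$values))
})

# Combine: stack the pieces back into a single data frame
result <- do.call(rbind, withMean)
rownames(result) <- NULL
result
```

Because the function returns every row of x rather than a one-row summary, no separate merge step is needed afterwards. Note that the rows come back ordered by group.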