I have a data set that looks like this:
date value class
1984-04-01 95.32384 A
1984-04-01 39.86818 B
1984-07-01 43.57983 A
1984-07-01 10.83754 B
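Now I want to group the data by date and subtract the class B value from the class A value. I have looked into ddply, summarize, melt and aggregate, but cannot quite get what I want. Is there an easy way to do this? Note that I have exactly two values per date, one of class A and one of class B. I could rearrange the data into two data frames, order them by date and class, and merge them again, but I feel there must be a more R-like way to do it.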
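For reference, the split-and-merge approach described above might look like the following minimal sketch. It uses the four sample rows from the question; the names `dat`, `a`, `b` and `merged` are hypothetical and chosen only for illustration.

```r
# Sample rows copied from the question; `dat` is a hypothetical name.
dat <- data.frame(
  date  = as.Date(c("1984-04-01", "1984-04-01", "1984-07-01", "1984-07-01")),
  value = c(95.32384, 39.86818, 43.57983, 10.83754),
  class = c("A", "B", "A", "B")
)

# Split into one data frame per class, keeping only date and value.
a <- dat[dat$class == "A", c("date", "value")]
b <- dat[dat$class == "B", c("date", "value")]

# Merge on date; the suffixes distinguish the two value columns.
merged <- merge(a, b, by = "date", suffixes = c("_A", "_B"))

# Subtract the class B value from the class A value for each date.
merged$diff <- merged$value_A - merged$value_B
print(merged)
```

This works, but it needs several intermediate objects, which is what motivates looking for a more compact alternative.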
Assume this data frame (generated as in Prasad's post, but with set.seed for reproducibility):
set.seed(123)
DF <- data.frame( date = rep(seq(as.Date('1984-04-01'),
as.Date('1984-04-01') + 3, by=1),
1, each=2),
class = rep(c('A','B'), 4),
value = sample(1:8))
Then we consider seven solutions:
1) zoo gives a one-line solution (not counting the library statement):
library(zoo)
z <- with(read.zoo(DF, split = 2), A - B)
giving this zoo series:
> z
1984-04-01 1984-04-02 1984-04-03 1984-04-04
-3 3 3 -5
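Note also that as.data.frame(z) or data.frame(time = time(z), value = coredata(z)) gives a data frame; however, you may want to leave it as a zoo object, since it is a time series and other operations on it, e.g. plot(z), are more convenient in that form.
2) sqldf also gives a one-statement solution (apart from the library call):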
> library(sqldf)
> sqldf("select date, sum(((class = 'A') - (class = 'B')) * value) as value
+ from DF group by date")
date value
1 1984-04-01 -3
2 1984-04-02 3
3 1984-04-03 3
4 1984-04-04 -5
3) tapply can be used as the basis of a solution inspired by the sqldf solution:
> with(DF, tapply(((class =="A") - (class == "B")) * value, date, sum))
1984-04-01 1984-04-02 1984-04-03 1984-04-04
-3 3 3 -5
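The sqldf and tapply solutions rely on the same trick: a logical comparison is coerced to 1 or 0 in arithmetic, so (class == "A") - (class == "B") is +1 for class A rows and -1 for class B rows. A minimal sketch with hypothetical toy data (the names `toy`, `sgn` and `res` are illustrative, not from the answer):

```r
# Hypothetical toy data: one A and one B value per date.
toy <- data.frame(
  date  = as.Date(c("1984-04-01", "1984-04-01", "1984-04-02", "1984-04-02")),
  class = c("A", "B", "A", "B"),
  value = c(5, 8, 7, 4)
)

# TRUE/FALSE are coerced to 1/0, so this is +1 for A and -1 for B.
sgn <- (toy$class == "A") - (toy$class == "B")

# Summing the signed values within each date gives A - B per date.
res <- tapply(sgn * toy$value, toy$date, sum)
print(res)
```

Because each date has exactly one A and one B, the per-date sum of the signed values equals A minus B.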
4) aggregate can be used in the same way as sqldf and tapply above (although a slightly different solution based on aggregate has already appeared):
> aggregate(((DF$class=="A") - (DF$class=="B")) * DF["value"], DF["date"], sum)
date value
1 1984-04-01 -3
2 1984-04-02 3
3 1984-04-03 3
4 1984-04-04 -5
5) summaryBy from the doBy package provides another solution, although it does need a transform to help it:
> library(doBy)
> summaryBy(value ~ date, transform(DF, value = ((class == "A") - (class == "B")) * value), FUN = sum, keep.names = TRUE)
date value
1 1984-04-01 -3
2 1984-04-02 3
3 1984-04-03 3
4 1984-04-04 -5
6) remix from the remix package can also do this, again with the help of transform, and it features particularly pretty output:
> library(remix)
> remix(value ~ date, transform(DF, value = ((class == "A") - (class == "B")) * value), sum)
value ~ date
============
+------+------------+-------+-----+
| | sum |
+======+============+=======+=====+
| date | 1984-04-01 | value | -3 |
+ +------------+-------+-----+
| | 1984-04-02 | value | 3 |
+ +------------+-------+-----+
| | 1984-04-03 | value | 3 |
+ +------------+-------+-----+
| | 1984-04-04 | value | -5 |
+------+------------+-------+-----+
7) summary.formula in the Hmisc package also has nice output:
> library(Hmisc)
> summary(value ~ date, data = transform(DF, value = ((class == "A") - (class == "B")) * value), fun = sum, overall = FALSE)
value N=8
+----+----------+-+-----+
| | |N|value|
+----+----------+-+-----+
|date|1984-04-01|2|-3 |
| |1984-04-02|2| 3 |
| |1984-04-03|2| 3 |
| |1984-04-04|2|-5 |
+----+----------+-+-----+
The simplest way I can think of is to use dcast from the reshape2 package to create a data frame with one row per date and columns A and B, and then use transform to compute A - B:
df <- data.frame( date = rep(seq(as.Date('1984-04-01'),
as.Date('1984-04-01') + 3, by=1),
1, each=2),
class = rep(c('A','B'), 4),
value = sample(1:8))
library(reshape2)
# current reshape2 names this argument value.var (early releases used value_var)
df_wide <- dcast(df, date ~ class, value.var = 'value')
> df_wide
date A B
1 1984-04-01 8 7
2 1984-04-02 6 1
3 1984-04-03 3 4
4 1984-04-04 5 2
> transform( df_wide, A_B = A - B )
date A B A_B
1 1984-04-01 8 7 1
2 1984-04-02 6 1 5
3 1984-04-03 3 4 -1
4 1984-04-04 5 2 3
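If reshape2 is not available, the same widen-then-subtract idea can be sketched in base R with reshape(). This is an alternative illustration, not part of the answer above; the data frame `long` is hypothetical fixed data in the same long layout, and note that reshape() names the new columns value.A and value.B rather than A and B.

```r
# Hypothetical fixed data in the same long layout as df above.
long <- data.frame(
  date  = rep(as.Date("1984-04-01") + 0:1, each = 2),
  class = rep(c("A", "B"), 2),
  value = c(8, 7, 6, 1)
)

# Widen: one row per date, one column per class (value.A, value.B).
wide <- reshape(long, idvar = "date", timevar = "class", direction = "wide")

# Subtract the B column from the A column.
wide <- transform(wide, A_B = value.A - value.B)
print(wide)
```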
In base R, I would solve the problem using aggregate and sum. This works by converting each class B value to its negative:
(using the data provided by @PrasadChalasani)
df <- within(df, value[class=="B"] <- -value[class=="B"])
aggregate(df$value, by=list(date=df$date), sum)
date x
1 1984-04-01 3
2 1984-04-02 2
3 1984-04-03 2
4 1984-04-04 1
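One thing to keep in mind is that the within() call overwrites df, so running it a second time would flip the B values back to positive. A possible variant, sketched here under the assumption that you would rather leave the original data untouched, builds a signed copy with ifelse and uses the formula interface of aggregate. The names `long2`, `signed_df` and `out` are hypothetical.

```r
# Hypothetical fixed data: one A and one B value per date.
long2 <- data.frame(
  date  = rep(as.Date("1984-04-01") + 0:1, each = 2),
  class = rep(c("A", "B"), 2),
  value = c(8, 7, 6, 1)
)

# Add a signed column instead of modifying `value` in place.
signed_df <- transform(long2, signed = ifelse(class == "A", value, -value))

# The formula interface keeps meaningful column names in the result.
out <- aggregate(signed ~ date, data = signed_df, FUN = sum)
print(out)
```

The original `value` column stays unchanged, so the code can be rerun safely.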