Sum duplicates, then remove all but the first occurrence

Chr*_*ris 1 r plyr

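I have a data frame (~5000 rows, 6 columns) that contains some duplicated values of an id variable. I have another continuous variable, x, which I want to sum for each duplicated id. The observations are time dependent, with year and month variables. For each duplicated id, I want to keep the chronologically first observation and add the x values of the subsequent dupes to that first observation.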

I have included dummy data similar to mine: dat1. I have also included a data set that shows the structure of the result I am after: outcome.

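I have tried two strategies, neither of which gives me quite what I want (see below). The first strategy gives me the correct values for x, but I lose my year and month columns - I need to retain these for the first occurrence of every duplicated id. The second strategy does not sum the values of x correctly.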

Any suggestions on how to get my desired outcome would be much appreciated.

# dummy data set
set.seed(179)
dat1 <- data.frame(id = c(1234, 1321, 4321, 7423, 4321, 8503, 2961, 1234, 8564, 1234),
                   year = rep(c("2006", "2007"), each = 5),
                   month = rep(c("December", "January"), each = 5),
                   x = round(rnorm(10, 10, 3), 2))

# desired outcome
outcome <- data.frame(id = c(1234, 1321, 4321, 7423, 8503, 2961, 8564),
                      year = c(rep("2006", 4), rep("2007", 3)),
                      month = c(rep("December", 4), rep("January", 3)),
                      x = c(36.42, 11.55, 17.31, 5.97, 12.48, 10.22, 11.41))

# strategy 1:
library(plyr)
dat2 <- ddply(dat1, .(id), summarise, x = sum(x))

# strategy 2:
# partition into two data frames - one with unique cases, one with dupes
dat1_unique <- dat1[!duplicated(dat1$id), ]
dat1_dupes <- dat1[duplicated(dat1$id), ]

# merge these data frames while summing the x variable for duplicated ids
# with plyr
dat3 <- ddply(merge(dat1_unique, dat1_dupes, all.x = TRUE),
              .(id), summarise, x = sum(x))
# in base R
dat4 <- aggregate(x ~ id, data = merge(dat1_unique, dat1_dupes,
                  all.x = TRUE), FUN = sum)

42-*_*42- 5

My sums differ from yours, but that is because I forgot to set the seed:

> dat1$x <- ave(dat1$x, dat1$id, FUN=sum)
> dat1[!duplicated(dat1$id), ]
    id year    month     x
1 1234 2006 December 25.18
2 1321 2006 December 15.06
3 4321 2006 December 15.50
4 7423 2006 December  7.16
6 8503 2007  January 13.23
7 2961 2007  January  7.38
9 8564 2007  January  7.21

(It would be safer to work on a copy. You may also need to add an ordering step.)
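A minimal sketch of the copy-plus-ordering variant, in case it helps. It assumes the month column holds full English month names, so that base R's month.name constant can be used to rank them. The x values below are made up for illustration (they are not the rnorm() draws from the question), and res is just a hypothetical name for the working copy.

```r
# Illustrative data: same ids, years and months as dat1 in the question,
# but with made-up x values so the sums are easy to check by hand.
dat1 <- data.frame(id = c(1234, 1321, 4321, 7423, 4321, 8503, 2961, 1234, 8564, 1234),
                   year = rep(c("2006", "2007"), each = 5),
                   month = rep(c("December", "January"), each = 5),
                   x = c(10, 11, 12, 13, 14, 15, 16, 17, 18, 19))

# Ordering step: sort a copy chronologically by year, then by month.
# as.character() guards against year/month being stored as factors.
# Assumption: months are full English names, as found in month.name.
ord <- order(as.integer(as.character(dat1$year)),
             match(as.character(dat1$month), month.name))
res <- dat1[ord, ]

# Replace x with the per-id total, then keep only the first row of each id.
res$x <- ave(res$x, res$id, FUN = sum)
res <- res[!duplicated(res$id), ]

res
```

Because the work happens on res, dat1 keeps its original x values. order() leaves tied rows in their original relative order, so rows that share the same year and month stay in the order they had in dat1.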