Moving average with multiple group-bys

Sbu*_*g13 0 group-by r subset moving-average dplyr

Here is a small representative sample of my data:

Team <- rep(c("ind", "sas", "ind", "sas"),c(4,8,2,4))

Player <- c("Paul George", "David West", "Roy Hibbert",
            "Paul George", "Tim Duncan", "Manuel Ginobili",
            "Tony Parker", "Boris Diaw","Danny Green", 
            "Kawhi Leonard", "Matt Bonner", "Patty Mills",
            "George Hill", "C.J.Miles","Tim Duncan",
            "Manuel Ginobili", "Tony Parker", "Boris Diaw")

Team_PTS <- c(101,101,101,98,105,105,105,105,
              105,105,105,105,98,98,89,89,89,128)

Date <- as.Date(c("2015-05-14", "2015-05-14", "2015-05-14",
               "2015-05-16","2015-05-15", "2015-05-15", "2015-05-15",
               "2015-05-15","2015-05-15", "2015-05-15", "2015-05-15",
               "2015-05-15","2015-05-16","2015-05-16","2015-05-29",
               "2015-05-29","2015-05-29","2015-06-03"))

Team_Gamenumber <- rep(c(1,2,1,2,2,3),c(3,1,8,2,3,1))

df <- data.frame(Team,Player,Team_PTS,Date, Team_Gamenumber)

df

   Team          Player Team_PTS       Date Team_Gamenumber Desired_output
1   ind     Paul George      101 2015-05-14               1            101
2   ind      David West      101 2015-05-14               1            101
3   ind     Roy Hibbert      101 2015-05-14               1            101
4   ind     Paul George       98 2015-05-16               2           99.5
5   sas      Tim Duncan      105 2015-05-15               1            105
6   sas Manuel Ginobili      105 2015-05-15               1            105
7   sas     Tony Parker      105 2015-05-15               1            105
8   sas      Boris Diaw      105 2015-05-15               1            105
9   sas     Danny Green      105 2015-05-15               1            105
10  sas   Kawhi Leonard      105 2015-05-15               1            105
11  sas     Matt Bonner      105 2015-05-15               1            105
12  sas     Patty Mills      105 2015-05-15               1            105
13  ind     George Hill       98 2015-05-16               2           99.5
14  ind       C.J.Miles       98 2015-05-16               2           99.5
15  sas      Tim Duncan       89 2015-05-29               2             97
16  sas Manuel Ginobili       89 2015-05-29               2             97
17  sas     Tony Parker       89 2015-05-29               2             97
18  sas      Boris Diaw      128 2015-06-03               3         107.33

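The desired output variable is the moving (cumulative) average of team points, computed separately for each team (sas and ind in this example).

I tried: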

library(dplyr)
df %>% group_by(Team) %>%
       mutate(cumavg_PTS = cumsum(Team_PTS) / seq_along(Team_PTS))

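However, because the data are organized by player (one row per player per game), this produces the wrong output. Note that Boris Diaw missed game 2 but played in game 3.

Also, I don't think cumsum is the right approach here, because the average is affected by the number of players in each game.

The value 107.33 is the average of the first 3 games: (105 + 89 + 128) / 3.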
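To make the problem concrete, here is a minimal base R sketch of the arithmetic for sas. The vector names (`sas_games`, `sas_rows`, `per_game`, `per_row`) are illustrative and not part of the original data; the row counts 8, 3 and 1 are taken from the sample above.

```r
# Points scored by sas in games 1, 2 and 3: one value per game
sas_games <- c(105, 89, 128)

# Cumulative average over games, which is the desired quantity
per_game <- cumsum(sas_games) / seq_along(sas_games)

# The same games stored as one row per player (8, 3 and 1 rows), as in df
sas_rows <- rep(sas_games, times = c(8, 3, 1))

# Cumulative average over rows: each game is weighted by its player count
per_row <- cumsum(sas_rows) / seq_along(sas_rows)

round(per_game, 2)
round(per_row[c(8, 11, 12)], 2)  # last row of each game
```

The per-game version ends at about 107.33, matching the desired output, while the per-row version ends at about 102.92 because game 1 is counted eight times.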

Aru*_*run 5

Here is another way. I would use data.table for this:

require(data.table)
setDT(df)[, cavg := { dups = !duplicated(Team_Gamenumber)
                      cumsum(Team_PTS * dups) / cumsum(dups)
                    }, by = Team]

Or just write a function:

foo <- function(points, game) {
    dups = !duplicated(game)
    cumsum(points * dups) / cumsum(dups)
}
setDT(df)[, cavg := foo(Team_PTS, Team_Gamenumber), by = Team]
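As a rough illustration of how the helper works, the sketch below runs it on a small made-up pair of vectors (not the full data set). `first_row` is a name introduced here for the `!duplicated(game)` mask; `foo` is repeated only so that the snippet runs on its own.

```r
# Same helper as above, repeated so this snippet is self-contained
foo <- function(points, game) {
    dups <- !duplicated(game)
    cumsum(points * dups) / cumsum(dups)
}

# Hypothetical input: two rows for game 1, two for game 2, one for game 3
points <- c(105, 105, 89, 89, 128)
game   <- c(1, 1, 2, 2, 3)

# TRUE only on the first row of each game, so each game is counted once
first_row <- !duplicated(game)

# Numerator: running sum of points over first rows only
# Denominator: running count of distinct games seen so far
cavg <- foo(points, game)
cavg
```

Every row of a game receives the same cumulative average, because the repeated rows add 0 to both the numerator and the denominator.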

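There is still a difference between @bgoldst's and @jeremycg's solutions. @bgoldst's computes the cumulative average on the data sorted by Team, Team_Gamenumber while preserving the original row order, whereas @jeremycg's computes it on the data as is, in its current row order.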

For example, swap the game numbers for ind in your df as follows:

setDT(df)[c(1:4,13:14), Team_Gamenumber := c(2,2,2,1,1,1)]
setDF(df)

Then try both versions.

We can get both answers while preserving the original order of the data, as follows:

# @jeremycg's
setDT(df)[, cavg := foo(Team_PTS, Team_Gamenumber), by = Team]
# @bgoldst's
setDT(df)[order(Team, Team_Gamenumber), cavg := foo(Team_PTS, Team_Gamenumber), by = Team]
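To see the difference without data.table, here is a base R sketch that imitates the two calls on the six ind rows after the swap above. This is only an approximation of what the data.table expressions do; `pts`, `game`, `as_is`, `sorted` and `o` are names made up for this example, and `foo` is repeated so that the snippet runs on its own.

```r
# Same helper as above, repeated so this snippet is self-contained
foo <- function(points, game) {
    dups <- !duplicated(game)
    cumsum(points * dups) / cumsum(dups)
}

# The six ind rows (1:4, 13:14) after the game numbers have been swapped
pts  <- c(101, 101, 101, 98, 98, 98)
game <- c(2, 2, 2, 1, 1, 1)

# As-is version: games are accumulated in the order the rows appear
as_is <- foo(pts, game)

# Sorted version: compute in game order, then write the results back
# to the original row positions
o <- order(game)
sorted <- numeric(length(pts))
sorted[o] <- foo(pts[o], game[o])

as_is
sorted
```

With the rows as they stand, the game labelled 2 is treated as the first game, so its rows get 101; after sorting by game number, the game labelled 1 comes first and its rows get 98. Both versions agree on 99.5 once two games have been seen.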