Sbu*_*g13 0 group-by r subset moving-average dplyr
Here is a small representative sample of my data:
Team <- rep(c("ind", "sas", "ind", "sas"),c(4,8,2,4))
Player <- c("Paul George", "David West", "Roy Hibbert",
"Paul George", "Tim Duncan", "Manuel Ginobili",
"Tony Parker", "Boris Diaw","Danny Green",
"Kawhi Leonard", "Matt Bonner", "Patty Mills",
"George Hill", "C.J.Miles","Tim Duncan",
"Manuel Ginobili", "Tony Parker", "Boris Diaw")
Team_PTS <- c(101,101,101,98,105,105,105,105,
105,105,105,105,98,98,89,89,89,128)
Date <- as.Date(c("2015-05-14", "2015-05-14", "2015-05-14",
"2015-05-16","2015-05-15", "2015-05-15", "2015-05-15",
"2015-05-15","2015-05-15", "2015-05-15", "2015-05-15",
"2015-05-15","2015-05-16","2015-05-16","2015-05-29",
"2015-05-29","2015-05-29","2015-06-03"))
Team_Gamenumber <- rep(c(1,2,1,2,2,3),c(3,1,8,2,3,1))
df <- data.frame(Team,Player,Team_PTS,Date, Team_Gamenumber)
df
   Team          Player Team_PTS       Date Team_Gamenumber Desired_output
1   ind     Paul George      101 2015-05-14               1            101
2   ind      David West      101 2015-05-14               1            101
3   ind     Roy Hibbert      101 2015-05-14               1            101
4   ind     Paul George       98 2015-05-16               2           99.5
5   sas      Tim Duncan      105 2015-05-15               1            105
6   sas Manuel Ginobili      105 2015-05-15               1            105
7   sas     Tony Parker      105 2015-05-15               1            105
8   sas      Boris Diaw      105 2015-05-15               1            105
9   sas     Danny Green      105 2015-05-15               1            105
10  sas   Kawhi Leonard      105 2015-05-15               1            105
11  sas     Matt Bonner      105 2015-05-15               1            105
12  sas     Patty Mills      105 2015-05-15               1            105
13  ind     George Hill       98 2015-05-16               2           99.5
14  ind       C.J.Miles       98 2015-05-16               2           99.5
15  sas      Tim Duncan       89 2015-05-29               2             97
16  sas Manuel Ginobili       89 2015-05-29               2             97
17  sas     Tony Parker       89 2015-05-29               2             97
18  sas      Boris Diaw      128 2015-06-03               3         107.33
The desired output variable is the moving (cumulative) average of the team points, computed per team (sas and ind in this example).
I tried:
library(dplyr)
df %>% group_by(Team) %>%
mutate(cumavg_PTS = cumsum(Team_PTS) / seq_along(Team_PTS))
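However, because the data has one row per player, this produces the wrong output. Note that Boris Diaw missed game 2 but played in game 3.
Also, I don't think cumsum is the right approach here, because the average would be affected by the number of players in each game.
The 107.33 is the average over the first 3 games: (105 + 89 + 128) / 3.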
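To make the difference concrete, here is a minimal base R sketch using only the sas values from the table above. The names `sas_pts`, `sas_game`, `naive` and `per_game` are made up for this illustration and are not part of the original data.

```r
# Team_PTS and Team_Gamenumber for the sas rows, one entry per player row
sas_pts  <- c(rep(105, 8), rep(89, 3), 128)
sas_game <- c(rep(1, 8), rep(2, 3), 3)

# Row-wise cumulative average: game 1 is counted 8 times, game 2 three times
naive <- cumsum(sas_pts) / seq_along(sas_pts)
naive[length(naive)]   # 1235 / 12, about 102.92

# Per-game average: keep one score per game, then average
per_game <- sas_pts[!duplicated(sas_game)]
mean(per_game)         # 322 / 3, about 107.33
```

The row-wise version weights each game by how many players appear in it, which is why it drifts away from the desired 107.33.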
Here's another way. I'd use data.table for this:
require(data.table)
setDT(df)[, cavg := { dups = !duplicated(Team_Gamenumber)
cumsum(Team_PTS * dups) / cumsum(dups)
}, by = Team]
Or just write a function:
foo <- function(points, game) {
dups = !duplicated(game)
cumsum(points * dups) / cumsum(dups)
}
setDT(df)[, cavg := foo(Team_PTS, Team_Gamenumber), by = Team]
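If data.table is not available, the same idea (count each game once per team, then divide the running total by the running game count) can be sketched in base R with `ave()`. This is an illustrative alternative and not part of the answer above. It rebuilds only the three columns it needs, and `first` is a helper name introduced here.

```r
# Rebuild the relevant columns of the question's data
df <- data.frame(
  Team            = rep(c("ind", "sas", "ind", "sas"), c(4, 8, 2, 4)),
  Team_PTS        = c(rep(101, 3), 98, rep(105, 8), 98, 98, rep(89, 3), 128),
  Team_Gamenumber = rep(c(1, 2, 1, 2, 2, 3), c(3, 1, 8, 2, 3, 1))
)

# TRUE on the first row of each (Team, game) pair, FALSE on the repeats
first <- !duplicated(df[c("Team", "Team_Gamenumber")])

# Running sum of points over first rows, divided by running count of games,
# both computed within each team while keeping the original row order
df$cavg <- ave(df$Team_PTS * first, df$Team, FUN = cumsum) /
  ave(as.numeric(first), df$Team, FUN = cumsum)

df
```

On this data the `cavg` column should reproduce the Desired_output column from the question.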
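There is still a difference between @bgoldst's and @jeremycg's solutions. @bgoldst's computes the cumulative average on the data sorted by Team, Team_Gamenumber while preserving the original row order, whereas @jeremycg's computes it on the data as is.
For example, in your df, swap the game numbers for the ind team as follows (games 1 and 2 trade places):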
setDT(df)[c(1:4,13:14), Team_Gamenumber := c(2,2,2,1,1,1)]
setDF(df)
Then try both versions.
We can get both answers while preserving the original order of the data, as follows:
# @jeremycg's
setDT(df)[, cavg := foo(Team_PTS, Team_Gamenumber), by = Team]
# @bgoldst's
setDT(df)[order(Team, Team_Gamenumber), cavg := foo(Team_PTS, Team_Gamenumber), by = Team]
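To see what the two versions return on the swapped ind rows, here is a small base R sketch that avoids data.table. It reuses the logic of foo() from above. The `ind` data frame and the names `as_is`, `sorted` and `o` are made up for this illustration.

```r
# Same logic as foo() above: cumulative mean over the first row of each game
foo <- function(points, game) {
  first <- !duplicated(game)
  cumsum(points * first) / cumsum(first)
}

# Only the ind rows, with the game numbers swapped as in the example
ind <- data.frame(
  Team_PTS        = c(101, 101, 101, 98, 98, 98),
  Team_Gamenumber = c(2, 2, 2, 1, 1, 1)
)

# "As is": games are averaged in the order in which the rows appear
as_is <- foo(ind$Team_PTS, ind$Team_Gamenumber)

# Sorted: order by game number, compute, then write results back in row order
o <- order(ind$Team_Gamenumber)
sorted <- numeric(nrow(ind))
sorted[o] <- foo(ind$Team_PTS[o], ind$Team_Gamenumber[o])

cbind(ind, as_is, sorted)
```

Here `as_is` starts at 101 because the rows labelled game 2 come first, while `sorted` gives 98 to the game 1 rows and 99.5 to the game 2 rows.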