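I work with sports data in R a lot, and I keep running into the same problem with dplyr::group_by() when I try to compute summary statistics. I have the following data frame, which contains predicted points for every World Cup group-stage match: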
dput(worldcup.df)
structure(list(teamA_name = c("Russia", "Egypt", "Morocco", "Portugal",
"France", "Argentina", "Peru", "Croatia", "Costa Rica", "Germany",
"Brazil", "Sweden", "Belgium", "Tunisia", "Colombia", "Poland",
"Russia", "Portugal", "Uruguay", "Iran", "Denmark", "France",
"Argentina", "Brazil", "Nigeria", "Serbia", "Belgium", "Korea Republic",
"Germany", "England", "Japan", "Poland", "Uruguay", "Saudi Arabia",
"Iran", "Spain", "Denmark", "Australia", "Nigeria", "Iceland",
"Mexico", "Korea Republic", "Serbia", "Switzerland", "Japan",
"Senegal", "Panama", "England"), teamB_name = c("Saudi Arabia",
"Uruguay", "Iran", "Spain", "Australia", "Iceland", "Denmark",
"Nigeria", "Serbia", "Mexico", "Switzerland", "Korea Republic",
"Panama", "England", "Japan", "Senegal", "Egypt", "Morocco",
"Saudi Arabia", "Spain", "Australia", "Peru", "Croatia", "Costa Rica",
"Iceland", "Switzerland", "Tunisia", "Mexico", "Sweden", "Panama",
"Senegal", "Colombia", "Russia", "Egypt", "Portugal", "Morocco",
"France", "Peru", "Argentina", "Croatia", "Sweden", "Germany",
"Brazil", "Costa Rica", "Poland", "Colombia", "Tunisia", "Belgium"
), epA = c(1.64, 0.7051, 1.1294, 1.1116, 2.1962, 1.984, 1.5765,
1.865, 1.2845, 2.0889, 2.1384, 1.5034, 2.1706, 0.5859, 2.1741,
1.6272, 1.4941, 2.1482, 2.2089, 0.635, 1.7694, 1.6016, 1.7816,
2.4745, 1.0762, 1.0326, 2.198, 1.0414, 2.2583, 2.198, 1.1264,
1.0471, 1.9565, 1.2201, 0.8364, 2.3633, 0.9337, 0.7922, 0.5665,
1.1593, 1.5544, 0.4698, 0.4331, 1.7843, 0.8872, 0.8157, 1.3932,
1.3932), epB = c(1.094, 2.0809, 1.6016, 1.6204, 0.6098, 0.787,
1.1535, 0.89, 1.4405, 0.6981, 0.6576, 1.2226, 0.6304, 2.2251,
0.6279, 1.1058, 1.2319, 0.6488, 0.5991, 2.165, 0.9756, 1.1294,
0.9644, 0.3895, 1.6588, 1.7064, 0.608, 1.6966, 0.5597, 0.608,
1.6046, 1.6909, 0.8105, 1.5069, 1.9266, 0.4757, 1.8163, 1.9778,
2.2495, 1.5697, 1.1746, 2.3712, 2.4179, 0.9617, 1.8688, 1.9503,
1.3308, 1.3308)), .Names = c("teamA_name", "teamB_name", "epA",
"epB"), class = "data.frame", row.names = c(NA, -48L))
head(worldcup.df)
  teamA_name   teamB_name    epA    epB
1     Russia Saudi Arabia 1.6400 1.0940
2      Egypt      Uruguay 0.7051 2.0809
3    Morocco         Iran 1.1294 1.6016
4   Portugal        Spain 1.1116 1.6204
5     France    Australia 2.1962 0.6098
6  Argentina      Iceland 1.9840 0.7870
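I have computed epA and epB as the expected points for team A and team B in each match, and now I want to use group_by() to compute the total expected points for each of the 32 teams. What I have historically done is this: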
asAgroupby = worldcup.df %>%
  dplyr::group_by(teamA_name) %>%
  dplyr::summarise(totPts = sum(epA))

asBgroupby = worldcup.df %>%
  dplyr::group_by(teamB_name) %>%
  dplyr::summarise(totPts = sum(epB))

outputdf = asAgroupby %>%
  dplyr::left_join(asBgroupby, by = c('teamA_name'='teamB_name')) %>%
  dplyr::mutate(totPts = totPts.x + totPts.y) %>%
  dplyr::select(-one_of(c('totPts.x', 'totPts.y')))
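Two separate group_by() calls on the teamA and teamB columns, then a left_join, then summing the columns and dropping the leftovers... yuck. And this is the simple case: exactly 4 columns (2 identifier columns and 2 stat columns). Since a lot of sports data has home/away team columns, this is a common problem.
I feel like I need one data frame with twice as many rows and half as many columns, so that I can do a single group_by(). Any help is appreciated, thanks!
Edit: worldcup.df is itself built by a long %>% chain of dplyr functions. Bonus points if this can be done without creating a new variable, and instead just: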
worldcup.df %>%
...
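One way to get the "twice the rows, half the columns" shape inside a single pipe is to stack the team A and team B halves with dplyr::bind_rows(). The following is only a sketch: toy.df is a hypothetical three-match subset of worldcup.df with the same column names, and stacked_totals is an illustrative name. The braces are a magrittr feature that lets the left-hand side (.) be used more than once.

```r
library(dplyr)

# Hypothetical toy subset: three matches copied from worldcup.df.
toy.df <- data.frame(
  teamA_name = c("Russia", "Egypt", "Russia"),
  teamB_name = c("Saudi Arabia", "Uruguay", "Egypt"),
  epA = c(1.64, 0.7051, 1.4941),
  epB = c(1.094, 2.0809, 1.2319),
  stringsAsFactors = FALSE
)

stacked_totals <- toy.df %>%
  # Braces stop the pipe from inserting `.` as the first argument,
  # so the same data frame can feed both halves of the stack.
  { bind_rows(
      transmute(., team = teamA_name, ep = epA),
      transmute(., team = teamB_name, ep = epB)
    ) } %>%
  group_by(team) %>%
  summarise(totPts = sum(ep))

print(stacked_totals)
```

Unlike the left_join() version, this keeps teams that appear on only one side, because every row is stacked before grouping.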
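Here is a tidyverse workflow that works by reshaping the data into long format. It keeps track of which teams played in the same match (game_id) and whether each was team A or team B, in case that is useful. (In fairness, this is the same basic idea as @Emil's answer, just a different workflow for getting there.)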
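library(tidyverse)

worldcup.long <- worldcup.df %>%
  as_tibble() %>%                        # as_data_frame() is deprecated
  mutate(game_id = 1:n()) %>%
  # stack every column into key/value pairs, one row per game and column
  gather(key, value, -game_id) %>%
  mutate(
    AB = str_extract(key, "A|B"),        # which side: team A or team B
    key = str_extract(key, "team|ep")    # drop the A/B part of the name
  ) %>%
  # one row per game and side; convert = TRUE turns ep back into numeric
  spread(key, value, convert = TRUE)

outputdf <- worldcup.long %>%
  group_by(team) %>%
  summarize(totPts = sum(ep))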
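For comparison, the same long-format reshape can probably be written with tidyr::pivot_longer() in place of gather()/spread(). This is a sketch rather than part of the original answer: it assumes tidyr 1.0.0 or later, and toy.df and long_totals are hypothetical names, with toy.df a three-match subset of worldcup.df. The special ".value" entry in names_to tells pivot_longer() to turn the first regex group (team or ep) into output columns, while the second group (A or B) becomes the AB column.

```r
library(dplyr)
library(tidyr)

# Hypothetical toy subset: three matches copied from worldcup.df.
toy.df <- data.frame(
  teamA_name = c("Russia", "Egypt", "Russia"),
  teamB_name = c("Saudi Arabia", "Uruguay", "Egypt"),
  epA = c(1.64, 0.7051, 1.4941),
  epB = c(1.094, 2.0809, 1.2319),
  stringsAsFactors = FALSE
)

long_totals <- toy.df %>%
  mutate(game_id = row_number()) %>%
  pivot_longer(
    -game_id,
    names_to = c(".value", "AB"),
    # group 1 -> output column name (team / ep), group 2 -> side (A / B)
    names_pattern = "^(team|ep)([AB]).*$"
  ) %>%
  group_by(team) %>%
  summarise(totPts = sum(ep))

print(long_totals)
```

Because team and ep end up in separate columns straight away, ep stays numeric throughout and no convert step is needed.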