I have a data frame with the following columns:
> colnames(my.dataframe)
[1] "id" "firstName" "lastName"
[4] "position" "jerseyNumber" "currentTeamId"
[7] "currentTeamAbbreviation" "currentRosterStatus" "height"
[10] "weight" "birthDate" "age"
[13] "birthCity" "birthCountry" "rookie"
[16] "handednessShoots" "college" "twitter"
[19] "currentInjuryDescription" "currentInjuryPlayingProbability" "teamId"
[22] "teamAbbreviation" "fg2PtAtt" "fg3PtAtt"
[25] "fg2PtMade" "fg3PtMade" "ftMade"
[28] "fg2PtPct" "fg3PtPct" "ftPct"
[31] "ast" "tov" "offReb"
[34] "foulsDrawn" "blkAgainst" "plusMinus"
[37] "minSeconds"
Here is my code, which does not work:
my.dataframe %>%
dplyr::group_by(id) %>%
dplyr::summarise_at(vars(firstName:currentInjuryPlayingProbability), funs(min), na.rm = TRUE) %>%
dplyr::summarise_at(vars(fg2PtAtt:minSeconds), funs(sum), na.rm = TRUE) %>%
vars(), funs(min), na.rm = TRUE) %>%
dplyr::summarise(teamId = paste(teamId), teamAbbreviation = paste(teamAbbreviation))
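First, I group by id (despite its name, id is not a unique column in my data frame). The next 19 columns, up to and including currentInjuryPlayingProbability, always hold the same value within each id group, so I use the min function to summarise them and pick out that value.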
Next, I want to summarise every column from fg2PtAtt to the end with the mean (these columns are all numeric/integer).
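Finally, for the teamId and teamAbbreviation columns (which are not the same within an id group), I want to paste the values together so that each column holds a single string per id in the summary.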
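As a minimal sketch of that last step, in base R only: `paste()` on its own returns one string per element, so a `collapse` argument is needed to get a single string. The vector `team_ids` is a made-up stand-in for one player's teamId values.

```r
# Hypothetical teamId values for a single player who appears on two teams
team_ids <- c(91L, 92L)

# Without collapse, paste() just converts each element to a string
pasted <- paste(team_ids)
length(pasted)
# → 2

# With collapse, the elements are joined into one string
collapsed <- paste(team_ids, collapse = ", ")
collapsed
# → "91, 92"
```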
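I believe my approach fails because I cannot call summarise_at, then another summarise_at, and then summarise. By the time the second summarise_at runs, the columns it is trying to summarise have already been dropped by the first summarise_at.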
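The column-dropping behaviour can be seen in a small sketch. The tibble `toy` is an invented, cut-down stand-in for the real data, and `funs()` is replaced by a bare function name here, which is an assumption about the installed dplyr version rather than part of the original code.

```r
library(dplyr)

# Toy data (hypothetical), modelled on a few of the real columns
toy <- tibble(
  id        = c(1L, 2L, 2L),
  firstName = c("Alex", "Luke", "Luke"),
  teamId    = c(96L, 91L, 92L),
  fg2PtAtt  = c(70L, 57L, 2L)
)

# The first summarise_at keeps only the grouping column and the selected column
step1 <- toy %>%
  group_by(id) %>%
  summarise_at(vars(firstName), min)

names(step1)
# → "id" "firstName"

# fg2PtAtt is gone, so a second summarise_at on it has nothing to work with
"fg2PtAtt" %in% names(step1)
# → FALSE
```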
Any help is appreciated! I will update the question shortly with a subset of my data frame to test code on.
Edit:
dput(my.dataframe)
structure(list(id = c(10138L, 9466L, 9360L, 9360L), firstName = c("Alex",
"Quincy", "Luke", "Luke"), lastName = c("Abrines", "Acy", "Babbitt",
"Babbitt"), currentInjuryPlayingProbability = c(NA_character_,
NA_character_, NA_character_, NA_character_), teamId = c(96L,
84L, 91L, 92L), teamAbbreviation = c("OKL", "BRO", "ATL", "MIA"
), fg2PtAtt = c(70L, 73L, 57L, 2L), fg3PtAtt = c(221L, 292L,
111L, 45L), minSeconds = c(67637L, 81555L, 34210L, 8676L)), row.names = c(NA,
-4L), class = c("tbl_df", "tbl", "data.frame"))
my.dataframe
id firstName lastName currentInjuryPlayingProbability teamId teamAbbreviation fg2PtAtt fg3PtAtt minSeconds
<int> <chr> <chr> <chr> <int> <chr> <int> <int> <int>
1 10138 Alex Abrines <NA> 96 OKL 70 221 67637
2 9466 Quincy Acy <NA> 84 BRO 73 292 81555
3 9360 Luke Babbitt <NA> 91 ATL 57 111 34210
4 9360 Luke Babbitt <NA> 92 MIA 2 45 8676
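This is a short example with only 9 columns, but it has enough data to highlight the problem. The resulting data frame should look like this: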
id firstName lastName currentInjuryPlayingProbability teamId teamAbbreviation fg2PtAtt fg3PtAtt minSeconds
<int> <chr> <chr> <chr> <chr> <chr> <int> <int> <int>
1 10138 Alex Abrines <NA> 96 OKL 70 221 67637
2 9466 Quincy Acy <NA> 84 BRO 73 292 81555
3 9360 Luke Babbitt <NA> 91, 92 ATL, MIA 57 156 42886
I think this is the simplest way to accomplish this particular task, at least compared with some of the similar map2/reduce solutions I have seen.
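The first thing to note is that if you are using min to grab a value because you believe it is the same for every row within a group, you can simply add that column to the grouping instead. It is then kept automatically.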
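A minimal sketch of that first point, using an invented tibble `toy` (not the asker's data): the constant column firstName is added to the grouping, so it survives the summary without needing min.

```r
library(dplyr)

# Toy data (hypothetical): firstName is constant within each id
toy <- tibble(
  id        = c(1L, 2L, 2L),
  firstName = c("Alex", "Luke", "Luke"),
  fg2PtAtt  = c(70L, 57L, 2L)
)

# Grouping by id AND firstName keeps firstName in the result
by_group <- toy %>%
  group_by(id, firstName) %>%
  summarise(fg2PtAtt = sum(fg2PtAtt)) %>%
  ungroup()

names(by_group)
# → "id" "firstName" "fg2PtAtt"
```

This only gives one row per id if the extra grouping columns really are constant within each id; otherwise each distinct combination gets its own row.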
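Second, you can wrap the RHS in {} to override the way %>% automatically places the LHS into the first argument of the RHS. This lets you apply different transformations to the same input and then recombine them. Usually you don't need to do this because the placeholder . handles it for you, but it is sometimes needed when the placeholder appears only inside nested calls rather than as a direct argument of the RHS. (I am sure I once read a resource that described the exact rules, but I can't find it now.)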
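A small sketch of the difference, using a plain numeric vector rather than a data frame. The names `x`, `a` and `b` are only for illustration.

```r
library(magrittr)

x <- c(1, 2, 3)

# `.` appears only inside nested calls, so x is ALSO inserted as the
# first argument: this is c(x, min(x), max(x))
a <- x %>% c(min(.), max(.))
a
# → 1 2 3 1 3

# With braces nothing is inserted automatically: this is c(min(x), max(x))
b <- x %>% {c(min(.), max(.))}
b
# → 1 3
```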
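Third, because summarise drops every column that is not selected, apart from the grouping variables, the two summaries share only the grouping columns, and left_join will automatically join on those shared column names.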
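To illustrate the join step in isolation, here is a sketch with two hand-written summary tables. `teams` and `stats` are made-up stand-ins for the two summarise_at results, sharing only the id column.

```r
library(dplyr)

# Hypothetical stand-ins for the two summaries; only `id` is shared
teams <- tibble(id = c(1L, 2L), teamId = c("96", "91,92"))
stats <- tibble(id = c(1L, 2L), fg2PtAtt = c(70, 29.5))

# With no `by` argument, left_join() joins on all shared column names
# and prints a message saying which ones it used
joined <- left_join(teams, stats)

names(joined)
# → "id" "teamId" "fg2PtAtt"
```

Passing `by = "id"` explicitly gives the same result and silences the message, which may be preferable outside of interactive use.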
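This means we can do the following, which I think is fairly clean. However, if the transformations start to get particularly complex (for example, if there are pipes inside the left_join), I would recommend giving each part of the final output its own assignment and name to make it clearer. You also need to take care if you want multiple summaries of the same columns (such as mean and standard deviation), because the column names will collide.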
library(tidyverse)
my_dataframe <- structure(list(id = c(10138L, 9466L, 9360L, 9360L), firstName = c("Alex", "Quincy", "Luke", "Luke"), lastName = c("Abrines", "Acy", "Babbitt", "Babbitt"), currentInjuryPlayingProbability = c(NA_character_, NA_character_, NA_character_, NA_character_), teamId = c(96L, 84L, 91L, 92L), teamAbbreviation = c("OKL", "BRO", "ATL", "MIA"), fg2PtAtt = c(70L, 73L, 57L, 2L), fg3PtAtt = c(221L, 292L, 111L, 45L), minSeconds = c(67637L, 81555L, 34210L, 8676L)), row.names = c(NA, -4L), class = c("tbl_df", "tbl", "data.frame"))
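my_dataframe %>%
  # Group by the columns that are constant within each id so they are kept
  group_by_at(.vars = vars(id:lastName)) %>%
  # Braces stop %>% from inserting the LHS as the first argument of left_join()
  {left_join(
    summarise_at(., vars(teamId:teamAbbreviation), ~ str_c(., collapse = ",")),
    summarise_at(., vars(fg2PtAtt:minSeconds), mean)
  )}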
#> Joining, by = c("id", "firstName", "lastName")
#> # A tibble: 3 x 8
#> # Groups: id, firstName [?]
#> id firstName lastName teamId teamAbbreviation fg2PtAtt fg3PtAtt
#> <int> <chr> <chr> <chr> <chr> <dbl> <dbl>
#> 1 9360 Luke Babbitt 91,92 ATL,MIA 29.5 78
#> 2 9466 Quincy Acy 84 BRO 73 292
#> 3 10138 Alex Abrines 96 OKL 70 221
#> # ... with 1 more variable: minSeconds <dbl>
Created on 2018-07-31 by the reprex package (v0.2.0).
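For the case mentioned above where several summaries of the same column are wanted, one possible sketch is to give each part its own assignment and to name the output columns explicitly so they cannot collide in the join. The tibble `toy` and the names `means`, `sds` and `result` are invented for illustration, and plain summarise() is used here instead of summarise_at.

```r
library(dplyr)

# Toy data (hypothetical)
toy <- tibble(
  id       = c(1L, 2L, 2L),
  fg2PtAtt = c(70L, 57L, 2L)
)

grouped <- group_by(toy, id)

# Each part of the final output gets its own name and distinct column names
means <- summarise(grouped, fg2PtAtt_mean = mean(fg2PtAtt))
sds   <- summarise(grouped, fg2PtAtt_sd = sd(fg2PtAtt))

# Only `id` is shared, so the join key is unambiguous
result <- left_join(means, sds, by = "id")

names(result)
# → "id" "fg2PtAtt_mean" "fg2PtAtt_sd"
```

Note that sd() returns NA for a group with a single row, as for id 1 here.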