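I have a data frame sorted by an id variable ("city"). For cities with more than one observation, I want to keep only the second observation; cities with a single observation should keep that one row.
For example, here is a sample dataset: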
city <- c(1,1,2,3,3,4,5,6,7,7,8)
value <- c(3,5,7,8,2,5,4,2,3,2,3)
mydata <- data.frame(city, value)
This gives:
   city value
1     1     3
2     1     5
3     2     7
4     3     8
5     3     2
6     4     5
7     5     4
8     6     2
9     7     3
10    7     2
11    8     3
The desired result is:
   city value
2     1     5
3     2     7
5     3     2
6     4     5
7     5     4
8     6     2
10    7     2
11    8     3
Any help is appreciated!
library(dplyr)
mydata %>%
  group_by(city) %>%
  filter(n() == 1L | row_number() == 2L) %>%
  ungroup()
# # A tibble: 8 x 2
#    city value
#   <dbl> <dbl>
# 1     1     5
# 2     2     7
# 3     3     2
# 4     4     5
# 5     5     4
# 6     6     2
# 7     7     2
# 8     8     3
Or, slightly differently:
mydata %>%
  group_by(city) %>%
  slice(min(n(), 2)) %>%
  ungroup()
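Both pipelines should select the same rows here: `n() == 1L | row_number() == 2L` and `min(n(), 2)` each resolve to position 1 for single-row cities and position 2 otherwise. A minimal sketch to check this on the example data (the names `by_filter` and `by_slice` are illustrative, not from the answer):

```r
library(dplyr)

# Example data from the question
mydata <- data.frame(
  city  = c(1, 1, 2, 3, 3, 4, 5, 6, 7, 7, 8),
  value = c(3, 5, 7, 8, 2, 5, 4, 2, 3, 2, 3)
)

# Variant 1: keep the only row of a group, or the row at position 2
by_filter <- mydata %>%
  group_by(city) %>%
  filter(n() == 1L | row_number() == 2L) %>%
  ungroup()

# Variant 2: slice at position min(group size, 2)
by_slice <- mydata %>%
  group_by(city) %>%
  slice(min(n(), 2)) %>%
  ungroup()

# Compare the selected rows column by column
same_rows <- identical(by_filter$city, by_slice$city) &&
  identical(by_filter$value, by_slice$value)
same_rows
```

Note that for a city with three or more rows, both forms still return only its second row.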
ind <- ave(rep(TRUE, nrow(mydata)), mydata$city,
           FUN = function(z) length(z) == 1L | seq_along(z) == 2L)
ind
#  [1] FALSE  TRUE  TRUE FALSE  TRUE  TRUE  TRUE  TRUE FALSE  TRUE  TRUE
mydata[ind,]
#    city value
# 2     1     5
# 3     2     7
# 5     3     2
# 6     4     5
# 7     5     4
# 8     6     2
# 10    7     2
# 11    8     3
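The `ave()` call works because it applies `FUN` within each city and returns a vector aligned with the original rows. As a sketch, the same idea could be wrapped in a small helper; `keep_second_or_only` is a hypothetical name, and the `id` argument is assumed to be a column name given as a string:

```r
# Hypothetical helper (not from the answer): keep the 2nd row of each group,
# or the only row when the group has a single observation.
keep_second_or_only <- function(df, id) {
  rows <- seq_len(nrow(df))
  pos  <- ave(rows, df[[id]], FUN = seq_along)  # position of each row within its group
  size <- ave(rows, df[[id]], FUN = length)     # size of the group each row belongs to
  df[pos == 2L | size == 1L, , drop = FALSE]
}

# Example data from the question
mydata <- data.frame(
  city  = c(1, 1, 2, 3, 3, 4, 5, 6, 7, 7, 8),
  value = c(3, 5, 7, 8, 2, 5, 4, 2, 3, 2, 3)
)

res <- keep_second_or_only(mydata, "city")
res
```

Splitting the condition into `pos` and `size` makes each part easy to inspect on its own, at the cost of a second pass over the grouping variable.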
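Since you mentioned "bigger", you might at some point consider data.table for its speed and reference semantics. (And it doesn't hurt that the code is more concise :-)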
library(data.table)
DT <- as.data.table(mydata) # normally one might use setDT(mydata) instead ...
DT[, .SD[min(.N, 2),], by = city]
#     city value
#    <num> <num>
# 1:     1     5
# 2:     2     7
# 3:     3     2
# 4:     4     5
# 5:     5     4
# 6:     6     2
# 7:     7     2
# 8:     8     3
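On larger tables, subsetting `.SD` per group can become a bottleneck. A commonly used alternative, sketched here as an assumption rather than as part of the answer above, is to collect row numbers with `.I` and subset the table once (`idx` and `res` are illustrative names):

```r
library(data.table)

# Example data from the question, built directly as a data.table
DT <- data.table(
  city  = c(1, 1, 2, 3, 3, 4, 5, 6, 7, 7, 8),
  value = c(3, 5, 7, 8, 2, 5, 4, 2, 3, 2, 3)
)

# .I holds the row numbers of the current group within DT;
# min(.N, 2L) is 1 for single-row cities and 2 otherwise.
# The unnamed result column is called V1 by default.
idx <- DT[, .I[min(.N, 2L)], by = city]$V1

# Subset the original table once using the collected row numbers
res <- DT[idx]
res
```

Whether this is measurably faster depends on the data, so it is worth benchmarking both forms on a realistic table before switching.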