Lou*_*sen 3 r duplicates conditional-statements
Suppose I have the following data:
dt <- data.frame(id=c(1,1,2,2,3,3,3,4,5,5,5,5,6,7,7),
rk=c("a","a","b","b","c","y","c","d","e","y","e","e","f","g","h"),
.id=c("df1", "df9", "df5", "df16", "df2", "df11", "df11", "df4", "df9", "df4", "df6", "df3", "df16", "df2", "df9"))
So my data looks like this:
id rk .id
1 a df1
1 a df9
2 b df5
2 b df16
3 c df2
3 y df11
3 c df11
4 d df4
5 e df9
5 y df4
5 e df6
5 e df3
6 f df16
7 g df2
7 h df9
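But I want only one row for each pair of id and rk. So in the example, id = 5 can have two rows: one with rk = e and another with rk = y.
To find the right row, look at the .id column. Here I rank the values in the following order of priority (highest first):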
df2,df9,df1,df5,df4,df6,df15,df17,df16,df14,df8,df11,df3,df7,df12,df13,df10
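Thus, I would always keep a row with .id = df2 over a row with .id = df9. Likewise, I would always keep a row with .id = df15 over a row with .id = df14.
Note that the order is not chronological.
Going back to my example data, this is what I would like to end up with: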
id rk .id
1 a df9
2 b df5
3 c df2
3 y df11
4 d df4
5 e df9
5 y df4
6 f df16
7 g df2
7 h df9
My dataset is very large, so I hope some of you can help me with some simple, easy-to-use code.
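With dplyr we can group_by id and rk and, within each group, keep the .id whose position in new_order (defined at the end of this answer) comes first.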
library(dplyr)
dt %>%
group_by(id, rk) %>%
summarise(.id = .id[which.min(match(.id, new_order))])
# id rk .id
# <dbl> <fct> <fct>
# 1 1.00 a df9
# 2 2.00 b df5
# 3 3.00 c df2
# 4 3.00 y df11
# 5 4.00 d df4
# 6 5.00 e df9
# 7 5.00 y df4
# 8 6.00 f df16
# 9 7.00 g df2
#10 7.00 h df9
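To see why which.min(match(...)) picks the right row, here is a small sketch for a single group, id = 5 with rk = e from the example data. The priority vector is copied from the list in the question; the names x, positions and best are only illustrative.

```r
# Priority vector copied from the question (highest priority first)
new_order <- c("df2", "df9", "df1", "df5", "df4", "df6", "df15", "df17", "df16",
               "df14", "df8", "df11", "df3", "df7", "df12", "df13", "df10")

# The .id values of the group id = 5, rk = "e" in the example data
x <- c("df9", "df6", "df3")

# match() returns the position of each value in new_order
positions <- match(x, new_order)   # 2 6 13

# which.min() gives the index of the smallest position,
# i.e. the value with the highest priority
best <- x[which.min(positions)]    # "df9"
```

A value that does not appear in new_order gets NA from match(), and which.min() ignores NA, so such a value can only win if the whole group is unmatched, in which case the expression returns a zero-length result.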
An equivalent base R option using aggregate is
aggregate(.id~id+rk, dt, function(x) x[which.min(match(x, new_order))])
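If we want to keep the other columns as well, we can use filter instead of summarise. Note that filter keeps every row whose .id equals the winning value, so exact duplicate rows within a group would all be retained.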
dt %>%
group_by(id, rk) %>%
filter(.id == .id[which.min(match(.id, new_order))])
and its equivalent base R option using ave
dt[with(dt, .id == ave(.id, id, rk, FUN = function(x)
x[which.min(match(x, new_order))])), ]
where
# priority order from the question, highest priority first
new_order <- c("df2", "df9", "df1", "df5", "df4", "df6", "df15", "df17", "df16",
"df14", "df8", "df11", "df3", "df7", "df12", "df13", "df10")
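Since the dataset is described as very large, a different technique may be worth trying: rank every row once with match(), sort, and drop duplicates, instead of calling a function per group. This is a sketch in base R and not part of the approaches above; the names prio, sorted and result are illustrative, and whether it is actually faster on your data is an assumption you would need to check.

```r
# Example data and priority vector, copied from the question
dt <- data.frame(
  id  = c(1, 1, 2, 2, 3, 3, 3, 4, 5, 5, 5, 5, 6, 7, 7),
  rk  = c("a", "a", "b", "b", "c", "y", "c", "d", "e", "y", "e", "e", "f", "g", "h"),
  .id = c("df1", "df9", "df5", "df16", "df2", "df11", "df11", "df4",
          "df9", "df4", "df6", "df3", "df16", "df2", "df9"),
  stringsAsFactors = FALSE
)
new_order <- c("df2", "df9", "df1", "df5", "df4", "df6", "df15", "df17", "df16",
               "df14", "df8", "df11", "df3", "df7", "df12", "df13", "df10")

# Rank every row once: smaller number = higher priority
prio <- match(dt$.id, new_order)

# Sort by the grouping columns, then by priority
# (order() puts NA ranks, i.e. unknown .id values, last)
sorted <- dt[order(dt$id, dt$rk, prio), ]

# Keep only the first (best-ranked) row of each id/rk pair
result <- sorted[!duplicated(sorted[c("id", "rk")]), ]
rownames(result) <- NULL
```

This keeps exactly one row per id/rk pair together with all other columns, so it also covers the case where a group contains the same .id more than once.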