Remove duplicates across multiple columns and rows according to a rule

Lou*_*sen 3 r duplicates conditional-statements

Suppose I have the following data:

dt <- data.frame(id=c(1,1,2,2,3,3,3,4,5,5,5,5,6,7,7),
             rk=c("a","a","b","b","c","y","c","d","e","y","e","e","f","g","h"),
             .id=c("df1", "df9", "df5", "df16", "df2", "df11", "df11", "df4", "df9", "df4", "df6", "df3", "df16", "df2", "df9"))

So my data looks like this:

id   rk  .id
1    a   df1
1    a   df9
2    b   df5
2    b  df16
3    c   df2
3    y  df11
3    c  df11
4    d   df4
5    e   df9
5    y   df4
5    e   df6
5    e   df3
6    f  df16
7    g   df2
7    h   df9

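But I only want one row per id/rk pair. So in the example, id = 5 can have two rows: one with rk = e and one with rk = y.

To find the right row, look at the .id column. Here I rank the values in the following order of priority: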

df2,df9,df1,df5,df4,df6,df15,df17,df16,df14,df8,df11,df3,df7,df12,df13,df10

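So I will always keep a row with .id = df2 over a row with .id = df9. Likewise, I will always keep a row with .id = df15 over a row with .id = df14.

Note that the order is not chronological.

Going back to my example data, this is what I want to end up with: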

id   rk  .id
1    a   df9
2    b   df5
3    c   df2
3    y  df11
4    d   df4
5    e   df9
5    y   df4
6    f  df16
7    g   df2
7    h   df9

My data set is very large, so I hope some of you can help me write some simple, easy-to-use code.

Ron*_*hah 6

With dplyr we can group_by id and rk and, for each group, keep the .id that matches earliest in new_order.

library(dplyr)
dt %>%
  group_by(id, rk) %>%
  summarise(.id = .id[which.min(match(.id, new_order))])

#   id rk    .id  
#   <dbl> <fct> <fct>
# 1  1.00 a     df9  
# 2  2.00 b     df5  
# 3  3.00 c     df2  
# 4  3.00 y     df11 
# 5  4.00 d     df4  
# 6  5.00 e     df9  
# 7  5.00 y     df4  
# 8  6.00 f     df16 
# 9  7.00 g     df2  
#10  7.00 h     df9 
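To see why which.min(match(...)) picks the preferred row, here is a small worked example on a single group (id = 5, rk = e). It is only a sketch: new_order is typed in from the priority list in the question, and the names x, pos and best are mine, used for illustration.

```r
# Priority order copied from the question (earlier = preferred)
new_order <- c("df2", "df9", "df1", "df5", "df4", "df6", "df15", "df17", "df16",
               "df14", "df8", "df11", "df3", "df7", "df12", "df13", "df10")

# The .id values found in the group id = 5, rk = "e"
x <- c("df9", "df6", "df3")

# Position of each value in the priority order
pos <- match(x, new_order)
pos
# [1]  2  6 13

# The smallest position is the highest priority
best <- x[which.min(pos)]
best
# [1] "df9"
```

One thing to keep in mind: a value that does not appear in new_order gets NA from match(), and which.min() ignores NA, so such a row is only kept if nothing else in its group matches at all (and then the result is empty).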

The equivalent base R aggregate option is

aggregate(.id~id+rk, dt, function(x) x[which.min(match(x, new_order))]) 

If we want to keep some other columns, we can use filter instead of summarise

dt %>%
 group_by(id, rk) %>%
 filter(.id == .id[which.min(match(.id, new_order))])
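A caveat with the filter version: it keeps every row whose .id equals the preferred one, so if the same id/rk/.id combination occurs more than once, all of those rows survive. If exactly one row per group is needed, slice is a possible alternative. The sketch below uses a made-up three-row data frame (dup is hypothetical, not the asker's data) purely to show the difference.

```r
library(dplyr)

# Priority order copied from the question (earlier = preferred)
new_order <- c("df2", "df9", "df1", "df5", "df4", "df6", "df15", "df17", "df16",
               "df14", "df8", "df11", "df3", "df7", "df12", "df13", "df10")

# Hypothetical data: the preferred .id ("df9") appears twice in one group
dup <- data.frame(id  = c(1, 1, 1),
                  rk  = c("a", "a", "a"),
                  .id = c("df9", "df9", "df1"))

# filter keeps both "df9" rows because both equal the preferred value
kept_filter <- dup %>%
  group_by(id, rk) %>%
  filter(.id == .id[which.min(match(.id, new_order))]) %>%
  ungroup()

# slice keeps only the first row at the best position
kept_slice <- dup %>%
  group_by(id, rk) %>%
  slice(which.min(match(.id, new_order))) %>%
  ungroup()

nrow(kept_filter)  # 2
nrow(kept_slice)   # 1
```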

Its equivalent ave option

dt[with(dt, .id ==  ave(.id, id, rk, FUN = function(x) 
                    x[which.min(match(x, new_order))])), ]

where

new_order <- c("df2", "df9", "df1", "df5", "df4", "df6", "df15", "df17", "df16",
           "df14", "df8", "df11", "df3", "df7", "df12", "df13", "df10")
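Since the data set is described as very large, a fully vectorised base R variant might also be worth trying. This is a different technique from the grouped approaches above, not a rewrite of them: sort all rows once by id, rk and priority rank, then drop every row after the first in each id/rk pair with duplicated. The sketch below rebuilds dt and new_order from the question so it runs on its own; ranked and res are names I made up.

```r
# Priority order copied from the question (earlier = preferred)
new_order <- c("df2", "df9", "df1", "df5", "df4", "df6", "df15", "df17", "df16",
               "df14", "df8", "df11", "df3", "df7", "df12", "df13", "df10")

# Example data from the question, kept as character columns
dt <- data.frame(id = c(1,1,2,2,3,3,3,4,5,5,5,5,6,7,7),
                 rk = c("a","a","b","b","c","y","c","d","e","y","e","e","f","g","h"),
                 .id = c("df1", "df9", "df5", "df16", "df2", "df11", "df11", "df4",
                         "df9", "df4", "df6", "df3", "df16", "df2", "df9"),
                 stringsAsFactors = FALSE)

# Sort by id, rk, then by position of .id in the priority order
ranked <- dt[order(dt$id, dt$rk, match(dt$.id, new_order)), ]

# Keep only the first (highest-priority) row of each id/rk pair
res <- ranked[!duplicated(ranked[c("id", "rk")]), ]
rownames(res) <- NULL
res
```

On the example data this should give the ten rows requested in the question, and any additional columns in dt are carried along unchanged.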