如果存在于同一组中,则折叠行

chi*_*tra 5 r dplyr

大家好,我有一个数据框,例如:

  Group       family
1     A      Canidae
2     B      Canidae
3     A      Felidae
4     B      Canidae
5     C Elephantidae
6     C    Galinacae
7     D    Galinacae
8     D     Siuridae
9     E       Apidae
Run Code Online (Sandbox Code Playgroud)

我想折叠存在的Groupfamily(例如:

Canidae存在于AB 中, 因此我折叠并添加family2每个组的所有唯一值

Group family2
A,B   Canidae,Felidae 
Run Code Online (Sandbox Code Playgroud)

然后我继续,我看到Elephantidae 和 Galinacae都在C,而且Galinacae也在,D所以我崩溃了:

Group family2
A,B   Canidae,Felidae 
C,D   Elephantidae,Galinacae,Siuridae 
Run Code Online (Sandbox Code Playgroud)

最后我们应该得到:

Group family2
A,B   Canidae,Felidae 
C,D   Elephantidae,Galinacae,Siuridae 
E     Apidae 
Run Code Online (Sandbox Code Playgroud)

有人有想法吗?

这里的数据是为了诸如此类的事情吗?非常感谢您的帮助和时间。

如果有帮助,这里是数据:

structure(list(Group = structure(c(1L, 2L, 1L, 2L, 3L, 3L, 4L, 
4L, 5L), .Label = c("A", "B", "C", "D", "E"), class = "factor"), 
    family = structure(c(2L, 2L, 4L, 2L, 3L, 5L, 5L, 6L, 1L), .Label = c("Apidae", 
    "Canidae", "Elephantidae", "Felidae", "Galinacae", "Siuridae"
    ), class = "factor")), class = "data.frame", row.names = c(NA, 
-9L))
Run Code Online (Sandbox Code Playgroud)

Sin*_*yen 2

这是我的解决方案,具有一些查找功能

# A lookup function that look for intersect between group 
# if there are at least one intersect - those group will be combined
look_up_group <- function(one_group, lookup_list) {
  matched_list <- map(lookup_list, function(x) {
    tryCatch(
      {
        intersect(x, one_group)
      }, error = function(e) {
        stop(paste0("Error in lookup function: one_group=", one_group, "; x=", x))
      }) 
  })
  
  index <- which(unlist(map(matched_list, function(x) { length(x) > 0 })))
  sort(unique(unlist(lookup_list[index])))
}


df %>%
  # First remove all duplicated rows - exactly the same for both Group, Family
  filter(!duplicated(.)) %>%
  # arrange in alphabetical order
  arrange(Group, family) %>%
  # create a Group_2 which is combination of all Group for each family
  group_by(family) %>%
  mutate(Group_2 = list(Group)) %>%
  ungroup() %>%
  # Create Group_3 which is the full combined Group for all intersect Group
  mutate(Group_3 = map(.[["Group_2"]], function(x) { look_up_group(one_group = x, lookup_list = .[["Group_2"]]) })) %>%
  # Combine all Group_3 into a Group_final
  mutate(Group_final = unlist(map(Group_3, function(x) { paste (x, collapse = ",")} ))) %>%
  # Finally put them all together.
  select(Group_final, family) %>%
  group_by(Group_final) %>%
  summarize(family = paste(family, collapse = ","), .groups = "drop")
Run Code Online (Sandbox Code Playgroud)

这是最终的输出

  Group_final family                                   
* <chr>       <chr>                                    
1 A,B         Canidae,Felidae,Canidae                  
2 C,D         Elephantidae,Galinacae,Galinacae,Siuridae
3 E           Apidae   
Run Code Online (Sandbox Code Playgroud)

为了更容易理解,这里列出了每个步骤的详细信息

第一步

# remove duplicate & create variable Group_2
tmp <- df %>%
  filter(!duplicated(.)) %>%
  arrange(Group, family) %>%
  group_by(family) %>%
  mutate(Group_2 = list(Group)) %>%
  ungroup()
Run Code Online (Sandbox Code Playgroud)

我们有这个数据 -

  Group family       Group_2  
  <fct> <fct>        <list>   
1 A     Canidae      <fct [2]>
2 A     Felidae      <fct [1]>
3 B     Canidae      <fct [2]>
4 C     Elephantidae <fct [1]>
5 C     Galinacae    <fct [2]>
6 D     Galinacae    <fct [2]>
7 D     Siuridae     <fct [1]>
8 E     Apidae       <fct [1]>
Run Code Online (Sandbox Code Playgroud)

Group_2 看起来像这样

> tmp$Group_2
[[1]]
[1] A B
Levels: A B C D E

[[2]]
[1] A
Levels: A B C D E

[[3]]
[1] A B
Levels: A B C D E

[[4]]
[1] C
Levels: A B C D E

[[5]]
[1] C D
Levels: A B C D E

[[6]]
[1] C D
Levels: A B C D E

[[7]]
[1] D
Levels: A B C D E

[[8]]
[1] E
Levels: A B C D E
Run Code Online (Sandbox Code Playgroud)

然后下一步创建Group_3并将它们组合成Group_final

# Create Group_3
tmp <- tmp %>% 
  mutate(Group_3 = map(.[["Group_2"]],
    function(x) { look_up_group(one_group = x, lookup_list = .[["Group_2"]]) })) %>%
  mutate(Group_final = unlist(map(Group_3, function(x) { paste (x, collapse = ",")} )))
Run Code Online (Sandbox Code Playgroud)

这是新的 tmp

# A tibble: 8 x 5
  Group family       Group_2   Group_3   Group_final
  <fct> <fct>        <list>    <list>    <chr>      
1 A     Canidae      <fct [2]> <fct [2]> A,B        
2 A     Felidae      <fct [1]> <fct [2]> A,B        
3 B     Canidae      <fct [2]> <fct [2]> A,B        
4 C     Elephantidae <fct [1]> <fct [2]> C,D        
5 C     Galinacae    <fct [2]> <fct [2]> C,D        
6 D     Galinacae    <fct [2]> <fct [2]> C,D        
7 D     Siuridae     <fct [1]> <fct [2]> C,D        
8 E     Apidae       <fct [1]> <fct [1]> E     
Run Code Online (Sandbox Code Playgroud)

然后最后一步将 Group_final 中的家庭组合在一起

tmp %>%
  select(Group_final, family) %>%
  group_by(Group_final) %>%
  summarize(family = paste(family, collapse = ","), .groups = "drop")
Run Code Online (Sandbox Code Playgroud)

最终结果

# A tibble: 3 x 2
  Group_final family                                   
* <chr>       <chr>                                    
1 A,B         Canidae,Felidae,Canidae                  
2 C,D         Elephantidae,Galinacae,Galinacae,Siuridae
3 E           Apidae 
Run Code Online (Sandbox Code Playgroud)

[更新:添加了用于调试的 tryCatch]