Concatenating strings/rows using dplyr, group_by and mutate() or summarize() with str_c() or paste() and collapse, but keeping NA and all strings

MrG*_*ker 5 group-by r concatenation na dplyr

When concatenating strings with dplyr, group_by() and mutate() or summarize(), using paste() with collapse, NA values are coerced to the string "NA".

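When str_c() is used instead of paste(), any concatenation that includes an NA becomes NA as a whole (see ?str_c: whenever a missing value is combined with another string, the result will always be missing). When a group contains such a mix of NA and non-NA values, how can I drop the NA from the concatenation while keeping the non-NA strings?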

See my example below:

library(dplyr)
library(stringr)
ID <- c(1,1,2,2,3,4)
string <- c(' asfdas ', 'sdf', NA,'sadf', 'NA', NA)
df <- data.frame(ID, string)
#   ID   string
# 1  1  asfdas 
# 2  1      sdf
# 3  2     <NA> # ID 2 has both NA and non-NA values
# 4  2     sadf #
# 5  3       NA
# 6  4     <NA>

Both of the following,

df%>%
 group_by(ID)%>%
 summarize(string = paste(string, collapse = "; "))%>%
 distinct_all()

df_conca <-df%>%
 group_by(ID)%>%
 dplyr::mutate(string = paste(string, collapse = "; "))%>%
 distinct_all()

result in

     ID string               
1     1 " asfdas ; sdf"
2     2 "NA; sadf"           
3     3 "NA"
4     4 "NA" # NA coerced to "NA"

That is, NA becomes the string "NA".

In contrast,

df %>%
  group_by(ID)%>%
  summarize(string = str_c(string, collapse = "; "))

results in:

     ID string               
1     1 " asfdas ; sdf"
2     2 NA     
3     3 "NA" 
4     4 NA 

That is, "sadf" is dropped, following the str_c() rule that combining NA with a string yields NA.
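
The two behaviours can be reproduced on a bare vector, outside of any grouping. This is only a minimal sketch; the variable names x, pasted and joined are made up for illustration:

```r
library(stringr)

# One group's values: a real missing value next to a real string
x <- c(NA, "sadf")

# paste() coerces NA to the literal string "NA" before joining
pasted <- paste(x, collapse = "; ")
pasted        # "NA; sadf"

# str_c() treats NA as infectious, so the whole result is missing
joined <- str_c(x, collapse = "; ")
is.na(joined) # TRUE
```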

However, I would like to keep both the true NA values (e.g. ID 4) and the strings (e.g. ID 2), like this:

     ID string             
1     1 " asfdas ; sdf"
2     2 "sadf"           
3     3 "NA"
4     4 NA 

Ideally, I would like to stay within the dplyr workflow.


This question is an extension of "Concatenating strings/rows using dplyr, group_by & collapse or summarize, but maintain NA values".

akr*_*run 3

Use pivot_wider() and unite():

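library(dplyr)
library(tidyr)
library(data.table)  # only needed for rowid()
df %>% 
   # number the rows within each ID (1, 2, ...)
   mutate(rn = rowid(ID)) %>%
   # one column per within-ID row number: `1`, `2`
   pivot_wider(names_from = rn, values_from = string) %>% 
   # join the columns, skipping real NA values
   unite(string, `1`, `2`, na.rm = TRUE, sep = " ; ")%>%
   # groups that held only NA are now "", so turn them back into NA
   mutate(string = na_if(string, ""))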

Output:

# A tibble: 4 x 2
     ID string          
  <dbl> <chr>           
1     1 " asfdas  ; sdf"
2     2 "sadf"          
3     3 "NA"            
4     4  <NA>         
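
The unite() call above names the columns `1` and `2` explicitly, which matches the example data where no ID has more than two rows. If a group could have more rows, one possible variation (a sketch, not part of the original answer) is to unite every column except ID via a negative selection, and to number the rows with dplyr's row_number() so that data.table is not needed. The data frame df3 and the result name res are made up for illustration:

```r
library(dplyr)
library(tidyr)

# Hypothetical data: ID 1 has three rows instead of two
df3 <- data.frame(
  ID = c(1, 1, 1, 2, 2, 3, 4),
  string = c("a", "b", "c", NA, "sadf", "NA", NA),
  stringsAsFactors = FALSE
)

res <- df3 %>%
  group_by(ID) %>%
  mutate(rn = row_number()) %>%   # row number within each ID
  ungroup() %>%
  pivot_wider(names_from = rn, values_from = string) %>%
  # -ID selects all the widened columns, however many there are
  unite(string, -ID, na.rm = TRUE, sep = " ; ") %>%
  # all-NA groups collapse to "", so restore them to NA
  mutate(string = na_if(string, ""))

res
```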

Alternatively, use coalesce():

df %>%
    group_by(ID) %>%
    summarise(string = na_if(coalesce(str_c(string, collapse = " ; "),
     str_c(string[complete.cases(string)], collapse = " ; ")), ""))

Output:

# A tibble: 4 x 2
     ID string          
  <dbl> <chr>           
1     1 " asfdas  ; sdf"
2     2 "sadf"          
3     3 "NA"            
4     4  <NA>          
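
For completeness, the same result can probably also be reached with a small helper inside summarise(), which stays in the dplyr workflow and does not depend on how str_c() handles empty input. This is a sketch rather than part of the original answer; collapse_keep_na is a hypothetical name, and the "; " separator follows the question:

```r
library(dplyr)

# Hypothetical helper: drop real NA values, join what is left,
# and return a real NA when nothing is left
collapse_keep_na <- function(x, sep = "; ") {
  x <- x[!is.na(x)]
  if (length(x) == 0) NA_character_ else paste(x, collapse = sep)
}

# Same data as in the question
df <- data.frame(
  ID = c(1, 1, 2, 2, 3, 4),
  string = c(" asfdas ", "sdf", NA, "sadf", "NA", NA),
  stringsAsFactors = FALSE
)

result <- df %>%
  group_by(ID) %>%
  summarise(string = collapse_keep_na(string))

result
```

The literal string "NA" for ID 3 survives because is.na() only flags real missing values, while ID 4 stays a real NA.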