Combine columns while ignoring duplicates and NA

HNS*_*SKD 4 r dataframe dplyr tidyr

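I have a data frame as follows, and I want to combine two columns, namely Var1 and Var2. The combined column (Var3) should not contain duplicated <alpha><digit> tokens. That is, if Var1 == A1 and Var2 == A1, then Var3 == A1 rather than Var3 == A1-A1; or if Var1 == A4-E9 and Var2 == A4, then Var3 == A4-E9 and not Var3 == A4-E9-A4.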

df <- read.table(header = TRUE, text = 
"id  Var1    Var2
A   A1       A1
B   F2       A2
C   NA       A3
D   A4-E9    A4
E   E5       A5
F   NA       NA
G   B2-R4    A3-B2
H   B3-B4    E1-G5", stringsAsFactors = FALSE)

Below is my code. I would like to improve its readability and also get rid of the NA in row 3 of Var3, i.e. A3-NA.

library(dplyr)
library(tidyr)
df %>% 
  mutate(Var3 = paste(Var1, Var2, sep = "-"))  %>%
  separate_rows(Var3, sep = "-") %>%
  group_by(id, Var3) %>%
  slice(1) %>%
  group_by(id) %>%
  mutate(Var3 = paste(unlist(Var3[!is.na(Var3)]), collapse = "-")) %>%
  slice(1) %>%
  ungroup

This is my desired output:

# A tibble: 8 x 4
     id  Var1  Var2        Var3
  <chr> <chr> <chr>       <chr>
1     A    A1    A1          A1
2     B    F2    A2       A2-F2
3     C  <NA>    A3          A3
4     D A4-E9    A4       A4-E9
5     E    E5    A5       A5-E5
6     F  <NA>  <NA>        <NA>
7     G B2-R4 A3-B2    A3-B2-R4
8     H B3-B4 E1-G5 B3-B4-E1-G5

akr*_*run 5

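If 'df1' is the output of the code in the question, we can remove the "NA" that follows a "-" with sub. Note that row F then holds the literal string "NA" rather than a missing value.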

df1 %>% 
    mutate(Var3 = sub("-NA", "", Var3))
# A tibble: 8 x 4
#     id  Var1  Var2        Var3
#  <chr> <chr> <chr>       <chr>
#1     A    A1    A1          A1
#2     B    F2    A2       A2-F2
#3     C  <NA>    A3          A3
#4     D A4-E9    A4       A4-E9
#5     E    E5    A5       A5-E5
#6     F  <NA>  <NA>          NA
#7     G B2-R4 A3-B2    A3-B2-R4
#8     H B3-B4 E1-G5 B3-B4-E1-G5
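The token logic described in the question (split on "-", drop NA, drop duplicates, paste back together) can also be sketched directly in base R, without first building the "A3-NA" string. This is only a minimal sketch: the helper name combine_tokens is hypothetical and not part of either answer, and sorting the tokens is an assumption taken from the desired output.

```r
# Sample data from the question
df <- read.table(header = TRUE, text =
"id  Var1    Var2
A   A1       A1
B   F2       A2
C   NA       A3
D   A4-E9    A4
E   E5       A5
F   NA       NA
G   B2-R4    A3-B2
H   B3-B4    E1-G5", stringsAsFactors = FALSE)

# Hypothetical helper (not from the original answers): merge the tokens of
# two "-"-separated strings, dropping NAs and duplicates.
combine_tokens <- function(x, y, sep = "-") {
  # strsplit() returns NA for an NA input, so NAs survive the split
  tokens <- unlist(strsplit(c(x, y), sep, fixed = TRUE))
  # drop NAs, then de-duplicate and sort the remaining tokens
  tokens <- sort(unique(tokens[!is.na(tokens)]))
  # a row with no tokens at all becomes a real NA, not "" or "NA"
  if (length(tokens) == 0) NA_character_ else paste(tokens, collapse = sep)
}

# Apply the helper row by row; unname() drops the names mapply() adds
df$Var3 <- unname(mapply(combine_tokens, df$Var1, df$Var2))
```

With this approach row F ends up as a true NA, so no sub call is needed afterwards.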

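We can also do this slightly differently with the tidyverse: gather the data into 'long' format, split the 'value' column with separate_rows, group by 'id', summarise to create 'Var3' by pasteing the sorted unique elements, and left_join the result with the original dataset 'df'. The replace step turns the empty string produced for row F into NA.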

library(tidyverse)
gather(df, key, value, -id) %>%
       separate_rows(value)  %>%
       group_by(id) %>% 
       summarise(Var3 = paste(sort(unique(value)), collapse='-')) %>% 
       mutate(Var3 = replace(Var3, Var3=='', NA)) %>% 
       left_join(df, .)
#   id  Var1  Var2        Var3
#1  A    A1    A1          A1
#2  B    F2    A2       A2-F2
#3  C  <NA>    A3          A3
#4  D A4-E9    A4       A4-E9
#5  E    E5    A5       A5-E5
#6  F  <NA>  <NA>        <NA>
#7  G B2-R4 A3-B2    A3-B2-R4
#8  H B3-B4 E1-G5 B3-B4-E1-G5
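In current tidyr releases gather is superseded by pivot_longer, so the same pipeline could be written as below. This is a sketch that assumes tidyr 1.0.0 or later and a reasonably recent dplyr; the explicit sep = "-" and by = "id" arguments are additions for clarity rather than part of the original answer.

```r
library(dplyr)
library(tidyr)

# Sample data from the question
df <- read.table(header = TRUE, text =
"id  Var1    Var2
A   A1       A1
B   F2       A2
C   NA       A3
D   A4-E9    A4
E   E5       A5
F   NA       NA
G   B2-R4    A3-B2
H   B3-B4    E1-G5", stringsAsFactors = FALSE)

res <- df %>%
  # long format: one row per (id, original column) pair
  pivot_longer(-id, names_to = "key", values_to = "value") %>%
  # one row per token
  separate_rows(value, sep = "-") %>%
  group_by(id) %>%
  # sort() drops NA, unique() drops duplicates
  summarise(Var3 = paste(sort(unique(value)), collapse = "-")) %>%
  # rows with no tokens produce "", which is converted to NA
  mutate(Var3 = na_if(Var3, "")) %>%
  # attach Var3 back to the original columns
  left_join(df, ., by = "id")
```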

Note: %>% makes even simple code span multiple lines, but if required we can put all of these statements on a single line and call it a one-liner.


Here is a one-liner:

library(data.table)
setDT(df)[, Var3 := paste(sort(unique(unlist(strsplit(unlist(.SD),"-")))), collapse="-"), id]
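One caveat with the one-liner: for row F, where both columns are NA, paste receives an empty vector and returns "" rather than NA. A slightly longer variant that returns a real NA might look like the sketch below; it works on a copy (as.data.table) instead of modifying df by reference, and the .SDcols argument is an addition to make the input columns explicit.

```r
library(data.table)

# Sample data from the question
df <- read.table(header = TRUE, text =
"id  Var1    Var2
A   A1       A1
B   F2       A2
C   NA       A3
D   A4-E9    A4
E   E5       A5
F   NA       NA
G   B2-R4    A3-B2
H   B3-B4    E1-G5", stringsAsFactors = FALSE)

dt <- as.data.table(df)  # copy, so df itself is left unchanged
dt[, Var3 := {
  # split both columns into tokens; sort() drops NA, unique() drops duplicates
  tokens <- sort(unique(unlist(strsplit(unlist(.SD), "-", fixed = TRUE))))
  # return NA instead of "" when a row has no tokens
  if (length(tokens)) paste(tokens, collapse = "-") else NA_character_
}, by = id, .SDcols = c("Var1", "Var2")]
```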