Merge/combine columns with the same name but incomplete data

abc*_*t19 18 merge r

I have two data frames in which some columns have the same names and others have different names. The data frames look something like this:

df1
      ID hello world hockey soccer
    1  1    NA    NA      7      4
    2  2    NA    NA      2      5
    3  3    10     8      8     23
    4  4     4    17      5     12
    5  5    NA    NA      3     43

df2    
      ID hello world football baseball
    1  1     2     3       43        6
    2  2     5     1       24       32
    3  3    NA    NA        2       23
    4  4    NA    NA        5       15
    5  5     9     7       12       23

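As you can see, in the 2 shared columns ("hello" and "world"), some of the data is in one data frame and the rest is in the other.

What I want to do is (1) merge the 2 data frames by "ID", (2) combine all of the data from the "hello" and "world" columns of both frames into 1 "hello" column and 1 "world" column, and (3) have the final data frame also contain all of the other columns from the 2 original frames ("hockey", "soccer", "football", "baseball"). So I would like the final result to look like this: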

  ID hello world hockey soccer football baseball
1  1     2     3      7      4        43       6
2  2     5     1      2      5        24      32
3  3    10     8      8     23         2      23
4  4     4    17      5     12         5      15
5  5     9     7      3     43        12      23

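I'm very new to R, so the only code I've tried is variations on merge, and I've also tried the answer I found here, which is based on a similar question: R: merging copies of the same variable. However, my data set is actually much larger than what I'm showing here (there are about 20 matching columns (like "hello" and "world") and 100 non-matching columns (like "hockey" and "soccer")), so I'm looking for something that doesn't require me to write everything out by hand.

Any ideas on how this can be done? I'm sorry I can't provide a sample of my efforts, but I really don't know where to start beyond this: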

mydata <- merge(df1, df2, by=c("ID"), all = TRUE)

To reproduce the data frames:

df1 <- structure(list(ID = c(1L, 2L, 3L, 4L, 5L), hello = c(2, 5, NA, NA, 9), 
       world = c(3, 1, NA, NA, 7), football = c(43, 24, 2, 5, 12), 
       baseball = c(6, 32, 23, 15, 23)), .Names = c("ID", "hello", "world", 
       "football", "baseball"), class = "data.frame", row.names = c(NA, -5L)) 

df2 <- structure(list(ID = c(1L, 2L, 3L, 4L, 5L), hello = c(NA, NA, 10, 4, NA), 
       world = c(NA, NA, 8, 17, NA), hockey = c(7, 2, 8, 5, 3), 
       soccer = c(4, 5, 23, 12, 43)), .Names = c("ID", "hello", "world", "hockey", 
       "soccer"), class = "data.frame", row.names = c(NA, -5L))

A5C*_*2T1 12

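Here is an approach that involves melting your data, merging the molten data, and using dcast to get it back to a wide form. I've added comments to help explain what is happening.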

## Required packages
library(data.table)
library(reshape2)

dcast.data.table(
  merge(
    ## melt the first data.frame and set the key as ID and variable
    setkey(melt(as.data.table(df1), id.vars = "ID"), ID, variable), 
    ## melt the second data.frame
    melt(as.data.table(df2), id.vars = "ID"), 
    ## you'll have 2 value columns...
    all = TRUE)[, value := ifelse(
      ## ... combine them into 1 with ifelse
      is.na(value.x), value.y, value.x)], 
  ## This is your reshaping formula
  ID ~ variable, value.var = "value")
#    ID hello world football baseball hockey soccer
# 1:  1     2     3       43        6      7      4
# 2:  2     5     1       24       32      2      5
# 3:  3    10     8        2       23      8     23
# 4:  4     4    17        5       15      5     12
# 5:  5     9     7       12       23      3     43


thc*_*thc 8

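Nobody has posted a dplyr solution, so here is a concise option using dplyr. The approach is simply to do a full_join that combines all of the rows, then group and summarise to remove the redundant missing cells.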

library(tidyverse)
df1 <- structure(list(ID = 1:5, hello = c(NA, NA, 10L, 4L, NA), world = c(NA, NA, 8L, 17L, NA), hockey = c(7L, 2L, 8L, 5L, 3L), soccer = c(4L, 5L, 23L, 12L, 43L)), row.names = c(NA, -5L), class = c("tbl_df", "tbl", "data.frame"), spec = structure(list(cols = list(ID = structure(list(), class = c("collector_integer", "collector")), hello = structure(list(), class = c("collector_integer", "collector")), world = structure(list(), class = c("collector_integer", "collector")), hockey = structure(list(), class = c("collector_integer", "collector")), soccer = structure(list(), class = c("collector_integer", "collector"))), default = structure(list(), class = c("collector_guess", "collector"))), class = "col_spec"))
df2 <- structure(list(ID = 1:5, hello = c(2L, 5L, NA, NA, 9L), world = c(3L, 1L, NA, NA, 7L), football = c(43L, 24L, 2L, 5L, 12L), baseball = c(6L, 32L, 23L, 15L, 2L)), row.names = c(NA, -5L), class = c("tbl_df", "tbl", "data.frame"), spec = structure(list(cols = list(ID = structure(list(), class = c("collector_integer", "collector")), hello = structure(list(), class = c("collector_integer", "collector")), world = structure(list(), class = c("collector_integer", "collector")), football = structure(list(), class = c("collector_integer", "collector")), baseball = structure(list(), class = c("collector_integer", "collector"))), default = structure(list(), class = c("collector_guess", "collector"))), class = "col_spec"))

df1 %>%
  full_join(df2, by = intersect(colnames(df1), colnames(df2))) %>%
  group_by(ID) %>%
  summarize_all(na.omit)
#> # A tibble: 5 x 7
#>      ID hello world hockey soccer football baseball
#>   <int> <int> <int>  <int>  <int>    <int>    <int>
#> 1     1     2     3      7      4       43        6
#> 2     2     5     1      2      5       24       32
#> 3     3    10     8      8     23        2       23
#> 4     4     4    17      5     12        5       15
#> 5     5     9     7      3     43       12        2

Created on 2018-07-13 by the reprex package (v0.2.0).
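To see what the group-and-summarise step is collapsing, it can help to look at the intermediate join. The following is a minimal sketch, not the answer's exact code: it rebuilds the question's values as plain data frames, and it swaps `summarize_all(na.omit)` for `across()` with a hand-written helper. The helper name `first_non_na` is hypothetical.

```r
library(dplyr)

# Stand-in data with the values from the question
df1 <- data.frame(ID = 1:5,
                  hello = c(NA, NA, 10, 4, NA),
                  world = c(NA, NA, 8, 17, NA),
                  hockey = c(7, 2, 8, 5, 3),
                  soccer = c(4, 5, 23, 12, 43))
df2 <- data.frame(ID = 1:5,
                  hello = c(2, 5, NA, NA, 9),
                  world = c(3, 1, NA, NA, 7),
                  football = c(43, 24, 2, 5, 12),
                  baseball = c(6, 32, 23, 15, 23))

# Joining on every shared column (ID, hello, world) matches no rows here,
# so the full join keeps all rows from both inputs: two rows per ID.
joined <- full_join(df1, df2, by = intersect(colnames(df1), colnames(df2)))

# Hypothetical helper: first non-missing value, or NA if all are missing.
first_non_na <- function(x) x[!is.na(x)][1]

# Collapse the two rows per ID into one.
collapsed <- joined %>%
  group_by(ID) %>%
  summarise(across(everything(), first_non_na))
```

The helper always returns exactly one value per group, so it should also cope with a cell that is missing in both frames or present in both, which is the case the `na.omit` version is not designed for.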


Dav*_*urg 6

Here is another data.table approach, using a binary join:

library(data.table)
setkey(setDT(df1), ID) ; setkey(setDT(df2), ID) # Converting to data.table objects and setting keys
df1 <- df1[df2][, `:=`(i.hello = NULL, i.world = NULL)] # Full left join
df1[df2[complete.cases(df2)], `:=`(hello = i.hello, world = i.world)][] # Joining only on non-missing values
#    ID hello world football baseball hockey soccer
# 1:  1     2     3       43        6      7      4
# 2:  2     5     1       24       32      2      5
# 3:  3    10     8        2       23      8     23
# 4:  4     4    17        5       15      5     12
# 5:  5     9     7       12       23      3     43


Nik*_*kos 5

@ananda-mahto's answer is more elegant, but here is my suggestion:

library(reshape2)
df1=melt(df1,id='ID',na.rm=TRUE)
df2=melt(df2,id='ID',na.rm=TRUE)
DF=rbind(df1,df2)
# Not needed, added na.rm=TRUE based on @ananda-mahto's valid comment
# DF<-DF[!is.na(DF$value),]
dcast(DF,ID~variable,value.var='value')


Cal*_*You 5

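Here is a more tidyr-centric approach that does something similar to the currently accepted answer. The approach is simply to stack the data frames on top of each other with bind_rows (which matches column names), gather up all of the non-ID columns with na.rm = TRUE, and then spread them back out. Compared with the summarise option, this should be robust to situations where the condition "if the value is NA in df1 it will have a value in df2 (and vice versa)" doesn't always hold.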

library(tidyverse)
df1 <- structure(list(ID = 1:5, hello = c(NA, NA, 10L, 4L, NA), world = c(NA, NA, 8L, 17L, NA), hockey = c(7L, 2L, 8L, 5L, 3L), soccer = c(4L, 5L, 23L, 12L, 43L)), row.names = c(NA, -5L), class = c("tbl_df", "tbl", "data.frame"), spec = structure(list(cols = list(ID = structure(list(), class = c("collector_integer", "collector")), hello = structure(list(), class = c("collector_integer", "collector")), world = structure(list(), class = c("collector_integer", "collector")), hockey = structure(list(), class = c("collector_integer", "collector")), soccer = structure(list(), class = c("collector_integer", "collector"))), default = structure(list(), class = c("collector_guess", "collector"))), class = "col_spec"))
df2 <- structure(list(ID = 1:5, hello = c(2L, 5L, NA, NA, 9L), world = c(3L, 1L, NA, NA, 7L), football = c(43L, 24L, 2L, 5L, 12L), baseball = c(6L, 32L, 23L, 15L, 2L)), row.names = c(NA, -5L), class = c("tbl_df", "tbl", "data.frame"), spec = structure(list(cols = list(ID = structure(list(), class = c("collector_integer", "collector")), hello = structure(list(), class = c("collector_integer", "collector")), world = structure(list(), class = c("collector_integer", "collector")), football = structure(list(), class = c("collector_integer", "collector")), baseball = structure(list(), class = c("collector_integer", "collector"))), default = structure(list(), class = c("collector_guess", "collector"))), class = "col_spec"))

df1 %>%
  bind_rows(df2) %>%
  gather(variable, value, -ID, na.rm = TRUE) %>%
  spread(variable, value)
#> # A tibble: 5 x 7
#>      ID baseball football hello hockey soccer world
#>   <int>    <int>    <int> <int>  <int>  <int> <int>
#> 1     1        6       43     2      7      4     3
#> 2     2       32       24     5      2      5     1
#> 3     3       23        2    10      8     23     8
#> 4     4       15        5     4      5     12    17
#> 5     5        2       12     9      3     43     7

Created on 2018-07-13 by the reprex package (v0.2.0).
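In tidyr 1.0.0 and later, gather and spread are superseded by pivot_longer and pivot_wider. The following is a sketch of the same stack, lengthen, widen idea with the newer functions, assuming plain data frames that hold the question's values; it is an adaptation, not the code this answer was written with.

```r
library(dplyr)
library(tidyr)

# Stand-in data with the values from the question
df1 <- data.frame(ID = 1:5,
                  hello = c(NA, NA, 10, 4, NA),
                  world = c(NA, NA, 8, 17, NA),
                  hockey = c(7, 2, 8, 5, 3),
                  soccer = c(4, 5, 23, 12, 43))
df2 <- data.frame(ID = 1:5,
                  hello = c(2, 5, NA, NA, 9),
                  world = c(3, 1, NA, NA, 7),
                  football = c(43, 24, 2, 5, 12),
                  baseball = c(6, 32, 23, 15, 23))

combined <- bind_rows(df1, df2) %>%
  # Long form: one row per (ID, variable), dropping the missing cells
  pivot_longer(-ID, names_to = "variable", values_to = "value",
               values_drop_na = TRUE) %>%
  # Back to wide form: one column per variable, one row per ID
  pivot_wider(names_from = variable, values_from = value)
```

As with the gather/spread version, this assumes each ID has at most one non-missing value per column across the two frames; if both frames supplied a value for the same cell, pivot_wider would no longer have a single value to place there.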


Moo*_*per 5

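Using the tidyverse, we can use coalesce.

Neither of the solutions below builds extra rows; the data stays more or less the same size and a similar shape throughout the chain.

Solution 1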

library(tidyverse)

list(df1,df2) %>%
  transpose(union(names(df1),names(df2))) %>%
  map_dfc(. %>% compact %>% invoke(coalesce,.))

# # A tibble: 5 x 7
#      ID hello world football baseball hockey soccer
#   <int> <dbl> <dbl>    <dbl>    <dbl>  <dbl>  <dbl>
# 1     1     2     3       43        6      7      4
# 2     2     5     1       24       32      2      5
# 3     3    10     8        2       23      8     23
# 4     4     4    17        5       15      5     12
# 5     5     9     7       12       23      3     43

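Explanation

  • Wrap the two data frames into a list.
  • transpose it, so that each new item at the root has the name of an output column. The default behavior of transpose is to take the first element as a template, so unfortunately we have to be explicit to get all of the names.
  • compact these items, as they are all of length 2, but one of the two elements is NULL when the given column is missing on one side.
  • coalesce those, which basically means returning the first non-NA value you find when the arguments are placed side by side.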
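The last two steps can be tried on single columns. This is a small sketch using the question's "hello" and "hockey" values; the variable names are made up for illustration, and `do.call` stands in for `invoke`, which does the same job here.

```r
library(dplyr)
library(purrr)

# "hello" exists on both sides, with gaps in different places
hello_left  <- c(NA, NA, 10, 4, NA)
hello_right <- c(2, 5, NA, NA, 9)

# coalesce() walks the arguments position by position and keeps
# the first non-NA value it finds
hello <- coalesce(hello_left, hello_right)

# "hockey" exists on one side only, so the transposed item holds a NULL;
# compact() drops it and a single vector is left to coalesce
hockey_item <- compact(list(c(7, 2, 8, 5, 3), NULL))
hockey <- do.call(coalesce, hockey_item)
```

Here `hello` comes out as 2, 5, 10, 4, 9 and `hockey` is returned unchanged.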

If repeating df1 and df2 on the second line is an issue, use the following instead:

transpose(invoke(union, setNames(map(., names), c("x","y"))))

Solution 2

Same philosophy, but this time we loop over the names:

map_dfc(set_names(union(names(df1), names(df2))),
        ~ invoke(coalesce, compact(list(df1[[.x]], df2[[.x]]))))

# # A tibble: 5 x 7
#      ID hello world football baseball hockey soccer
#   <int> <dbl> <dbl>    <dbl>    <dbl>  <dbl>  <dbl>
# 1     1     2     3       43        6      7      4
# 2     2     5     1       24       32      2      5
# 3     3    10     8        2       23      8     23
# 4     4     4    17        5       15      5     12
# 5     5     9     7       12       23      3     43

Here it is in piped form, for those who might prefer it:

union(names(df1), names(df2)) %>%
  set_names %>%
  map_dfc(~ list(df1[[.x]], df2[[.x]]) %>%
            compact %>%
            invoke(coalesce, .))
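
Explanation

  • set_names gives a character vector names identical to its values, so that map_dfc can name the output's columns correctly.
  • df1[[.x]] will return NULL when .x is not a column of df1; we take advantage of this.
  • df1 and df2 are each mentioned twice, and I can't think of any way around it.

Solution 1 is cleaner in those respects, so I recommend it.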
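The two behaviours that Solution 2 relies on can be checked in isolation. This is a minimal sketch with a tiny stand-in data frame (its contents are made up for illustration):

```r
library(purrr)

# Tiny stand-in for df1
df1 <- data.frame(ID = 1:2, hello = c(NA, 4))

# set_names() with no second argument names each element after itself,
# which is what lets map_dfc() label the output columns
nms <- set_names(c("ID", "hello", "hockey"))
names(nms)

# `[[` on a data frame returns NULL for a column that does not exist,
# so compact() can drop it before coalescing
missing_col <- df1[["hockey"]]
is.null(missing_col)
```

`names(nms)` matches the values of `nms`, and `is.null(missing_col)` is TRUE.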