I want to merge 19 columns of different lengths, taken from different data frames, and compare them. Here is an example:
df1:
PA0001
PA0002
PA0003
PA0004
PA0005
df2:
PA0001
PA0003
PA0006
PA0007
df3:
PA0001
PA0007
and so on...
The expected output looks like this:
PA0001 | PA0001 | PA0001
PA0002 | NA | NA
PA0003 | PA0003 | NA
PA0004 | NA | NA
PA0005 | NA | NA
NA | PA0006 | NA
NA | PA0007 | PA0007
I tried the compare and merge functions, but I did not get a good result. I also tried the function from this question: link
But I got this error:
Error in attributes(.Data) <- c(attributes(.Data), attrib) :
'names' attribute [5254] must be the same length as the vector [2]
Here is an example:
test1 <- data.frame(c("PA0001","PA0002","PA0003","PA0004","PA0005","PA0006"))
test2 <- data.frame(c("PA0001","PA0002","PA0004","PA0005","PA0007"))
test3 <- data.frame(c("PA0001","PA0004","PA0005","PA0007", "PA0008"))
Thank you very much.
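If we need the output as the OP expects, place the datasets in a list, rbind the list elements with rbindlist while creating a 'grp' column, then dcast from 'long' to 'wide', creating a sequence column in the formula by matching 'id' against the unique elements of 'id':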
library(data.table)
dcast(rbindlist(list(test1, test2, test3), idcol = 'grp'),
match(id, unique(id)) ~ paste0("col", grp))[, id := NULL][]
# col1 col2 col3
#1: PA0001 PA0001 PA0001
#2: PA0002 NA NA
#3: PA0003 PA0003 NA
#4: PA0004 NA NA
#5: PA0005 NA NA
#6: NA PA0006 NA
#7: NA PA0007 PA0007
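Or, as @jogo suggested, split the code to make it clearer. In the first step, rbind all the datasets in the list, creating the 'grp' column by specifying the idcol argument: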
t_all <- rbindlist(list(test1, test2, test3), idcol='grp');
Then dcast to 'wide' format and assign the 'id' column to NULL:
dcast(t_all, id ~ grp, value.var='id')[, id := NULL][]
Data:
test1 <- data.frame(id = c("PA0001","PA0002","PA0003","PA0004","PA0005"))
test2 <- data.frame(id = c("PA0001","PA0003","PA0006","PA0007"))
test3 <- data.frame(id = c("PA0001", "PA0007"))
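Note that the test1, test2 and test3 in the question were created without a column name, while the data.table code above refers to a column called id. A minimal sketch of one way to rename the single column first, assuming every data frame has exactly one column (the name dfs is only illustrative):

```r
# Data frames as created in the question: the single column has no explicit name
test1 <- data.frame(c("PA0001","PA0002","PA0003","PA0004","PA0005","PA0006"))
test2 <- data.frame(c("PA0001","PA0002","PA0004","PA0005","PA0007"))
test3 <- data.frame(c("PA0001","PA0004","PA0005","PA0007", "PA0008"))

# Give the single column of every data frame the name "id"
dfs <- lapply(list(test1, test2, test3), setNames, "id")

# Check the column name of each element
sapply(dfs, names)
```

The list dfs can then be passed to rbindlist(dfs, idcol = 'grp') in place of list(test1, test2, test3). With 19 data frames, building the list once like this also avoids repeating the rename by hand.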
You can try a tidyverse solution:
library(tidyverse)
d1 <- read.table(text="PA0001
PA0002
PA0003
PA0004
PA0005")
d2 <- read.table(text="PA0001
PA0003
PA0006
PA0007")
d3 <- read.table(text="PA0001
PA0007")
list(d1, d2, d3) %>%
bind_rows(.id = "df") %>%
mutate(n = TRUE) %>%
spread(df, n, fill = FALSE)
V1 1 2 3
1 PA0001 TRUE TRUE TRUE
2 PA0002 TRUE FALSE FALSE
3 PA0003 TRUE TRUE FALSE
4 PA0004 TRUE FALSE FALSE
5 PA0005 TRUE FALSE FALSE
6 PA0006 FALSE TRUE FALSE
7 PA0007 FALSE TRUE TRUE
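The idea is to put all the data.frames in a list, bind them by row, add a logical TRUE column, and use tidyr's spread function to get the result. Of course, you can also get your expected output with the following: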
list(d1, d2, d3) %>%
bind_rows(.id="df") %>%
mutate(n=V1) %>%
spread(df, n) %>%
select(-1)
1 2 3
1 PA0001 PA0001 PA0001
2 PA0002 <NA> <NA>
3 PA0003 PA0003 <NA>
4 PA0004 <NA> <NA>
5 PA0005 <NA> <NA>
6 <NA> PA0006 <NA>
7 <NA> PA0007 PA0007
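spread() still works, but tidyr documents it as superseded by pivot_wider(). The following is a sketch of the same reshaping with pivot_wider(), assuming tidyr 1.0.0 or later. The data frames are recreated with data.frame() so the block runs on its own, and the name wide is only illustrative:

```r
library(dplyr)
library(tidyr)

# Same values as d1, d2, d3 above, with the column named V1 as read.table would
d1 <- data.frame(V1 = c("PA0001","PA0002","PA0003","PA0004","PA0005"))
d2 <- data.frame(V1 = c("PA0001","PA0003","PA0006","PA0007"))
d3 <- data.frame(V1 = c("PA0001","PA0007"))

wide <- list(d1, d2, d3) %>%
  bind_rows(.id = "df") %>%                            # stack, recording the source in "df"
  mutate(n = V1) %>%                                   # copy the id so it becomes the cell value
  pivot_wider(names_from = df, values_from = n) %>%    # one column per source data frame
  select(-V1)                                          # drop the key column
wide
```

Ids that are missing from a data frame are left as NA in that data frame's column.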
In base R you can try:
Reduce(function(x, y) merge(x, y, by="V1", all.x = TRUE, all.y = TRUE),
lapply(list(d1, d2, d3), function(x) cbind(x,V2=x$V1)))[,-1]
V2.x V2.y V2
1 PA0001 PA0001 PA0001
2 PA0002 <NA> <NA>
3 PA0003 PA0003 <NA>
4 PA0004 <NA> <NA>
5 PA0005 <NA> <NA>
6 <NA> PA0006 <NA>
7 <NA> PA0007 PA0007
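With 19 data frames the repeated merge produces many suffixed column names. As an alternative, here is a base R sketch that skips merge entirely: collect every distinct id once, then for each data frame keep the id where it is present and NA otherwise. It assumes each data frame has a single column V1; the names dfs, ids, cols and res are only illustrative:

```r
d1 <- data.frame(V1 = c("PA0001","PA0002","PA0003","PA0004","PA0005"))
d2 <- data.frame(V1 = c("PA0001","PA0003","PA0006","PA0007"))
d3 <- data.frame(V1 = c("PA0001","PA0007"))
dfs <- list(d1, d2, d3)

# All distinct ids across every data frame, sorted
ids <- sort(unique(unlist(lapply(dfs, function(x) as.character(x$V1)))))

# One vector per data frame: the id where present, NA otherwise
cols <- lapply(dfs, function(x) ifelse(ids %in% x$V1, ids, NA_character_))
names(cols) <- paste0("col", seq_along(dfs))

res <- as.data.frame(cols, stringsAsFactors = FALSE)
res
```

Each row of res corresponds to one id, so the layout is the same as in the expected output of the question, with columns named col1, col2, col3.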