I want to merge two data frames based on multiple conditions.
DF1 <- data.frame("col1" = rep(c("A","B"), 18),
"col2" = rep(c("C","D","E"), 12),
"value"= (sample(1:100,36)),
"col4" = rep(NA,36))
DF2 <- data.frame("col1" = rep("A",6),
"col2" = rep(c("C","D"),3),
"data" = rep(c(1,3),3),
"min" = seq(0,59,by=10),
"max" = seq(10,69,by=10))
> DF1
col1 col2 value col4
1 A C 22 NA
2 B D 58 NA
3 A E 35 NA
4 B C 86 NA
5 A D 37 NA
6 B E 16 NA
7 A C 46 NA
8 B D 23 NA
9 A E 88 NA
10 B C 3 NA
11 A D 33 NA
12 B E 25 NA
13 A C 19 NA
14 B D 24 NA
15 A E 9 NA
16 B C 76 NA
17 A D 62 NA
18 B E 68 NA
19 A C 97 NA
20 B D 43 NA
21 A E 8 NA
22 B C 84 NA
23 A D 36 NA
24 B E 20 NA
25 A C 57 NA
26 B D 99 NA
27 A E 42 NA
28 B C 64 NA
29 A D 87 NA
30 B E 1 NA
31 A C 78 NA
32 B D 34 NA
33 A E 41 NA
34 B C 32 NA
35 A D 10 NA
36 B E 72 NA
> DF2
col1 col2 data min max
1 A C 1 0 10
2 A D 3 10 20
3 A C 1 20 30
4 A D 3 30 40
5 A C 1 40 50
6 A D 3 50 60
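DF1 is the main table; DF2 serves as a lookup table.
If col1 and col2 of DF1 match those of DF2, and DF1's "value" lies between DF2's min and max, then the "data" column from DF2 should be added to DF1. If the conditions are not met, "data" in DF1 should be NA.
Expected output (first 6 rows):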
col1 col2 value col4 data
1 A C 22 NA 1
2 B D 58 NA NA
3 A E 35 NA NA
4 B C 86 NA NA
5 A D 37 NA 3
6 B E 16 NA NA
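I tried using merge (matching on col1 and col2) followed by subset (keeping only the rows whose value lies between min and max), but my goal is to keep all rows of DF1.
Does anyone have an idea?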
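The attempt described above might look roughly like the sketch below. It is a reconstruction, not the asker's actual code. It uses the first six rows of DF1 with the values from the printed output and rebuilds DF2 as in the question. The names `merged` and `kept` are made up for illustration.

```r
# First six rows of DF1, using the values shown in the printed output
DF1 <- data.frame(col1 = c("A", "B", "A", "B", "A", "B"),
                  col2 = c("C", "D", "E", "C", "D", "E"),
                  value = c(22, 58, 35, 86, 37, 16),
                  col4 = NA,
                  stringsAsFactors = FALSE)
DF2 <- data.frame(col1 = rep("A", 6),
                  col2 = rep(c("C", "D"), 3),
                  data = rep(c(1, 3), 3),
                  min = seq(0, 59, by = 10),
                  max = seq(10, 69, by = 10),
                  stringsAsFactors = FALSE)

# merge() defaults to an inner join, so DF1 rows without a col1/col2 match vanish
merged <- merge(DF1, DF2, by = c("col1", "col2"))
nrow(merged)  # 6: the (A, C) and (A, D) rows, each paired with three DF2 ranges

# subset() then keeps only pairs whose value falls inside [min, max]
kept <- subset(merged, value >= min & value <= max)
nrow(kept)    # 2: only two of the six DF1 rows survive
```

Both steps discard rows: the inner join drops unmatched keys, and the filter drops rows that fall outside every range, which is why DF1 does not come back complete.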
With a recent version of data.table, non-equi joins and update on join are possible:
library(data.table)
head(setDT(DF1)[setDT(DF2), on = c("col1", "col2", "value>=min", "value<=max"),
data := data])
rn col1 col2 value col4 data
1: 1 A C 22 NA 1
2: 2 B D 58 NA NA
3: 3 A E 35 NA NA
4: 4 B C 86 NA NA
5: 5 A D 37 NA 3
6: 6 B E 16 NA NA
DF1 <- structure(list(rn = 1:36, col1 = c("A", "B", "A", "B", "A", "B",
"A", "B", "A", "B", "A", "B", "A", "B", "A", "B", "A", "B", "A",
"B", "A", "B", "A", "B", "A", "B", "A", "B", "A", "B", "A", "B",
"A", "B", "A", "B"), col2 = c("C", "D", "E", "C", "D", "E", "C",
"D", "E", "C", "D", "E", "C", "D", "E", "C", "D", "E", "C", "D",
"E", "C", "D", "E", "C", "D", "E", "C", "D", "E", "C", "D", "E",
"C", "D", "E"), value = c(22L, 58L, 35L, 86L, 37L, 16L, 46L,
23L, 88L, 3L, 33L, 25L, 19L, 24L, 9L, 76L, 62L, 68L, 97L, 43L,
8L, 84L, 36L, 20L, 57L, 99L, 42L, 64L, 87L, 1L, 78L, 34L, 41L,
32L, 10L, 72L), col4 = c(NA, NA, NA, NA, NA, NA, NA, NA, NA,
NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,
NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA)), .Names = c("rn",
"col1", "col2", "value", "col4"), row.names = c(NA, -36L), class = "data.frame")
DF2 <- structure(list(rn = 1:6, col1 = c("A", "A", "A", "A", "A", "A"
), col2 = c("C", "D", "C", "D", "C", "D"), data = c(1L, 3L, 1L,
3L, 1L, 3L), min = c(0L, 10L, 20L, 30L, 40L, 50L), max = c(10L,
20L, 30L, 40L, 50L, 60L)), .Names = c("rn", "col1", "col2", "data",
"min", "max"), row.names = c(NA, -6L), class = "data.frame")
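For readers who want to cross-check the result without data.table, the same matching rule can be sketched in base R. This is only an illustration under stated assumptions: `lookup_data` is a hypothetical helper, not part of either answer, and it takes the first matching range if several match. The sample uses the first six rows of DF1 from the data above.

```r
# Hypothetical helper: for each row of `main`, return `data` from the first
# `lookup` row with equal col1/col2 whose [min, max] range contains `value`;
# NA when no lookup row qualifies.
lookup_data <- function(main, lookup) {
  vapply(seq_len(nrow(main)), function(i) {
    hit <- as.character(lookup$col1) == as.character(main$col1[i]) &
      as.character(lookup$col2) == as.character(main$col2[i]) &
      main$value[i] >= lookup$min &
      main$value[i] <= lookup$max
    if (any(hit)) as.numeric(lookup$data[hit][1]) else NA_real_
  }, numeric(1))
}

DF1 <- data.frame(col1 = c("A", "B", "A", "B", "A", "B"),
                  col2 = c("C", "D", "E", "C", "D", "E"),
                  value = c(22L, 58L, 35L, 86L, 37L, 16L),
                  col4 = NA,
                  stringsAsFactors = FALSE)
DF2 <- data.frame(col1 = rep("A", 6),
                  col2 = rep(c("C", "D"), 3),
                  data = rep(c(1L, 3L), 3),
                  min = seq(0L, 50L, by = 10L),
                  max = seq(10L, 60L, by = 10L),
                  stringsAsFactors = FALSE)

# Every DF1 row is kept; unmatched rows simply receive NA
DF1$data <- lookup_data(DF1, DF2)
DF1$data  # 1 NA NA NA 3 NA
```

The row-by-row loop is slow on large tables, so it is better suited to verifying a join than to replacing one.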
Your data, changed to use stringsAsFactors=F
DF1 <- data.frame("col1" = rep(c("A","B"), 18),
"col2" = rep(c("C","D","E"), 12),
"value"= (sample(1:100,36)),
"col4" = rep(NA,36),
stringsAsFactors=F)
DF2 <- data.frame("col1" = rep("A",6),
"col2" = rep(c("C","D"),3),
"data" = rep(c(1,3),3),
"min" = seq(0,59,by=10),
"max" = seq(10,69,by=10),
stringsAsFactors=F)
Using dplyr: 1) merge the two data frames with left_join, 2) use ifelse to check, rowwise, whether value is between min and max, then 3) drop the min and max columns...
library(dplyr)
left_join(DF1, DF2, by=c("col1","col2")) %>%
rowwise() %>%
mutate(data = ifelse(between(value,min,max), data, NA)) %>%
select(-min, -max)
Not sure whether you were expecting some kind of aggregation, but this is the output of the code above
col1 col2 value col4 data
1 A C 23 NA NA
2 A C 23 NA 1
3 A C 23 NA NA
4 B D 59 NA NA
5 A E 57 NA NA
6 B C 8 NA NA
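The output above has one row per DF1/DF2 key pairing, so each matched DF1 row appears several times. If one row per DF1 row is wanted, a possible follow-up is to tag each DF1 row before the join and collapse afterwards. The sketch below is an assumption about what that aggregation could look like, not part of the answer. `row_id` and `hit` are illustrative names, the sample uses the first six rows printed in the question, and `.groups` requires dplyr 1.0 or later.

```r
library(dplyr)

DF1 <- data.frame(col1 = c("A", "B", "A", "B", "A", "B"),
                  col2 = c("C", "D", "E", "C", "D", "E"),
                  value = c(22, 58, 35, 86, 37, 16),
                  stringsAsFactors = FALSE)
DF2 <- data.frame(col1 = rep("A", 6),
                  col2 = rep(c("C", "D"), 3),
                  data = rep(c(1, 3), 3),
                  min = seq(0, 59, by = 10),
                  max = seq(10, 69, by = 10),
                  stringsAsFactors = FALSE)

result <- DF1 %>%
  mutate(row_id = row_number()) %>%                  # remember the original DF1 row
  left_join(DF2, by = c("col1", "col2")) %>%         # one row per key pairing
  mutate(hit = !is.na(min) & value >= min & value <= max) %>%
  group_by(row_id, col1, col2, value) %>%
  # keep the first in-range `data` value per original row, else NA
  summarise(data = if (any(hit)) data[hit][1] else NA_real_,
            .groups = "drop") %>%
  select(-row_id)

result$data  # 1 NA NA NA 3 NA
```

On the full 36-row DF1, recent dplyr versions may warn about a many-to-many relationship in the join, because several DF1 rows share the same col1/col2 key.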