Tags: if-statement, r, duplicates, data.table
I have this dataset:
ID <- c(1,1,2,2,2,2,3,3,3,3,3,4,4,4)
Eval <- c("A","A","B","B","A","A","A","A","B","B","A","A","A","B")
med <- c("c","d","k","k","h","h","c","d","h","h","h","c","h","k")
df <- data.frame(ID,Eval,med)
> df
ID Eval med
1 1 A c
2 1 A d
3 2 B k
4 2 B k
5 2 A h
6 2 A h
7 3 A c
8 3 A d
9 3 B h
10 3 B h
11 3 A h
12 4 A c
13 4 A h
14 4 B k
I am trying to create the variables x and y, grouped by ID and Eval. For each ID, if Eval = A and med = "h" or "k", I set x = 1, otherwise x = 0; if Eval = B and med = "h" or "k", I set y = 1, otherwise y = 0. The approach below gives me the answer, but I don't like it; it doesn't seem very clean.
library(data.table)
setDT(df)  # convert to a data.table by reference; only needed once
df[, count := uniqueN(med), by = .(ID, Eval)]
df[Eval == "A", x := ifelse(count == 1 & med %in% c("k", "h"), 1, 0), by = ID]
df[Eval == "B", y := ifelse(count == 1 & med %in% c("k", "h"), 1, 0), by = ID]
ID Eval med count x y
1: 1 A c 2 0 NA
2: 1 A d 2 0 NA
3: 2 B k 1 NA 1
4: 2 B k 1 NA 1
5: 2 A h 1 1 NA
6: 2 A h 1 1 NA
7: 3 A c 3 0 NA
8: 3 A d 3 0 NA
9: 3 B h 1 NA 1
10: 3 B h 1 NA 1
11: 3 A h 3 0 NA
12: 4 A c 2 0 NA
13: 4 A h 2 0 NA
14: 4 B k 1 NA 1
Then I need to collapse the rows so that there is one row per unique ID. I don't know how to collapse the rows. Any ideas?
Desired output:
ID x y
1 0 0
2 1 1
3 0 1
4 0 1
We can create the 'x' and 'y' variables grouped by 'ID', without the NA elements, by directly coercing the logical vector to binary (as.integer):
df[, x := as.integer(Eval == "A" & count ==1 & med %in% c("h", "k")) , by = ID]
and similarly for 'y':
df[, y := as.integer(Eval == "B" & count ==1 & med %in% c("h", "k")) , by = ID]
Then summarise with any after grouping by 'ID':
df[, lapply(.SD, function(x) as.integer(any(x))) , ID, .SDcols = x:y]
# ID x y
#1: 1 0 0
#2: 2 1 1
#3: 3 0 1
#4: 4 0 1
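If we need a compact approach, then instead of assigning (:=) we can summarise the output grouped by 'ID' and 'Eval' according to the conditions. After that, grouped by 'ID', we check whether there is any TRUE value in 'x' and 'y' by looping over the columns specified in .SDcols.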
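As a side note on why the full-column assignment in the first solution leaves no NA values, here is a minimal sketch on a toy table. The table `dt` and the column names `x_subset` and `x_full` are hypothetical and only serve the illustration; the count condition is left out to keep it short.

```r
library(data.table)

# Toy table (hypothetical data, not the df from the question)
dt <- data.table(Eval = c("A", "B", "A"), med = c("h", "h", "c"))

# Subset assignment: rows that do not match Eval == "A" are never written,
# so the new column is NA there
dt[Eval == "A", x_subset := ifelse(med %in% c("h", "k"), 1, 0)]

# Full-column assignment: the Eval test is part of the logical expression,
# so every row evaluates to TRUE or FALSE and is coerced to 1 or 0
dt[, x_full := as.integer(Eval == "A" & med %in% c("h", "k"))]

print(dt)
```

The second row ends up NA in `x_subset` but 0 in `x_full`, which is the difference between the question's output and the first solution.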
setDT(df)[, if(any(uniqueN(med)==1 & med %in% c("h", "k"))) {
.(x= Eval=="A", y= Eval == "B") } else .(x=FALSE, y=FALSE),
by = .(ID, Eval)][, lapply(.SD, any) , by = ID, .SDcols = x:y]
# ID x y
#1: 1 FALSE FALSE
#2: 2 TRUE TRUE
#3: 3 FALSE TRUE
#4: 4 FALSE TRUE
If needed, we can convert the result to binary in the same way as shown in the first solution.
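A possible sketch of that conversion, assuming the data from the question: the only change to the compact chain is that the final step wraps `any()` in `as.integer()`. The result name `res` is introduced here for illustration.

```r
library(data.table)

# Data from the question
df <- data.table(
  ID   = c(1, 1, 2, 2, 2, 2, 3, 3, 3, 3, 3, 4, 4, 4),
  Eval = c("A", "A", "B", "B", "A", "A", "A", "A", "B", "B", "A", "A", "A", "B"),
  med  = c("c", "d", "k", "k", "h", "h", "c", "d", "h", "h", "h", "c", "h", "k")
)

# Step 1: one logical pair per (ID, Eval) group.
# Step 2: collapse to one row per ID and coerce the logicals to 0/1.
res <- df[, if (any(uniqueN(med) == 1 & med %in% c("h", "k"))) {
            .(x = Eval == "A", y = Eval == "B")
          } else .(x = FALSE, y = FALSE),
          by = .(ID, Eval)
        ][, lapply(.SD, function(v) as.integer(any(v))),
          by = ID, .SDcols = c("x", "y")]

print(res)
```

Naming the columns in `.SDcols` as strings is a choice made here for explicitness.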
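The OP's goal...
"I try to create variables x and y, grouped by ID and Eval. For each ID, if Eval = A and med = "h" or "k", I set x = 1, otherwise x = 0; if Eval = B and med = "h" or "k", I set y = 1, otherwise y = 0. [...] Then I need to collapse the rows to get a unique ID"
can be simplified to...
For each ID and Eval, flag whether all med values are h or all med values are k.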
setDT(df) # only do this once
df[, all(med=="k") | all(med=="h"), by=.(ID,Eval)][, dcast(.SD, ID ~ Eval, fun=any)]
ID A B
1: 1 FALSE FALSE
2: 2 TRUE TRUE
3: 3 FALSE TRUE
4: 4 FALSE TRUE
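To see what dcast is doing, read ?dcast and try running the first part on its own: df[, all(med=="k") | all(med=="h"), by=.(ID,Eval)].
Changing the output to use x and y instead of A and B is straightforward but ill-advised (unnecessary renaming can be confusing, and it creates extra work when new Eval values appear); the same goes for changing to 1/0 instead of TRUE/FALSE (the values being captured really are boolean).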
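The two steps can also be run separately, as in the sketch below, which assumes the data from the question. The names `step1`, `wide` and the column name `flag` are introduced here for illustration, and `value.var` and `fun.aggregate` are spelled out rather than left for dcast to guess.

```r
library(data.table)

# Data from the question
df <- data.table(
  ID   = c(1, 1, 2, 2, 2, 2, 3, 3, 3, 3, 3, 4, 4, 4),
  Eval = c("A", "A", "B", "B", "A", "A", "A", "A", "B", "B", "A", "A", "A", "B"),
  med  = c("c", "d", "k", "k", "h", "h", "c", "d", "h", "h", "h", "c", "h", "k")
)

# Step 1: one row per (ID, Eval) with a logical flag
step1 <- df[, .(flag = all(med == "k") | all(med == "h")), by = .(ID, Eval)]
print(step1)

# Step 2: reshape to wide format, one column per Eval value.
# A missing (ID, Eval) combination, such as ID 1 with Eval "B",
# is filled with any(logical(0)), which is FALSE.
wide <- dcast(step1, ID ~ Eval, value.var = "flag", fun.aggregate = any)
print(wide)
```

Printing `step1` shows seven rows, one per existing (ID, Eval) pair; the reshape then spreads them into one row per ID.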