How do I use ifelse row-wise in an R data.table?

Question

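I'd like to create a new column in an R data.table based on an ifelse() comparison of different columns. However, I'd like the ifelse statement to be applied row by row. I've tried using data.table's by feature for groups, but it seems to apply the test condition of ifelse per row while evaluating the yes condition across all values in the columns, rather than row by row as by would suggest. Below is an example and some solutions I've tried.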

I have an R data.table like this:

> library(data.table)
> set.seed(45)
> DT <- data.table(date = c(rep("2018-01-01", 3), rep("2018-01-02", 3), rep("2018-01-03", 3)), 
+                  id = rep(letters[1:3], 3), 
+                  v1 = sample(x = -20:20, size = 9), 
+                  v2 = sample(x = -20:20, size = 9))
> str(DT)
Classes ‘data.table’ and 'data.frame':  9 obs. of  4 variables:
 $ date: chr  "2018-01-01" "2018-01-01" "2018-01-01" "2018-01-02" ...
 $ id  : chr  "a" "b" "c" "a" ...
 $ v1  : int  5 -8 -11 -6 -7 -10 -13 -2 -14
 $ v2  : int  -20 -6 14 -9 -3 -5 19 12 -16
 - attr(*, ".internal.selfref")=<externalptr> 
> DT
         date id  v1  v2
1: 2018-01-01  a   5 -20
2: 2018-01-01  b  -8  -6
3: 2018-01-01  c -11  14
4: 2018-01-02  a  -6  -9
5: 2018-01-02  b  -7  -3
6: 2018-01-02  c -10  -5
7: 2018-01-03  a -13  19
8: 2018-01-03  b  -2  12
9: 2018-01-03  c -14 -16

I would like the following output:

> DT_out
         date id  v1  v2  c
1: 2018-01-01  a   5 -20  0
2: 2018-01-01  b  -8  -6  0
3: 2018-01-01  c -11  14 11
4: 2018-01-02  a  -6  -9  0
5: 2018-01-02  b  -7  -3  0
6: 2018-01-02  c -10  -5  0
7: 2018-01-03  a -13  19 13
8: 2018-01-03  b  -2  12  2
9: 2018-01-03  c -14 -16  0

Solutions I've tried:

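Attempt #1) No error, but min is evaluated over all values of v1 and v2. This behavior is to be expected; still, I find it odd that the test condition is evaluated row by row even without a key set or a by statement: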

> DT[, c := ifelse(v1 < 0 & v2 > 0, min(-v1, v2), 0)]
> DT
         date id  v1  v2   c
1: 2018-01-01  a   5 -20   0
2: 2018-01-01  b  -8  -6   0
3: 2018-01-01  c -11  14 -20
4: 2018-01-02  a  -6  -9   0
5: 2018-01-02  b  -7  -3   0
6: 2018-01-02  c -10  -5   0
7: 2018-01-03  a -13  19 -20
8: 2018-01-03  b  -2  12 -20
9: 2018-01-03  c -14 -16   0

Attempt #2) When I set a key and use by, nothing changes, but I get an error message.

> setkey(DT, date, id)
> DT[, c := ifelse(v1 < 0 & v2 > 0, min(-v1, v2), 0), by = list(date, id)]
Error in `[.data.table`(DT, , `:=`(c, ifelse(v1 < 0 & v2 > 0, min(-v1,  : 
  Type of RHS ('integer') must match LHS ('double'). To check and coerce would impact performance too much for the fastest cases. Either change the type of the target column, or coerce the RHS of := yourself (e.g. by using 1L instead of 1)
> DT
         date id  v1  v2   c
1: 2018-01-01  a   5 -20   0
2: 2018-01-01  b  -8  -6   0
3: 2018-01-01  c -11  14 -20
4: 2018-01-02  a  -6  -9   0
5: 2018-01-02  b  -7  -3   0
6: 2018-01-02  c -10  -5   0
7: 2018-01-03  a -13  19 -20
8: 2018-01-03  b  -2  12 -20
9: 2018-01-03  c -14 -16   0

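Since the combination of date and id is unique for each row, it's even harder for me to understand why the expression isn't evaluated per group, which in this case would mean per row.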

Perhaps I need to use .SDcols = .(date, id) and .SD inside ifelse, but I don't know how to use .SD inside ifelse.

Answer 1

leb*_*nok 6

You need to use pmin instead of min:

DT[, c := ifelse(v1 < 0 & v2 > 0, pmin(-v1, v2), 0)]

> DT
         date id  v1  v2  c
1: 2018-01-01  a   5 -20  0
2: 2018-01-01  b  -8  -6  0
3: 2018-01-01  c -11  14 11
4: 2018-01-02  a  -6  -9  0
5: 2018-01-02  b  -7  -3  0
6: 2018-01-02  c -10  -5  0
7: 2018-01-03  a -13  19 13
8: 2018-01-03  b  -2  12  2
9: 2018-01-03  c -14 -16  0

# see also:
?pmin

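pmax*() and pmin*() take one or more vectors as arguments, recycle them to common length and return a single vector giving the "parallel" maxima (or minima) of the argument vectors.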
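
To make the difference concrete, here is a minimal base R sketch using the first three rows of the question's data. The variable names overall, rowwise, wrong and right are only illustrative. min() collapses everything it is given into one number, which ifelse() then recycles, whereas pmin() returns one value per element:

```r
# First three rows of v1 and v2 from the question
v1 <- c(5L, -8L, -11L)
v2 <- c(-20L, -6L, 14L)

# min() pools all values of -v1 and v2 and returns a single number
overall <- min(-v1, v2)    # -20

# ifelse() recycles that single number into every row where the test is TRUE
wrong <- ifelse(v1 < 0 & v2 > 0, overall, 0)   # 0 0 -20

# pmin() compares element by element and returns one value per row
rowwise <- pmin(-v1, v2)   # -20 -6 11
right <- ifelse(v1 < 0 & v2 > 0, rowwise, 0)   # 0 0 11
```

This is also why the test looked row-wise in the question while the yes value did not: the comparison operators are vectorized, but min() is a summary function.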

[Added later]

Your original code also works if you change the column types first:

  DT[, v1 := as.numeric(v1)]   # was integer, converting to 'double'
  DT[, v2 := as.numeric(v2)]   # same conversion for v2
  DT[, c := ifelse(v1 < 0 & v2 > 0, min(-v1, v2), 0), by = list(date, id)]

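As far as I understand, the data.table philosophy is not to let R change column types "implicitly", but to keep them fixed until the type is changed explicitly.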

The manual says:

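Unlike <- for data.frame, the (potentially large) LHS is not coerced to match the type of the (often small) RHS. Instead the RHS is coerced to match the type of the LHS, if necessary. Where this involves double precision values being coerced to an integer column, a warning is given (whether or not fractional data is truncated). The motivation for this is efficiency. It is best to get the column types correct up front and stick to them. Changing a column type is possible but deliberately harder: provide a whole column as the RHS. This RHS is then plonked into that column slot and we call this plonk syntax, or replace column syntax if you prefer. By needing to construct a full length vector of a new type, you as the user are more aware of what is happening, and it's clearer to readers of your code that you really do intend to change the column type.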

So far so good. But, of course, the original error message is confusing.

 # To check and coerce would impact performance too much for the fastest cases.

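"For the fastest cases?" This has to be one of the fastest cases, since the dataset is microscopically small, and I bet nobody would notice the performance impact here if data.table allowed implicit type conversion. So the main motivation for this error message seems to be that the package author wants to enforce what he considers good practice.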

This also works (without converting the column types):

 DT[, c := ifelse(v1 < 0 & v2 > 0, as.numeric(min(-v1, v2)), 0), by = list(date, id)]  # 1

Or:

 DT[, c := ifelse(v1 < 0 & v2 > 0, min(-v1, v2), 0L), by = list(date, id)] # 2

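But you can't run those last two lines (#1 and #2) one after the other; the column c has to be deleted first. DT$c will be numeric in the first case and integer in the second.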
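
The type mismatch seems to come from ifelse() itself: the type of its result depends on which branch is actually used, so single-row groups can return integer for some rows and double for others. A minimal base R sketch, with variable names chosen only for illustration:

```r
# min() of integers stays integer, while a bare 0 is a double
typeof(min(-3L, 4L))   # "integer"
typeof(0)              # "double"

# With a length-one test, ifelse() returns the type of the branch it takes
from_yes <- ifelse(TRUE,  1L, 0)   # integer, taken from the yes value
from_no  <- ifelse(FALSE, 1L, 0)   # double, taken from the no value

# Giving both branches the same type keeps the result type stable
stable_int <- ifelse(FALSE, 1L, 0L)             # integer
stable_dbl <- ifelse(TRUE, as.numeric(1L), 0)   # double
```

With by = list(date, id) every group is a single row, so each call behaves like one of the length-one cases above, and data.table refuses to mix the resulting types in one column.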

Some additional experiments

DT[, c:= NULL] 
DT[, c := ifelse(v1 < 0, v1, 0), by = list(date, id)] 
# error, but the DT$c column is created with 0 as first element and the rest as NA
# the condition was FALSE for the first element, so numeric 0 became the first element of c
# ... but the next element would be integer, hence the error
DT$c # [1]  0 NA NA NA NA NA NA NA NA
DT[, c:= NULL] 
DT[, c := ifelse(v1 > 0, v1, 0), by = list(date, id)]
# error; DT$c column is integer, with 5 as first element and the rest as NA 
DT$c # [1]  5 NA NA NA NA NA NA NA NA
DT[, c:= NULL] 
DT[, c := ifelse(v1 < 0, as.numeric(v1), 0), by = list(date, id)] 
# works without error but results in numeric DT$c
is.numeric(DT$c) # TRUE
DT[, c := ifelse(v1 < 0, v1, 0L), by = list(date, id)]
# type error, DT$c was numeric and we are trying to add an integer column
DT[, c:= NULL] # deleting the c column again
DT[, c := ifelse(v1 < 0, v1, 0L), by = list(date, id)]
# no error now
is.integer(DT$c) # TRUE
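
As an alternative that sidesteps both ifelse() and by, here is a sketch that is not part of the original answer: fill the column with an integer default and then overwrite only the matching rows. The data is hard-coded from the question's printed table so that the example does not depend on the random seed. The fifelse() line assumes data.table 1.12.4 or later, and c2 is a hypothetical column name used only for comparison.

```r
library(data.table)

# Data copied from the question's printed table
DT <- data.table(
  date = rep(c("2018-01-01", "2018-01-02", "2018-01-03"), each = 3),
  id   = rep(letters[1:3], 3),
  v1   = c(5L, -8L, -11L, -6L, -7L, -10L, -13L, -2L, -14L),
  v2   = c(-20L, -6L, 14L, -9L, -3L, -5L, 19L, 12L, -16L)
)

# Start from an integer default, then update only the rows that pass the test
DT[, c := 0L]
DT[v1 < 0 & v2 > 0, c := pmin(-v1, v2)]

# Assumption: data.table >= 1.12.4, where fifelse() is available.
# fifelse() requires yes and no to have the same type, hence 0L.
DT[, c2 := fifelse(v1 < 0 & v2 > 0, pmin(-v1, v2), 0L)]

DT$c   # 0 0 11 0 0 0 13 2 0
```

Both variants keep the column integer throughout, so the RHS/LHS type error from the question does not arise.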
