R data.table: avoiding class differences between RHS and LHS

Amy*_*y M 5 double integer r class data.table

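I have a dataset containing some groups, and I want to count the number of records in each group that meet a certain condition. I then want to expand the result to the remaining records in each group (i.e. those that do not meet the condition), because I will collapse the table later.

I am using data.table to do this, with .N to count the records in each group that meet my condition. I then take the maximum of all values within each group to apply the result to every record in that group. My dataset is fairly large (nearly 5 million records).

I keep getting the following error: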

  Error in `[.data.table`(dpart, , `:=`(clustersize4wk, max(clustersize4wk,  : 
  Type of RHS ('double') must match LHS ('integer'). To check and coerce would impact performance too much for the fastest cases. Either change the type of the target column, or coerce the RHS of := yourself (e.g. by using 1L instead of 1)

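At first I assumed that .N produces an integer while taking the maximum by group produces a double, but that does not seem to be the case (in the toy example below, the class of the resulting column stays integer), and I cannot reproduce the problem.

To illustrate, here is an example: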

# Example data:
library(data.table)

mydt <- data.table(id = c("a", "a", "b", "b", "b", "c", "c", "c", "c", "d", "d", "d"),
                   grp = c("G1", "G1", "G1", "G1", "G1", "G2", "G2", "G2", "G2", "G2", "G2", "G2"),
                   name = c("Jack", "John", "Jill", "Joe", "Jim", "Julia", "Simran", "Delia", "Aurora", "Daniele", "Joan", "Mary"),
                   sex = c("m", "m", "f", "m", "m", "f", "m", "f", "f", "f", "f", "f"), 
                   age = c(2,12,29,15,30,75,5,4,7,55,43,39), 
                   reportweek = c("201740", "201750", "201801", "201801", "201801", "201748", "201748", "201749", "201750", "201752", "201752", "201801"))

I count the number of males in each group as follows:

mydt[sex == "m", csize := .N, by = id]

> is.integer(mydt$csize)
[1] TRUE
> is.double(mydt$csize)
[1] FALSE

Some groups contain no males, so to avoid getting Inf in the next step, I recode the NAs as 0:

mydt[ is.na(csize), csize := 0]

I then expand the result to all members of each group as follows:

mydt[, csize := max(csize, na.rm = T), by = id] 

> is.integer(mydt$csize)
[1] TRUE
> is.double(mydt$csize)
[1] FALSE

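This is the point at which the error appears with my real dataset. With the example data, I can reproduce the error if I omit the step that recodes NA as 0; otherwise I cannot. With my real dataset, I also still get the following warning, even though the NAs have been recoded as 0: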

19: In max(clustersize4wk, na.rm = TRUE) :
  no non-missing arguments to max; returning -Inf 

How can I solve this?

My expected output is as follows:

> mydt
    id grp    name sex age reportweek csize
 1:  a  G1    Jack   m   2     201740     2
 2:  a  G1    John   m  12     201750     2
 3:  b  G1    Jill   f  29     201801     2
 4:  b  G1     Joe   m  15     201801     2
 5:  b  G1     Jim   m  30     201801     2
 6:  c  G2   Julia   f  75     201748     1
 7:  c  G2  Simran   m   5     201748     1
 8:  c  G2   Delia   f   4     201749     1
 9:  c  G2  Aurora   f   7     201750     1
10:  d  G2 Daniele   f  55     201752     0
11:  d  G2    Joan   f  43     201752     0
12:  d  G2    Mary   f  39     201801     0

MKR*_*MKR 3

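The actual problem is the type of csize. It is integer, whereas max returns a double here: for a group with no non-missing values, max(..., na.rm = TRUE) returns -Inf, which is a double.

A possible fix: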
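
To see where the double comes from, here is a small base R check. The vectors x and all_na are made up for illustration and are not taken from the question's data; the sketch only shows how max behaves on integer input with and without non-missing values.

```r
# max() keeps the integer type when at least one non-missing value exists
x <- c(2L, NA, 1L)
int_result <- max(x, na.rm = TRUE)
typeof(int_result)                       # "integer"

# An all-NA group leaves nothing after na.rm = TRUE, so max() warns
# and returns -Inf, which can only be stored as a double
all_na <- c(NA_integer_, NA_integer_)
empty_result <- suppressWarnings(max(all_na, na.rm = TRUE))
typeof(empty_result)                     # "double"

# Passing an integer fallback as an extra argument keeps the result integer
fallback_result <- max(all_na, 0L, na.rm = TRUE)
typeof(fallback_result)                  # "integer"
```

This matches the "no non-missing arguments to max; returning -Inf" warning quoted in the question.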

mydt[sex == "m", csize := as.double(.N), by = id]

mydt[, csize := max(csize, 0, na.rm = TRUE), by = id]
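
As a rough end-to-end check, the two lines above can be run on the question's toy data, reduced here to the id and sex columns (the only ones the calculation uses). The per_id summary table is a hypothetical helper name added for illustration.

```r
library(data.table)

# Toy data from the question, restricted to the columns needed
mydt <- data.table(
  id  = c("a", "a", "b", "b", "b", "c", "c", "c", "c", "d", "d", "d"),
  sex = c("m", "m", "f", "m", "m", "f", "m", "f", "f", "f", "f", "f")
)

# Count males per id; as.double() makes the new column double from the start
mydt[sex == "m", csize := as.double(.N), by = id]

# Spread the count to every row of the group; the extra 0 covers groups
# whose csize is entirely NA, so max() never falls back to -Inf
mydt[, csize := max(csize, 0, na.rm = TRUE), by = id]

# One row per id, to compare against the expected output in the question
per_id <- unique(mydt[, .(id, csize)])
print(per_id)
```

The per-id values should come out as 2, 2, 1 and 0 for a, b, c and d, in line with the expected output in the question, with csize stored as double.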
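
If an integer column is preferred, one alternative (not part of the answer above, just a sketch of a different technique) is to skip the NA and max steps and count the males per group in a single grouped assignment. It relies on sum() of a logical vector returning an integer, and uses the same reduced toy data as an assumption.

```r
library(data.table)

# Toy data from the question, restricted to the columns needed
mydt2 <- data.table(
  id  = c("a", "a", "b", "b", "b", "c", "c", "c", "c", "d", "d", "d"),
  sex = c("m", "m", "f", "m", "m", "f", "m", "f", "f", "f", "f", "f")
)

# sum() over a logical vector yields an integer count, and groups with no
# males get 0 directly, so no NA recoding or max() step is needed
mydt2[, csize := sum(sex == "m"), by = id]

typeof(mydt2$csize)   # "integer"
```

Because every group gets an integer value, the RHS and LHS types agree for all groups.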