I have the following data table:
dt <- fread("
ID | EO_1 | EO_2 | EO_3 | GROUP
ID_001 | 0.5 | 1.2 | | A
ID_002 | | | | A
ID_003 | | | | A
ID_004 | | | | A
ID_001 | 0.4 | 2.5 | | B
ID_002 | | | | B
ID_003 | | | | B
ID_004 | | | | B
",
sep = "|",
colClasses = c("character", "numeric", "numeric", "numeric", "character"))
I am trying to perform some row-wise operations that sometimes depend on data from the previous row. More specifically:
# EO_1: previous row's EO_1 times previous row's EO_2
calc_EO_1 <- function(
    EO_1,
    EO_2
){
    EO_1 <- shift(EO_1, type = "lag") * shift(EO_2, type = "lag")
    return(EO_1)
}
# EO_2: current EO_1 times previous row's EO_2 and EO_3
calc_EO_2 <- function(
    EO_1,
    EO_2,
    EO_3
){
    EO_2 <- EO_1 * shift(EO_2, type = "lag") * shift(EO_3, type = "lag")
    return(EO_2)
}
# EO_3: current EO_1 times current EO_2
calc_EO_3 <- function(
    EO_1,
    EO_2
){
    EO_3 <- EO_1 * EO_2
    return(EO_3)
}
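The last one has to be computed starting from the first row, because it depends on the other fields (that part should be easy). After that, all three operations have to be carried out sequentially, row by row.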
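To make the order of the dependencies concrete, here is row 2 of group A worked by hand from row 1. This is only a sketch of how I read the three functions above; the `*_prev` variable names are hypothetical and not part of the real code.

```r
# Row 1 of group A, as given in the data
eo1_prev <- 0.5
eo2_prev <- 1.2
eo3_prev <- eo1_prev * eo2_prev     # calc_EO_3 applied to the first row

# Row 2: each value depends on the previous row, in this order
eo1 <- eo1_prev * eo2_prev          # lagged EO_1 * lagged EO_2
eo2 <- eo1 * eo2_prev * eo3_prev    # new EO_1 * lagged EO_2 * lagged EO_3
eo3 <- eo1 * eo2                    # new EO_1 * new EO_2

print(c(EO_1 = eo1, EO_2 = eo2, EO_3 = eo3))
```

Row 3 then repeats the same three steps with row 2 as the "previous" row, which is why the rows cannot be filled independently of each other.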
The closest I have come is the following:
first_row_bygroup_index <- dt[, .I[1], by = GROUP]$V1
dt[first_row_bygroup_index,
   EO_3 := calc_EO_3(EO_1, EO_2)
]
dt[!first_row_bygroup_index,
   `:=` (
       EO_1 = calc_EO_1(EO_1, EO_2),
       EO_2 = calc_EO_2(EO_1, EO_2, EO_3),
       EO_3 = calc_EO_3(EO_1, EO_2)
   ),
   by = row.names(dt[!first_row_bygroup_index])]
But it only computes the first row of each group correctly:
ID | EO_1 | EO_2 | EO_3 | GROUP
ID_001 | 0.5 | 1.2 | 0.6 | A
ID_002 | | | | A
ID_003 | | | | A
ID_004 | | | | A
ID_001 | 0.4 | 2.5 | 1.0 | B
ID_002 | | | | B
ID_003 | | | | B
ID_004 | | | | B
where the blank cells are NA.
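I don't think I am far from a solution, but I cannot find a way to make it work. The problem is that I cannot perform operations within a subset of rows using rows from outside that subset.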
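A small illustration of why the blanks stay NA, as far as I can tell: `shift()` only lags within the vector it receives, and with one group per row that vector has length 1. The `demo` table and its column names are hypothetical toy data, not the real table.

```r
library(data.table)

# shift() lags within the vector it is given
print(shift(c(0.5, 0.6, 0.7), type = "lag"))   # NA 0.5 0.6

# A length-1 vector has no previous element, so the lag is NA
print(shift(0.5, type = "lag"))                # NA

# With one group per row, j only ever sees that single row
demo <- data.table(row = 1:3, x = c(0.5, 0.6, 0.7))
demo[, lagged := shift(x, type = "lag"), by = row]
print(demo)   # lagged is NA in every row
```

So grouping by row hides the previous row from `shift()`, and the subset `dt[!first_row_bygroup_index]` additionally hides the first row of each group.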
Edit: I forgot to include the expected result:
ID | EO_1 | EO_2 | EO_3 | GROUP
ID_001 | 0.50000000 | 1.20000000 | 0.60000000 | A
ID_002 | 0.60000000 | 0.43200000 | 0.25920000 | A
ID_003 | 0.25920000 | 0.02902376 | 0.00752296 | A
ID_004 | 0.00752296 | 0.00000164 | 0.00000001 | A
ID_001 | 0.40000000 | 2.50000000 | 1.00000000 | B
ID_002 | 1.00000000 | 2.50000000 | 2.50000000 | B
ID_003 | 2.50000000 | 15.62500000 | 39.06250000 | B
ID_004 | 39.06250000 | 23841.8580000 | 931322.57810000 | B
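New edit: I came up with the following snippet, but I would rather wait and see whether someone can offer a more efficient solution than this one: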
while(any(is.na(dt))){
    dt[, `:=` (
        EO_3 = calc_EO_3(EO_1, EO_2),
        EO_1 = ifelse(ID == "ID_001", EO_1, calc_EO_1(EO_1, EO_2)),
        EO_2 = ifelse(ID == "ID_001", EO_2, calc_EO_2(EO_1, EO_2, EO_3))
    )]
}
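I came up with a similar dplyr solution, along with that ugly while-loop fix. The key is to find a way to do row-wise calculations that can pull information from previous rows, even when those rows fall outside the selected subset. I hope someone can improve on this, so I will wait a bit before marking it as the solution.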
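For reference, one way to avoid the repeated passes of the while loop is a single forward pass per group with a plain `for` loop. This is a minimal sketch, not a benchmarked solution; `fill_group` is a hypothetical helper name, and the table is rebuilt with `data.table()` only to keep the snippet self-contained.

```r
library(data.table)

# Same data as in the question, built directly instead of via fread()
dt <- data.table(
  ID    = rep(sprintf("ID_%03d", 1:4), times = 2),
  EO_1  = c(0.5, NA, NA, NA, 0.4, NA, NA, NA),
  EO_2  = c(1.2, NA, NA, NA, 2.5, NA, NA, NA),
  EO_3  = NA_real_,
  GROUP = rep(c("A", "B"), each = 4)
)

# Hypothetical helper: fill one group's vectors in a single forward pass
fill_group <- function(eo1, eo2) {
  n <- length(eo1)
  eo3 <- numeric(n)
  eo3[1] <- eo1[1] * eo2[1]                     # first row: EO_3 only
  for (i in seq_len(n)[-1]) {                   # rows 2..n, in order
    eo1[i] <- eo1[i - 1] * eo2[i - 1]
    eo2[i] <- eo1[i] * eo2[i - 1] * eo3[i - 1]
    eo3[i] <- eo1[i] * eo2[i]
  }
  list(eo1, eo2, eo3)
}

# The whole group is passed in, so every row can see its predecessor
dt[, c("EO_1", "EO_2", "EO_3") := fill_group(EO_1, EO_2), by = GROUP]
print(dt)
```

Because the grouping is by `GROUP` rather than by row, the previous row is always available inside the helper, and each value is computed exactly once.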
Here is another possible approach:
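# First row of each group: EO_3 is simply EO_1 * EO_2
dt[!is.na(EO_1), EO_3 := EO_1 * EO_2, by=.(GROUP)]
# Remaining rows: carry running values forward within each group
dt[ID!="ID_001", c("EO_1", "EO_2", "EO_3") :=
    dt[,
        {
            # seed the running values from the group's first row
            eo1 <- EO_1[1L]; eo2 <- EO_2[1L]; eo3 <- EO_3[1L]
            .SD[ID!="ID_001",
                {
                    eo1 <- eo1 * eo2
                    eo2 <- eo1 * eo2 * eo3
                    eo3 <- eo1 * eo2
                    .(eo1, eo2, eo3)
                },
                by=.(ID)]
        },
        by=.(GROUP)][, -1L:-2L]  # drop GROUP and ID, keep the three values
]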
Output:
ID EO_1 EO_2 EO_3 GROUP
1: ID_001 0.50000000 1.200000e+00 6.000000e-01 A
2: ID_002 0.60000000 4.320000e-01 2.592000e-01 A
3: ID_003 0.25920000 2.902376e-02 7.522960e-03 A
4: ID_004 0.00752296 1.642598e-06 1.235720e-08 A
5: ID_001 0.40000000 2.500000e+00 1.000000e+00 B
6: ID_002 1.00000000 2.500000e+00 2.500000e+00 B
7: ID_003 2.50000000 1.562500e+01 3.906250e+01 B
8: ID_004 39.06250000 2.384186e+04 9.313226e+05 B
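For comparison, the same recurrence can also be written as a fold with base R's `Reduce(..., accumulate = TRUE)`, carrying the three values as one state vector. This is a sketch of a different technique, not the approach shown above; `step` is a hypothetical helper, and the table is rebuilt here only so the snippet runs on its own.

```r
library(data.table)

dt <- data.table(
  ID    = rep(sprintf("ID_%03d", 1:4), times = 2),
  EO_1  = c(0.5, NA, NA, NA, 0.4, NA, NA, NA),
  EO_2  = c(1.2, NA, NA, NA, 2.5, NA, NA, NA),
  EO_3  = NA_real_,
  GROUP = rep(c("A", "B"), each = 4)
)

# Hypothetical helper: one step of the recurrence.
# `state` is c(EO_1, EO_2, EO_3) of the previous row; `i` is unused.
step <- function(state, i) {
  eo1 <- state[1] * state[2]
  eo2 <- eo1 * state[2] * state[3]
  c(eo1, eo2, eo1 * eo2)
}

dt[, c("EO_1", "EO_2", "EO_3") := {
  init   <- c(EO_1[1], EO_2[1], EO_1[1] * EO_2[1])   # first row of the group
  states <- Reduce(step, seq_len(.N - 1L), init, accumulate = TRUE)
  transpose(states)   # list of per-row states -> list of three columns
}, by = GROUP]
print(dt)
```

The fold makes the "previous row" explicit as the accumulated state, so no lagging or row subsetting is needed.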