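I want to compute the mean of a column over a subset of the data, and write that mean into a new column for the whole data set.
Here is some code to make this clearer: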
library(data.table)
t <- data.table(Label=c(0,1,0,1,1,1), x=c("aa","aa","aa","aa","bb","bb"), environment=c("train","train","test","test","train","test"))
t
Label x environment
1: 0 aa train
2: 1 aa train
3: 0 aa test
4: 1 aa test
5: 1 bb train
6: 1 bb test
setkey(t,x)
t[environment=="train",avg := mean(Label),by=c("x")]
t
Label x environment avg
1: 0 aa train 0.5
2: 1 aa train 0.5
3: 0 aa test NA
4: 1 aa test NA
5: 1 bb train 1.0
6: 1 bb test NA
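The code above works, except that it does not update the rows where environment == "test". That is expected, since the mean was computed and assigned only on the "train" subset.
So I want to keep the mean computed from the subset, but fill the avg column for all rows, including the "test" ones.
So the result should be: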
t
Label x environment avg
1: 0 aa train 0.5
2: 1 aa train 0.5
3: 0 aa test 0.5 # average calculated with train rows only
4: 1 aa test 0.5 # average calculated with train rows only
5: 1 bb train 1.0
6: 1 bb test 1.0 # average calculated with train rows only
It seems this is what you are after:
t[environment == "train", avg := mean(Label), by = x][, avg := mean(avg, na.rm = TRUE), by = x]
t
## Label x environment avg
## 1: 0 aa train 0.5
## 2: 1 aa train 0.5
## 3: 0 aa test 0.5
## 4: 1 aa test 0.5
## 5: 1 bb train 1.0
## 6: 1 bb test 1.0
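
The two chained steps above first fill avg on the train rows and then spread the non-NA value across each group. As a hedged alternative sketch (not the answerer's method), the same result can probably be reached either in a single grouped assignment or with an update join. The names `dt`, `train_means` and `avg2` are made up for this illustration; `dt` is used instead of `t` so that base R's `t()` is not masked.

```r
library(data.table)

# Same sample data as in the question; named `dt` rather than `t`
# so that base R's t() (transpose) is not masked.
dt <- data.table(
  Label = c(0, 1, 0, 1, 1, 1),
  x = c("aa", "aa", "aa", "aa", "bb", "bb"),
  environment = c("train", "train", "test", "test", "train", "test")
)

# Variant 1: no subset in `i`, so every row of each group is assigned;
# the mean itself only looks at that group's train rows.
dt[, avg := mean(Label[environment == "train"]), by = x]

# Variant 2: compute the train means once, then copy them onto all
# matching rows with an update join (`i.avg` is the avg column of train_means).
# `avg2` is a hypothetical column name, used only to compare with variant 1.
train_means <- dt[environment == "train", .(avg = mean(Label)), by = x]
dt[train_means, avg2 := i.avg, on = "x"]

print(dt)
```

One difference to keep in mind: for a group of x that has no train rows at all, variant 1 (like the chained version with na.rm = TRUE) yields NaN, whereas the update join leaves NA in that group.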