R data.table: update a column for the whole table using a mean computed on a subset

tuc*_*son 3 r data.table

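I want to compute the mean of a column over a subset of the data, and write that mean into a new column for the whole data set.

Here is some code to make things clearer: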

library(data.table)
t <- data.table(Label = c(0, 1, 0, 1, 1, 1), x = c("aa", "aa", "aa", "aa", "bb", "bb"), environment = c("train", "train", "test", "test", "train", "test"))
t
   Label  x environment
1:     0 aa       train
2:     1 aa       train
3:     0 aa        test
4:     1 aa        test
5:     1 bb       train
6:     1 bb        test
setkey(t,x)
t[environment=="train",avg := mean(Label),by=c("x")]

t
   Label  x environment avg
1:     0 aa       train 0.5
2:     1 aa       train 0.5
3:     0 aa        test  NA
4:     1 aa        test  NA
5:     1 bb       train 1.0
6:     1 bb        test  NA

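The code above works, except that it does not update the rows where environment == "test". That is expected, because I computed the mean on a subset that excludes those rows.

So I want to keep the mean computed on the subset, but fill the avg column for all rows, including the "test" rows.

The result should therefore be: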

t
   Label  x environment avg
1:     0 aa       train 0.5
2:     1 aa       train 0.5
3:     0 aa        test 0.5 # average calculated with train rows only
4:     1 aa        test 0.5 # average calculated with train rows only
5:     1 bb       train 1.0
6:     1 bb        test 1.0 # average calculated with train rows only

Dav*_*urg 5

It seems this is what you are after:

t[environment == "train", avg := mean(Label), by = x][, avg := mean(avg, na.rm = TRUE), by = x]
t 

##   Label  x environment avg
## 1:     0 aa       train 0.5
## 2:     1 aa       train 0.5
## 3:     0 aa        test 0.5
## 4:     1 aa        test 0.5
## 5:     1 bb       train 1.0
## 6:     1 bb        test 1.0
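The chained call works because the second step takes, within each x group, the mean of the avg values already filled in for the train rows (all identical) while ignoring the NA values on the test rows. As a possible alternative, not part of the original answer, the two sketches below reach the same result on the question's example data. The names dt, train_means and avg_join are hypothetical and used only for illustration.

```r
library(data.table)

# Same example data as in the question. `dt` is a hypothetical name,
# chosen to avoid masking base R's t() function.
dt <- data.table(
  Label = c(0, 1, 0, 1, 1, 1),
  x = c("aa", "aa", "aa", "aa", "bb", "bb"),
  environment = c("train", "train", "test", "test", "train", "test")
)

# Option 1: no row filter in i, so := writes to every row of each group.
# The train-only restriction is applied inside j instead.
dt[, avg := mean(Label[environment == "train"]), by = x]

# Option 2: compute the per-group train means as a small lookup table,
# then copy them onto every matching row with an update join.
# `avg_join` is a hypothetical column name, kept separate from `avg`
# so both results can be compared.
train_means <- dt[environment == "train", .(avg_join = mean(Label)), by = x]
dt[train_means, on = "x", avg_join := i.avg_join]

dt
```

The options differ when a group has no train rows at all: Option 1 yields NaN for that group (the mean of an empty vector), while Option 2 leaves those rows as NA because the join finds no match.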