I have a data.frame in which a single value is measured for each subject at several time points. Simplified, it looks like this:
> set.seed(42)
> x = data.frame(subject=rep(c('a', 'b', 'c'), 3), time=rep(c(1,2,3), each=3), value=rnorm(3*3, 0, 1))
> x
  subject time       value
1       a    1  1.37095845
2       b    1 -0.56469817
3       c    1  0.36312841
4       a    2  0.63286260
5       b    2  0.40426832
6       c    2 -0.10612452
7       a    3  1.51152200
8       b    3 -0.09465904
9       c    3  2.01842371
I want to compute the change in value from each time point to the next, for each subject. For this simple example, my current solution is:
> x$diff[x$time==1] = x$value[x$time==2] - x$value[x$time==1]
> x$diff[x$time==2] = x$value[x$time==3] - x$value[x$time==2]
> x
  subject time       value       diff
1       a    1  1.37095845 -0.7380958
2       b    1 -0.56469817  0.9689665
3       c    1  0.36312841 -0.4692529
4       a    2  0.63286260  0.8786594
5       b    2  0.40426832 -0.4989274
6       c    2 -0.10612452  2.1245482
7       a    3  1.51152200         NA
8       b    3 -0.09465904         NA
9       c    3  2.01842371         NA
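..., and then I remove the last rows (the ones where diff is NA). However, in my real dataset time has many more levels, and I need to do this for several columns, not only for value. The code gets very ugly. Is there a tidy way to do this? A solution that does not assume the rows are sorted by time within each subject would be nice.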
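We can use data.table. Convert the 'data.frame' to a 'data.table' (setDT(x)), order by 'time', and, grouped by 'subject', subtract the current value from the next value (shift(value, type='lead')), then assign (:=) the result to create the 'Diff' column.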
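library(data.table)  # v1.9.6+
# order(time) makes the lead/current pairing independent of the row order;
# := adds the new column by reference, so no reassignment is needed
setDT(x)[order(time), Diff := shift(value, type = 'lead') - value,
         by = subject]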
#    subject time       value       Diff
#1:       a    1  1.37095845 -0.7380958
#2:       b    1 -0.56469817  0.9689665
#3:       c    1  0.36312841 -0.4692529
#4:       a    2  0.63286260  0.8786594
#5:       b    2  0.40426832 -0.4989274
#6:       c    2 -0.10612452  2.1245482
#7:       a    3  1.51152200         NA
#8:       b    3 -0.09465904         NA
#9:       c    3  2.01842371         NA
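The question also mentions needing this for several columns. A minimal sketch of one way to extend the same pattern, assuming a hypothetical second measurement column `value2` (not part of the original data) and using `.SDcols` to choose the columns; the `_diff` suffix for the new column names is likewise just an illustrative choice:

```r
library(data.table)

# Rebuild the example data from the question
set.seed(42)
x <- data.frame(subject = rep(c('a', 'b', 'c'), 3),
                time    = rep(c(1, 2, 3), each = 3),
                value   = rnorm(3 * 3, 0, 1))
x$value2 <- x$value * 2   # hypothetical second column, for illustration only

cols     <- c('value', 'value2')    # columns to difference
new_cols <- paste0(cols, '_diff')   # names for the result columns (assumed naming)

setDT(x)
# For every column in .SDcols: next value minus current value,
# taken in time order within each subject
x[order(time),
  (new_cols) := lapply(.SD, function(v) shift(v, type = 'lead') - v),
  by = subject, .SDcols = cols]

# Drop the rows of the last time point, which have no following value
res <- x[!is.na(value_diff)]
```

Here `res` keeps only the rows that have a following time point, which corresponds to the "remove the last rows" step in the question.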