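I have a dataset in which each participant has several assessments over time. I want to select the last assessment for each participant. My dataset looks like this: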
ID week outcome
1 2 14
1 4 28
1 6 42
4 2 14
4 6 46
4 9 64
4 9 71
4 12 85
9 2 14
9 4 28
9 6 51
9 9 66
9 12 84
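I want to keep only the last observation/assessment for each participant, but the only indicator I have for each participant is the week number. How can this be done in R (or Excel)?
Thanks in advance,
Niki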
Jos*_*ien 11
Here is a base R approach:
do.call("rbind",
by(df, INDICES=df$ID, FUN=function(DF) DF[which.max(DF$week), ]))
ID week outcome
1 1 6 42
4 4 12 85
9 9 12 84
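To try the base R approach above, the sample data from the question first has to exist as a data frame. The following is a minimal sketch, assuming the data frame is called df as in the answer's code. The helper name last_per_id is hypothetical and only wraps the same by() call.

```r
# Rebuild the sample data from the question (column names as shown there).
df <- data.frame(
  ID      = c(1, 1, 1, 4, 4, 4, 4, 4, 9, 9, 9, 9, 9),
  week    = c(2, 4, 6, 2, 6, 9, 9, 12, 2, 4, 6, 9, 12),
  outcome = c(14, 28, 42, 14, 46, 64, 71, 85, 14, 28, 51, 66, 84)
)

# Hypothetical helper: keep the row with the largest `week` within each ID.
# which.max() returns the position of the first maximum, so if an ID had
# two rows for its last week, only the earlier of the two would be kept.
last_per_id <- function(d) {
  do.call("rbind",
          by(d, INDICES = d$ID, FUN = function(g) g[which.max(g$week), ]))
}

res <- last_per_id(df)
print(res)
```

Because by() splits on df$ID, the row names of the result are the ID values, which matches the printed output above.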
Alternatively, the data.table package provides a concise and expressive language for this kind of data frame manipulation:
library(data.table)
dt <- data.table(df, key="ID")
dt[, .SD[which.max(week), ], by=ID]
# ID week outcome
# [1,] 1 6 42
# [2,] 4 12 85
# [3,] 9 12 84
# Same result, but much faster.
# (Only identical as long as there are no ties for max(week) within an ID;
# with ties, this version returns every tied row.)
dt[ dt[, week==max(week), by=ID][[2]] ]
# If there are ties for max(week), the following still produces the same
# result as the .SD method (the first maximum per ID), but is faster.
# It relies on dt being sorted by ID, which key="ID" guarantees.
i1 <- dt[, which.max(week), by=ID][[2]] # position of the max within each ID
i2 <- dt[, .N, by=ID][[2]] # number of rows per ID
dt[i1 + cumsum(i2) - i2, ] # convert to row numbers in dt
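The row-index arithmetic above can also be written with data.table's .I symbol, which holds each row's position in the full table. This is a sketch of that alternative, not part of the original answer. It rebuilds dt from the question's data so that it runs on its own, and it keeps the row with the largest week for each ID, which is what the question asks for.

```r
library(data.table)

# Sample data from the question, keyed (and therefore sorted) by ID.
dt <- data.table(
  ID      = c(1, 1, 1, 4, 4, 4, 4, 4, 9, 9, 9, 9, 9),
  week    = c(2, 4, 6, 2, 6, 9, 9, 12, 2, 4, 6, 9, 12),
  outcome = c(14, 28, 42, 14, 46, 64, 71, 85, 14, 28, 51, 66, 84),
  key     = "ID"
)

# Within each ID, .I gives the row numbers in dt itself, so this collects
# one row number per ID: the first row holding that ID's largest week.
idx <- dt[, .I[which.max(week)], by = ID]$V1

# Subset dt once with the collected row numbers.
last_dt <- dt[idx]
print(last_dt)
```

Like the .SD version, this keeps a single row per ID even when the last week is tied, and it avoids materialising .SD for every group.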
Finally, here is a plyr-based solution:
library(plyr)
ddply(df, .(ID), function(X) X[which.max(X$week), ])
# ID week outcome
# 1 1 6 42
# 2 4 12 85
# 3 9 12 84
小智 9
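If you are only looking for the last observation for each ID, a simple two-line solution should do. I always go for a simple base solution when possible, although it is always good to have more than one way to solve a problem.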
dat <- dat[order(dat$ID, dat$Week), ] # Sort by ID and week (assign the result, or the sort is lost)
dat[!duplicated(dat$ID, fromLast=TRUE), ] # Keep last observation per ID
ID Week Outcome
3 1 6 42
8 4 12 85
13 9 12 84
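The sort only matters when the rows are not already in ID and week order. As a rough illustration, the sketch below rebuilds the sample data with the capitalised column names used in this answer, reverses the rows on purpose, and then applies the two steps. The name last_obs is introduced here only for the example.

```r
# Sample data from the question, using this answer's column names.
dat <- data.frame(
  ID      = c(1, 1, 1, 4, 4, 4, 4, 4, 9, 9, 9, 9, 9),
  Week    = c(2, 4, 6, 2, 6, 9, 9, 12, 2, 4, 6, 9, 12),
  Outcome = c(14, 28, 42, 14, 46, 64, 71, 85, 14, 28, 51, 66, 84)
)

# Reverse the rows so the data is no longer in ID/Week order.
dat <- dat[rev(seq_len(nrow(dat))), ]

# Step 1: sort by ID, then by Week, and keep the sorted copy.
dat <- dat[order(dat$ID, dat$Week), ]

# Step 2: scanning from the end, the first row seen for each ID is its
# latest week; every earlier row for that ID is flagged as a duplicate.
last_obs <- dat[!duplicated(dat$ID, fromLast = TRUE), ]
print(last_obs)
```

If an ID has two rows for its final week, the row that survives is whichever one sorts last, so adding a further column to order() is one way to make that choice explicit.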