use*_*868 13 r count data.table
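I have panel data (subject/year), and I want to keep only the subjects that appear the maximum number of times within each year. The dataset is large, so I am using the data.table package. Is there a more elegant solution than what I have tried below?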
library(data.table)
DT <- data.table(SUBJECT=c(rep('John',3), rep('Paul',2),
rep('George',3), rep('Ringo',2),
rep('John',2), rep('Paul',4),
rep('George',2), rep('Ringo',4)),
YEAR=c(rep(2011,10), rep(2012,12)),
HEIGHT=rnorm(22),
WEIGHT=rnorm(22))
DT
DT[, COUNT := .N, by='SUBJECT,YEAR']
DT[, MAXCOUNT := max(COUNT), by='YEAR']
DT <- DT[COUNT==MAXCOUNT]
DT <- DT[, c('COUNT','MAXCOUNT') := NULL]
DT
Mat*_*wle 15
I'm not sure you'll consider this elegant, but how about:
DT[, COUNT := .N, by='SUBJECT,YEAR']
DT[, .SD[COUNT == max(COUNT)], by='YEAR']
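That is essentially how to apply by to the i expression, as @SenorO commented. You still need [, COUNT := NULL] afterwards, but for one temporary column rather than two.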
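As a minimal sketch of the full sequence, including the cleanup step, the following uses a smaller made-up table shaped like the one in the question. The names res and the toy values are assumptions for illustration only. Note that the grouped query returns a new table, so the helper column has to be dropped from both the result and the original.

```r
library(data.table)

# Toy data shaped like the question's table (values are arbitrary).
DT <- data.table(SUBJECT = c(rep("John", 3), rep("Paul", 2),
                             rep("John", 2), rep("Paul", 4)),
                 YEAR    = c(rep(2011, 5), rep(2012, 6)),
                 HEIGHT  = rnorm(11))

# One temporary column: number of rows per subject within each year.
DT[, COUNT := .N, by = .(SUBJECT, YEAR)]

# Within each year, keep the rows whose count equals that year's maximum.
res <- DT[, .SD[COUNT == max(COUNT)], by = YEAR]

# Drop the helper column from the result and from the original table.
res[, COUNT := NULL]
DT[, COUNT := NULL]
```

If two subjects tie for the maximum in a year, both are kept.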
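We do discourage .SD for speed reasons, though. Hopefully we will get to this feature request soon so that the advice can be dropped: FR#2330, Optimize .SD[i] query to keep the elegance but make it faster.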
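One commonly used way to avoid .SD here, not part of the original answer, is to collect row numbers with the special symbol .I and then subset the table once. The sketch below assumes the same toy table as the question's layout; idx and res are hypothetical names.

```r
library(data.table)

# Toy data shaped like the question's table (values are arbitrary).
DT <- data.table(SUBJECT = c(rep("John", 3), rep("Paul", 2),
                             rep("John", 2), rep("Paul", 4)),
                 YEAR    = c(rep(2011, 5), rep(2012, 6)),
                 HEIGHT  = rnorm(11))

DT[, COUNT := .N, by = .(SUBJECT, YEAR)]

# .I holds row numbers in DT; per year, keep the row numbers that reach the maximum count.
# The unnamed j result is returned in a column called V1.
idx <- DT[, .I[COUNT == max(COUNT)], by = YEAR]$V1

# A single subset of the big table by row number, with no per-group .SD materialised.
res <- DT[idx]

# Clean up the helper column in both tables.
res[, COUNT := NULL]
DT[, COUNT := NULL]
```

Whether this is actually faster depends on the data.table version and the data, so it is worth timing on the real table.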
A different approach is as follows. It is faster and more idiomatic, but may be considered less elegant.
# Create a small aggregate table first. No need to use := on the big table.
i = DT[, .N, by='SUBJECT,YEAR']
# Find the even smaller subset. (Do as much as we can on the small aggregate.)
i = i[, .SD[N==max(N)], by=YEAR]
# Finally join the small subset of key values to the big table
setkey(DT, YEAR, SUBJECT)
DT[i]
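The aggregate-then-join idea can also be sketched without setting a key on the big table, assuming a data.table version that supports the on= argument (1.9.6 or later). The names agg, top and res are hypothetical, and the toy data is made up. Only the key columns of the small table are passed to the join, so the count column N does not end up in the result.

```r
library(data.table)

# Toy data shaped like the question's table (values are arbitrary).
DT <- data.table(SUBJECT = c(rep("John", 3), rep("Paul", 2),
                             rep("John", 2), rep("Paul", 4)),
                 YEAR    = c(rep(2011, 5), rep(2012, 6)),
                 HEIGHT  = rnorm(11))

# Small aggregate: one row per (SUBJECT, YEAR) with its row count N.
agg <- DT[, .N, by = .(SUBJECT, YEAR)]

# Even smaller subset: the most frequent subject(s) within each year.
top <- agg[, .SD[N == max(N)], by = YEAR]

# Join the key values back to the big table; on= avoids reordering DT with setkey().
res <- DT[top[, .(YEAR, SUBJECT)], on = .(YEAR, SUBJECT)]
```

Using .SD on agg is cheap here because the aggregate has only one row per subject and year.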
Something similar is here.