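I have a data.frame with two columns: year and score. The years run from 2000 to 2012, and each year can appear multiple times. The score column lists every score recorded for that year, one score per row.
What I want to do is filter the data.frame so that only the row with the highest score for each year is kept.
So, if I have a small example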
year score
2000 18
2001 22
2000 21
I would like to get back
year score
2001 22
2000 21
Using plyr
library(plyr)
set.seed(45)
df <- data.frame(year=sample(2000:2012, 25, replace=TRUE), score=sample(25))
# one row per year, holding that year's maximum score
ddply(df, .(year), summarise, max.score=max(score))
Using data.table
library(data.table)
dt <- data.table(df, key="year")
# one row per year, holding that year's maximum score
dt[, list(max.score=max(score)), by=year]
Using aggregate:
o <- aggregate(df$score, list(df$year) , max)
names(o) <- c("year", "max.score")
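Using ave:
df1 <- df
# ave() repeats each year's maximum on every row of that year
df1$max.score <- ave(df1$score, df1$year, FUN=max)
# keep one row per year; note that score in the kept row is the first score seen
# for that year, while max.score holds the yearly maximum
df1 <- df1[!duplicated(df1$year), ]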
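The ave approach above attaches the yearly maximum as a new column. As a minimal sketch (the three-row data frame from the question is assumed here, and year_max and best are illustrative names), the same helper can also filter the original rows directly, so the whole row with the top score is kept:

```r
# Example data from the question (assumed here for illustration)
df <- data.frame(year = c(2000, 2001, 2000), score = c(18, 22, 21))

# ave() returns, for every row, the maximum score within that row's year
year_max <- ave(df$score, df$year, FUN = max)

# Keep only the rows whose score equals their year's maximum
best <- df[df$score == year_max, ]
print(best)
```

If several rows tie for the top score in a year, this keeps all of them.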
Edit: If there are more columns, the data.table solution would be the best choice (in my opinion :))
set.seed(45)
df <- data.frame(year=sample(2000:2012, 25, replace=TRUE), score=sample(25),
                 alpha = sample(letters[1:5], 25, replace=TRUE), beta=rnorm(25))
# convert to data.table with key=year
dt <- data.table(df, key="year")
# get the subset of data that matches this criterion
dt[, .SD[score %in% max(score)], by=year]
#     year score alpha       beta
#  1: 2000    20     b  0.8675148
#  2: 2001    21     e  1.5543102
#  3: 2002    22     c  0.6676305
#  4: 2003    18     a -0.9953758
#  5: 2004    23     d  2.1829996
#  6: 2005    25     b -0.9454914
#  7: 2007    17     e  0.7158021
#  8: 2008    12     e  0.6501763
#  9: 2011    24     a  0.7201334
# 10: 2012    19     d  1.2493954
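A commonly used alternative to the .SD subset is to collect row numbers with .I and index the table once. The following is only a sketch on an assumed three-row table (the hrs column and the names idx and top are illustrative, not from the answer above):

```r
library(data.table)

# Small assumed example with one extra column
dt <- data.table(year  = c(2000, 2001, 2000),
                 score = c(18, 22, 21),
                 hrs   = c(10, 11, 12))

# .I holds row numbers of the full table; per year, keep those of the top score.
# The unnamed result column is called V1 by default.
idx <- dt[, .I[score == max(score)], by = year]$V1

# Subset the original table by those row numbers, keeping every column
top <- dt[idx]
print(top)
```

This may be faster than the .SD form on large tables with many groups, though that depends on the data and the data.table version.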
If you know SQL, this is easier to understand (here the data frame is called mydata)
library(sqldf)
sqldf('select year, max(score) from mydata group by year')
Update (2016-01): you can now also use dplyr
library(dplyr)
mydata %>% group_by(year) %>% summarise(max = max(score))
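The summarise call returns only the year and its maximum. If the whole row is wanted, which is what the question asks for, a grouped filter is one way to do it. This is a minimal sketch on assumed data (the hrs column and the name top are illustrative):

```r
library(dplyr)

# Assumed example data with one extra column
mydata <- data.frame(year  = c(2000, 2001, 2000),
                     score = c(18, 22, 21),
                     hrs   = c(10, 11, 12))

# Within each year, keep the rows whose score equals that year's maximum
top <- mydata %>%
  group_by(year) %>%
  filter(score == max(score)) %>%
  ungroup()
print(top)
```

Rows that tie for the maximum are all kept; recent dplyr versions also offer slice_max(score, n = 1) for the same purpose.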
Using the base package
> df
  year score
1 2000    18
2 2001    22
3 2000    21
> aggregate(score ~ year, data=df, max)
  year score
1 2000    21
2 2001    22
Edit
If you need to keep the other columns, you can merge the aggregate result back with the original data to recover them
> df <- data.frame(year = c(2000, 2001, 2000), score = c(18, 22, 21), hrs = c(10, 11, 12))
> df
  year score hrs
1 2000    18  10
2 2001    22  11
3 2000    21  12
> merge(aggregate(score ~ year, data=df, max), df, all.x=TRUE)
  year score hrs
1 2000    21  12
2 2001    22  11