I have a data frame like the following:
df <- data.frame(Site=rep(paste0('site', 1:5), 50),
                 Month=sample(1:12, 50, replace=TRUE),
                 Count=sample(1:1000, 50, replace=TRUE))
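I want to remove every site whose count is always < 5% of the maximum monthly count across all sites.
The maximum monthly counts across all sites are: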
library(plyr)
ddply(df, .(Month), summarise, Max.Count=max(Count))
If a count of 1 is assigned to site5, its counts are always < 5% of the maximum monthly count across all sites, so I want site5 to be removed.
df$Count[df$Site=='site5'] <- 1
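However, after assigning new values to site2, some of its counts are < 5% of the maximum monthly count while others are > 5%, so I do not want site2 to be removed.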
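# length.out matches the number of site2 rows, so the assignment does not recycle with a warning
df$Count[df$Site=='site2'] <- ceiling(seq(1, 1000, length.out=sum(df$Site=='site2')))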
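How can I subset the data frame to remove any site whose counts are always < 5% of the maximum monthly count? If the question is unclear, please let me know and I will revise it.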
A data.table solution:
require(data.table)
set.seed(45)
df <- data.frame(Site=rep(paste0('site', 1:5), 50),
                 Month=sample(1:12, 50, replace=TRUE),
                 Count=sample(1:1000, 50, replace=TRUE))
df$Count[df$Site=='site5'] <- 1
dt <- data.table(df, key=c("Month", "Site"))
# set max.count per month (the maximum across all sites in that month)
dt[, max.count := max(Count), by = list(Month)]
# get the site that is TRUE for all months it is present
d1 <- dt[, list(check = all(Count < .05 * max.count)), by = list(Month, Site)]
sites <- as.character(d1[, all(check == TRUE), by=Site][V1 == TRUE, Site])
# use %in% rather than != so this still works when more than one site qualifies
dt.out <- dt[!Site %in% sites][, max.count := NULL]
#       Site Month Count
#   1: site1     1   939
#   2: site1     1   939
#   3: site1     1   939
#   4: site1     1   939
#   5: site1     1   939
#  ---
# 196: site2    12   969
# 197: site2    12   684
# 198: site2    12   613
# 199: site2    12   969
# 200: site2    12   684
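The check can probably be collapsed into a single grouping by Site, since "always below 5%" is just all() over every row of that site. The following is only a sketch under that assumption; the toy data and the names drop.sites and always.low are made up for illustration, and two sites are deliberately always low to show that more than one site can be dropped at once.

```r
library(data.table)

# Hypothetical toy data: site4 and site5 are always below 5% of the monthly maximum
df <- data.frame(Site  = rep(c("site1", "site4", "site5"), each = 3),
                 Month = rep(1:3, times = 3),
                 Count = c(400, 600, 900,   # site1 sets the monthly maxima
                           2, 3, 1,         # site4: always below 5%
                           1, 1, 1),        # site5: always below 5%
                 stringsAsFactors = FALSE)

dt <- as.data.table(df)

# Maximum count per month across all sites
dt[, max.count := max(Count), by = Month]

# Sites whose every row is below 5% of its month's maximum
drop.sites <- dt[, .(always.low = all(Count < 0.05 * max.count)), by = Site][
  always.low == TRUE, Site]

# Keep everything else and discard the helper column
dt.out <- dt[!Site %in% drop.sites][, max.count := NULL]
```

Here drop.sites should contain site4 and site5, and dt.out should hold only the three site1 rows.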
Here is a plyr solution:
## df2$test is true if Count >= max(Count)*0.05 for this month
df2 <- ddply(df, .(Month), transform, test=Count>=(max(Count)*0.05))
## For each site, test$keep is true if at least one count is >= max(Count)*0.05 for this month
test <- ddply(df2, .(Site), summarise, keep=sum(test)>0)
## Subsetting
sites <- test$Site[test$keep]
df[df$Site %in% sites,]
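The same keep-if-any-count-reaches-5% logic can also be sketched in base R without extra packages, using ave() for the per-month maximum and tapply() for the per-site test. This is a minimal sketch, not a replacement for the answers above; the small deterministic data frame and the names month.max, reaches and keep.sites are assumptions made for illustration.

```r
# Hypothetical toy data, chosen so the outcome is easy to check by hand
df <- data.frame(Site  = rep(c("site1", "site2", "site5"), each = 4),
                 Month = rep(1:4, times = 3),
                 Count = c(500, 800, 300, 900,   # site1: always above 5%
                           1, 1000, 2, 700,      # site2: sometimes below, sometimes above
                           1, 1, 1, 1),          # site5: always below 5%
                 stringsAsFactors = FALSE)

# Maximum count per month across all sites, aligned with the rows of df
month.max <- ave(df$Count, df$Month, FUN = max)

# TRUE for rows that reach at least 5% of their month's maximum
reaches <- df$Count >= 0.05 * month.max

# A site is kept if at least one of its rows reaches the threshold
keep <- tapply(reaches, df$Site, any)
keep.sites <- names(keep)[keep]

df.out <- df[df$Site %in% keep.sites, ]
```

With this toy data, site1 and site2 are kept (8 rows) and site5 is dropped, matching the behaviour asked for in the question.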