I currently have the following data.frame, which was created by original.df %.% group_by(Category) %.% tally() %.% arrange(desc(n)).
DF <- structure(list(Category = c("E", "K", "M", "L", "I", "A",
"S", "G", "N", "Q"), n = c(163051, 127133, 106680, 64868, 49701,
47387, 47096, 45601, 40056, 36882)), .Names = c("Category",
"n"), row.names = c(NA, 10L), class = c("tbl_df", "tbl", "data.frame"
))
   Category      n
1         E 163051
2         K 127133
3         M 106680
4         L  64868
5         I  49701
6         A  47387
7         S  47096
8         G  45601
9         N  40056
10        Q  36882
I want to create an "Other" field from the categories ranked lowest by n, i.e.
  Category      n
1        E 163051
2        K 127133
3        M 106680
4        L  64868
5        I  49701
6    Other 217022
Right now, I am doing
rbind(filter(DF, rank(rev(n)) <= 5),
summarise(filter(DF, rank(rev(n)) > 5), Category = "Other", n = sum(n)))
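which collapses all categories not in the top 5 into the Other category.
But I'm curious whether there is a better way in dplyr or some other existing package. By "better" I mean more succinct/readable. I'm also interested in methods with cleverer or more flexible ways of choosing Other.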
This is another approach, assuming that each category (of the top 5 at least) only occurs once:
df %.%
arrange(desc(n)) %.% #you could skip this step since you arranged the input df already according to your question
mutate(Category = ifelse(1:n() > 5, "Other", Category)) %.%
group_by(Category) %.%
summarize(n = sum(n))
# Category n
#1 E 163051
#2 I 49701
#3 K 127133
#4 L 64868
#5 M 106680
#6 Other 217022
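Edit:
I just noticed that my output is no longer in decreasing order of n. After running the code again, I found that the order is kept until after group_by(Category), but once I run summarize afterwards, the order is gone (or rather, the result seems to be ordered by Category). Is that supposed to be like that?
Here are three more ways: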
m <- 5 #number of top results to show in final table (excl. "Other")
n <- m+1
#preserves the order (or rather: re-establishes it by index)
df <- arrange(df, desc(n)) %.% #this could be skipped if data already ordered
mutate(idx = 1:n(), Category = ifelse(idx > m, "Other", Category)) %.%
group_by(Category) %.%
summarize(n = sum(n), idx = first(idx)) %.%
arrange(idx) %.%
select(-idx)
#doesn't preserve the order (same result as in first dplyr solution, ordered by Category)
df <- df[order(df$n, decreasing = TRUE), ] #this could be skipped if data already ordered
df[n:nrow(df),1] <- "Other"
df <- aggregate(n ~ Category, data = df, FUN = "sum")
#preserves the order (without extra index)
df <- df[order(df$n, decreasing = TRUE), ] #this could be skipped if data already ordered
df[n:nrow(df),1] <- "Other"
df[n,2] <- sum(df$n[df$Category == "Other"])
df <- df[1:n,]
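The third approach above (keep the top rows, then put the summed remainder in one extra row) can also be wrapped in a small function. This is only a sketch in base R: the names collapse_other, top and label are made up for illustration and come from neither dplyr nor any other package, and it assumes the input has exactly the columns Category (character) and n (numeric) shown in the question.

```r
# Hypothetical helper (not part of any package): keep the `top` largest
# categories and sum all remaining rows into one `label` row placed last.
# Assumes columns named Category (character) and n (numeric).
collapse_other <- function(df, top = 5, label = "Other") {
  df <- as.data.frame(df)[, c("Category", "n")]  # plain data.frame, two columns
  df <- df[order(df$n, decreasing = TRUE), ]     # largest counts first
  if (nrow(df) > top) {
    keep <- df[seq_len(top), ]
    rest <- df[-seq_len(top), ]
    df <- rbind(
      keep,
      data.frame(Category = label, n = sum(rest$n), stringsAsFactors = FALSE)
    )
  }
  rownames(df) <- NULL                           # renumber rows 1..k
  df
}

# Same data as DF in the question, rebuilt without the tbl_df class
DF <- data.frame(
  Category = c("E", "K", "M", "L", "I", "A", "S", "G", "N", "Q"),
  n = c(163051, 127133, 106680, 64868, 49701,
        47387, 47096, 45601, 40056, 36882),
  stringsAsFactors = FALSE
)

result <- collapse_other(DF, top = 5)
print(result)
```

With the data from the question this returns six rows: E, K, M, L, I in decreasing order of n, followed by Other with n = 217022. Because the summed row is appended after sorting, no extra index column is needed to restore the order.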
A different package/different syntax version:
library(data.table)
dt = as.data.table(DF)
dt[order(-n), # your data is already sorted, so this does nothing for it
if (.BY[[1]]) .SD else list("Other", sum(n)),
by = 1:nrow(dt) <= 5][, !"nrow", with = F]
# Category n
#1: E 163051
#2: K 127133
#3: M 106680
#4: L 64868
#5: I 49701
#6: Other 217022
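On the part of the question about a more flexible way of choosing Other: one possibility, which is my own assumption about what "flexible" could mean here, is to keep categories until they cover a given share of the total instead of a fixed number of rows. A rough base R sketch follows; collapse_by_share, share and label are hypothetical names, and the input is again assumed to have the columns Category and n.

```r
# Hypothetical helper (not part of any package): keep the largest categories
# until they cover at least `share` of the total, sum the rest into `label`.
collapse_by_share <- function(df, share = 0.7, label = "Other") {
  df <- as.data.frame(df)[, c("Category", "n")]
  df <- df[order(df$n, decreasing = TRUE), ]     # largest counts first
  # share of the total already covered *before* each row is added
  covered_before <- (cumsum(df$n) - df$n) / sum(df$n)
  keep <- covered_before < share                 # rows needed to reach `share`
  if (!all(keep)) {
    df <- rbind(
      df[keep, ],
      data.frame(Category = label, n = sum(df$n[!keep]),
                 stringsAsFactors = FALSE)
    )
  }
  rownames(df) <- NULL
  df
}

# Same data as DF in the question
DF <- data.frame(
  Category = c("E", "K", "M", "L", "I", "A", "S", "G", "N", "Q"),
  n = c(163051, 127133, 106680, 64868, 49701,
        47387, 47096, 45601, 40056, 36882),
  stringsAsFactors = FALSE
)

by_share <- collapse_by_share(DF, share = 0.7)
print(by_share)
```

For this data the first five categories together cover just over 70% of the total, so share = 0.7 happens to give the same result as the top-5 versions above; a different share would keep more or fewer rows.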