Ed *_*les 58 aggregate r dplyr data.table
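I want to (1) group data by one variable (State), (2) within each group find the row with the minimum value of another variable (Employees), and (3) extract the entire row.
(1) and (2) are easy one-liners, and I feel (3) should be too, but I can't get it.
Here is a sample data set: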
> data
State Company Employees
1 AK A 82
2 AK B 104
3 AK C 37
4 AK D 24
5 RI E 19
6 RI F 118
7 RI G 88
8 RI H 42
data <- structure(list(State = structure(c(1L, 1L, 1L, 1L, 2L, 2L, 2L,
2L), .Label = c("AK", "RI"), class = "factor"), Company = structure(1:8, .Label = c("A",
"B", "C", "D", "E", "F", "G", "H"), class = "factor"), Employees = c(82L,
104L, 37L, 24L, 19L, 118L, 88L, 42L)), .Names = c("State", "Company",
"Employees"), class = "data.frame", row.names = c(NA, -8L))
Calculating the min by group is easy, using aggregate:
> aggregate(Employees ~ State, data, function(x) min(x))
State Employees
1 AK 24
2 RI 19
...or data.table:
> library(data.table)
> DT <- data.table(data)
> DT[ , list(Employees = min(Employees)), by = State]
State Employees
1: AK 24
2: RI 19
But how do I extract the entire row corresponding to these min values, i.e. also including Company in the result?
Señ*_*r O 50
Slightly more elegant:
library(data.table)
DT[ , .SD[which.min(Employees)], by = State]
State Company Employees
1: AK D 24
2: RI E 19
Slightly less elegant than using .SD, but a bit faster (for data with many groups):
DT[DT[ , .I[which.min(Employees)], by = State]$V1]
Also, just replace the expression which.min(Employees) with Employees == min(Employees) if your data set has multiple identical min values and you want to subset all of them.
See also "Subset by group with data.table".
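The difference between the two expressions only shows up when a group has ties. A minimal sketch, assuming a small made-up table (DT_ties and its values are hypothetical, not from the question) in which two AK companies share the minimum:

```r
library(data.table)

# Hypothetical data: companies C and D tie for the minimum in AK.
DT_ties <- data.table(
  State     = c("AK", "AK", "AK", "RI", "RI"),
  Company   = c("B", "C", "D", "E", "F"),
  Employees = c(104L, 24L, 24L, 19L, 118L)
)

# which.min() returns only the first minimum position in each group.
first_min <- DT_ties[DT_ties[, .I[which.min(Employees)], by = State]$V1]

# Employees == min(Employees) keeps every row tied for the group minimum.
all_min <- DT_ties[DT_ties[, .I[Employees == min(Employees)], by = State]$V1]
```

Here first_min should hold one row per state, while all_min should also keep the second tied AK row.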
ags*_*udy 46
A dplyr solution:
library(dplyr)
data %>%
group_by(State) %>%
slice(which.min(Employees))
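As a possible variant of the slice(which.min(...)) idea, dplyr 1.0.0 and later provide slice_min(), which expresses the same selection directly. This is a sketch rather than part of the original answer; data_example is a hypothetical copy of the question's data, and with_ties = FALSE is chosen here to mimic which.min() by keeping a single row per group:

```r
library(dplyr)

# Hypothetical re-creation of the question's sample data.
data_example <- data.frame(
  State     = rep(c("AK", "RI"), each = 4),
  Company   = LETTERS[1:8],
  Employees = c(82L, 104L, 37L, 24L, 19L, 118L, 88L, 42L)
)

# One row per State: the row with the smallest Employees value.
# with_ties = FALSE keeps only the first row when several share the minimum.
min_rows <- data_example %>%
  group_by(State) %>%
  slice_min(Employees, n = 1, with_ties = FALSE) %>%
  ungroup()
```

Leaving with_ties at its default (TRUE) would instead return every row tied for the minimum.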
Dav*_*urg 27
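Since this is Google's top hit, I thought I would add some additional options that I find useful. The idea is basically to order once by Employees and then just take the unique rows per State.
Either using data.table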
library(data.table)
unique(setDT(data)[order(Employees)], by = "State")
# State Company Employees
# 1: RI E 19
# 2: AK D 24
Alternatively, we can also order first and then subset .SD. Both of these operations are optimized in recent data.table versions: order appears to trigger data.table:::forderv, while .SD[1L] triggers GForce.
setDT(data)[order(Employees), .SD[1L], by = State, verbose = TRUE] # <- Added verbose
# order optimisation is on, i changed from 'order(...)' to 'forder(DT, ...)'.
# i clause present and columns used in by detected, only these subset: State
# Finding groups using forderv ... 0 sec
# Finding group sizes from the positions (can be avoided to save RAM) ... 0 sec
# Getting back original order ... 0 sec
# lapply optimization changed j from '.SD[1L]' to 'list(Company[1L], Employees[1L])'
# GForce optimized j to 'list(`g[`(Company, 1L), `g[`(Employees, 1L))'
# Making each group and running j (GForce TRUE) ... 0 secs
# State Company Employees
# 1: RI E 19
# 2: AK D 24
Or dplyr
library(dplyr)
data %>%
arrange(Employees) %>%
distinct(State, .keep_all = TRUE)
# State Company Employees
# 1 RI E 19
# 2 AK D 24
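Another interesting idea, borrowed from @Khashaas' great answer (with a small modification, mult = "first", in order to handle multiple matches), is to first find the minimum per group and then perform a binary join back. The advantage of this is that it uses both data.table's gmin function (which skips the evaluation overhead) and the binary join feature.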
tmp <- setDT(data)[, .(Employees = min(Employees)), by = State]
data[tmp, on = .(State, Employees), mult = "first"]
# State Company Employees
# 1: AK D 24
# 2: RI E 19
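To see what mult = "first" changes in the join above, here is a minimal sketch on a made-up table with a tie (DT_ties and its values are hypothetical). Without mult, the join returns every row of the table that matches a group minimum:

```r
library(data.table)

# Hypothetical data: companies C and D tie for the minimum in AK.
DT_ties <- data.table(
  State     = c("AK", "AK", "AK", "RI", "RI"),
  Company   = c("B", "C", "D", "E", "F"),
  Employees = c(104L, 24L, 24L, 19L, 118L)
)

# Step 1: the minimum per group.
tmp <- DT_ties[, .(Employees = min(Employees)), by = State]

# Step 2: join back. The default (mult = "all") returns all matching rows,
# mult = "first" returns only the first match for each row of tmp.
all_matches <- DT_ties[tmp, on = .(State, Employees)]
first_match <- DT_ties[tmp, on = .(State, Employees), mult = "first"]
```

With this data, all_matches should contain both tied AK rows, whereas first_match should contain one row per state.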
Some benchmarks
library(data.table)
library(dplyr)
library(plyr)
library(stringi)
library(microbenchmark)
set.seed(123)
N <- 1e6
data <- data.frame(State = stri_rand_strings(N, 2, '[A-Z]'),
Employees = sample(N*10, N, replace = TRUE))
DT <- copy(data)
setDT(DT)
DT2 <- copy(DT)
str(DT)
str(DT2)
microbenchmark("(data.table) .SD[which.min]: " = DT[ , .SD[which.min(Employees)], by = State],
"(data.table) .I[which.min]: " = DT[DT[ , .I[which.min(Employees)], by = State]$V1],
"(data.table) order/unique: " = unique(DT[order(Employees)], by = "State"),
"(data.table) order/.SD[1L]: " = DT[order(Employees), .SD[1L], by = State],
"(data.table) self join (on):" = {
tmp <- DT[, .(Employees = min(Employees)), by = State]
DT[tmp, on = .(State, Employees), mult = "first"]},
"(data.table) self join (setkey):" = {
tmp <- DT2[, .(Employees = min(Employees)), by = State]
setkey(tmp, State, Employees)
setkey(DT2, State, Employees)
DT2[tmp, mult = "first"]},
"(dplyr) slice(which.min): " = data %>% group_by(State) %>% slice(which.min(Employees)),
"(dplyr) arrange/distinct: " = data %>% arrange(Employees) %>% distinct(State, .keep_all = TRUE),
"(dplyr) arrange/group_by/slice: " = data %>% arrange(Employees) %>% group_by(State) %>% slice(1),
"(plyr) ddply/which.min: " = ddply(data, .(State), function(x) x[which.min(x$Employees),]),
"(base) by: " = do.call(rbind, by(data, data$State, function(x) x[which.min(x$Employees), ])))
# Unit: milliseconds
# expr min lq mean median uq max neval cld
# (data.table) .SD[which.min]: 119.66086 125.49202 145.57369 129.61172 152.02872 267.5713 100 d
# (data.table) .I[which.min]: 12.84948 13.66673 19.51432 13.97584 15.17900 109.5438 100 a
# (data.table) order/unique: 52.91915 54.63989 64.39212 59.15254 61.71133 177.1248 100 b
# (data.table) order/.SD[1L]: 51.41872 53.22794 58.17123 55.00228 59.00966 145.0341 100 b
# (data.table) self join (on): 44.37256 45.67364 50.32378 46.24578 50.69411 137.4724 100 b
# (data.table) self join (setkey): 14.30543 15.28924 18.63739 15.58667 16.01017 106.0069 100 a
# (dplyr) slice(which.min): 82.60453 83.64146 94.06307 84.82078 90.09772 186.0848 100 c
# (dplyr) arrange/distinct: 344.81603 360.09167 385.52661 379.55676 395.29463 491.3893 100 e
# (dplyr) arrange/group_by/slice: 367.95924 383.52719 414.99081 397.93646 425.92478 557.9553 100 f
# (plyr) ddply/which.min: 506.55354 530.22569 568.99493 552.65068 601.04582 727.9248 100 g
# (base) by: 1220.38286 1291.70601 1340.56985 1344.86291 1382.38067 1512.5377 100 h
The base function by is often useful for working with block data in data.frames. For example
by(data, data$State, function(x) x[which.min(x$Employees), ] )
It does return the data in a list, but you can collapse that with
do.call(rbind, by(data, data$State, function(x) x[which.min(x$Employees), ] ))
In base you can use ave to get the min per group, compare it with Employees, and use the resulting logical vector to subset the data.frame.
data[data$Employees == ave(data$Employees, data$State, FUN=min),]
# State Company Employees
#4 AK D 24
#5 RI E 19
Or do the comparison already inside the function.
data[as.logical(ave(data$Employees, data$State, FUN=function(x) x==min(x))),]
#data[ave(data$Employees, data$State, FUN=function(x) x==min(x))==1,] #Variant
# State Company Employees
#4 AK D 24
#5 RI E 19
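The first ave approach above is easier to follow when split into steps. A minimal sketch, assuming a hypothetical copy of the question's data named df; the intermediate names group_min and is_min are introduced here only for illustration:

```r
# Hypothetical re-creation of the question's sample data.
df <- data.frame(
  State     = rep(c("AK", "RI"), each = 4),
  Company   = LETTERS[1:8],
  Employees = c(82L, 104L, 37L, 24L, 19L, 118L, 88L, 42L)
)

# ave() returns a vector as long as its input, with each group's
# minimum repeated at the positions belonging to that group.
group_min <- ave(df$Employees, df$State, FUN = min)

# Rows whose Employees value equals their own group's minimum.
is_min <- df$Employees == group_min

# Subset the data.frame; tied minima, if any, are all kept.
min_rows <- df[is_min, ]
```

Because the comparison is done row by row, this approach keeps all tied rows, unlike which.min.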