After grouping by q, extract the maximum of each of two different columns, w and e
Input data:
q <- c(503,503,503,503,503,503,503,503,503,503,503,503,503,510,510,510,510,510,510,510,510,510,510,510,510,525,526,526)
w <- c(56,56,56,56,56,56,56,56,56,56,56,56,56,56,56,56,56,56,56,56,56,56,56,56,56,56,56,56)
e <- c(26,26,26,26,26,27,28,28,28,28,28,28,28,28,28,28,28,28,28,28,29,30,30,30,30,33,33,33)
r <- data.frame(q,w,e, stringsAsFactors = FALSE)
Code:
r %>% group_by(q) %>% slice(which.max(w & e))
My output:
q w e
<dbl> <dbl> <dbl>
1 503. 56. 26.
2 510. 56. 28.
3 525. 56. 33.
4 526. 56. 33.
Expected output:
q w e
1 503 56 28
2 510 56 30
3 525 56 33
4 526 56 33
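I would prefer to use %>% and slice as in the code above, rather than finding the maxima of w and e separately and then merging on q (I hope to avoid merge because my actual data is large, object.size ~2GB).
Here is a fast data.table solution that should scale well to a 2GB dataset.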
library(data.table)
dt <- data.table(r)
dt[, lapply(.SD, max, na.rm=TRUE), by=q ]
Result
q w e
1: 503 56 28
2: 510 56 30
3: 525 56 33
4: 526 56 33
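As a side note on why the attempt in the question returns the first row of every group: `w & e` is a logical AND, so each non-zero number is coerced to TRUE, and `which.max()` on an all-TRUE vector returns the index of the first element. The sketch below illustrates this in base R on three illustrative values taken from the 503 group (the variable names `flags`, `first_true`, and `idx_max_e` are made up for this example).

```r
# Three values from the q == 503 group of the question's data
w <- c(56, 56, 56)
e <- c(26, 27, 28)

# `&` is a logical AND: every non-zero number becomes TRUE
flags <- w & e

# which.max() returns the position of the FIRST maximum;
# for an all-TRUE vector that is always position 1
first_true <- which.max(flags)

# The position of the largest e is a different row
idx_max_e <- which.max(e)

print(flags)
print(first_true)
print(idx_max_e)
```

So `slice(which.max(w & e))` always keeps row 1 of each group, and even `slice(which.max(e))` would return a single existing row rather than the per-column maxima, which is why a summarising approach is used here.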
Benchmark
library(microbenchmark)
library(dplyr)
# Compare the three approaches on the sample data, 50 runs each
microbenchmark(
  data.table = dt[, lapply(.SD, max, na.rm = TRUE), by = q],
  dplyr1 = r %>% group_by(q) %>% summarise_all(max),
  base = do.call(rbind, by(r, r$q, function(x)
    data.frame(q = unique(x$q), w = max(x$w), e = max(x$e)))),
  times = 50
)
Result
Unit: microseconds
expr min lq mean median uq max neval
data.table 810.240 1060.267 1447.979 1192.107 1332.054 14260.901 50
dplyr1 1562.027 1686.613 1857.382 1759.574 1869.226 3617.279 50
base 1925.973 2088.107 2448.162 2226.986 2485.760 7395.837 50
On this 28-row sample, data.table is clearly the fastest, with a median of about 1.19 ms against 1.76 ms for dplyr and 2.23 ms for base R.
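The base entry in the benchmark builds one small data frame per group and binds them together. A more compact base R formulation, which was not timed above, is `aggregate()` with a formula. The sketch below rebuilds the question's data with `rep()` (same values, written more compactly) and assumes there are no missing values; note that the formula interface of `aggregate()` drops rows containing NA by default.

```r
# Same data as in the question, written compactly with rep()
q <- c(rep(503, 13), rep(510, 12), 525, 526, 526)
w <- rep(56, 28)
e <- c(rep(26, 5), 27, rep(28, 14), 29, rep(30, 4), rep(33, 3))
r <- data.frame(q, w, e, stringsAsFactors = FALSE)

# Per-group maximum of w and e, one row per distinct q
res <- aggregate(cbind(w, e) ~ q, data = r, FUN = max)
print(res)
```

This needs no extra packages, but how it compares with data.table on a 2GB dataset is untested here.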