Extract the maximum of two columns within each group

bea*_*111 2 r dplyr

After grouping by `q`, I want to extract the maximum value from each of two different columns, `w` and `e`.

Input data:

q <- c(503,503,503,503,503,503,503,503,503,503,503,503,503,510,510,510,510,510,510,510,510,510,510,510,510,525,526,526)
w <- c(56,56,56,56,56,56,56,56,56,56,56,56,56,56,56,56,56,56,56,56,56,56,56,56,56,56,56,56)
e <- c(26,26,26,26,26,27,28,28,28,28,28,28,28,28,28,28,28,28,28,28,29,30,30,30,30,33,33,33)
r <- data.frame(q,w,e, stringsAsFactors = FALSE)

Code:

r %>% group_by(q) %>% slice(which.max(w & e))

My output:

  q     w     e
 <dbl> <dbl> <dbl>
1  503.   56.   26.
2  510.   56.   28.
3  525.   56.   33.
4  526.   56.   33.

Expected output:

    q   w  e
1  503 56 28
2  510 56 30
3  525 56 33
4  526 56 33

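I would prefer a `%>%` `slice` pipeline like the code above, rather than finding the maximum of `w` and of `e` separately and then merging the results on `q`. I want to avoid `merge` because my actual data is large (`object.size` ~2 GB).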

Ran*_*man 5

Here is a fast data.table solution that should scale well to a 2 GB dataset.

library(data.table)
dt <- data.table(r)
dt[, lapply(.SD, max, na.rm=TRUE), by=q ]

Result

    q  w  e
1: 503 56 28
2: 510 56 30
3: 525 56 33
4: 526 56 33
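
For readers who want to stay in dplyr, the following is a minimal sketch of the same per-group maximum. It assumes dplyr 1.0.0 or later (for `across()` and `.groups`), and it rebuilds the question's input compactly with `rep()`. It also shows why the original `slice(which.max(w & e))` returned the first row of each group: `w & e` is a logical vector, and here every element is `TRUE`.

```r
library(dplyr)

# Same values as the question's input, written compactly.
r <- data.frame(
  q = c(rep(503, 13), rep(510, 12), 525, 526, 526),
  w = rep(56, 28),
  e = c(rep(26, 5), 27, rep(28, 14), 29, rep(30, 4), rep(33, 3))
)

# `w & e` coerces both columns to logical. Every value is non-zero, so the
# result is all TRUE and which.max() returns the first position.
first_true <- which.max(r$w & r$e)

# Per-group maximum of every non-grouping column, computed independently
# per column (the maxima need not come from the same row).
res <- r %>%
  group_by(q) %>%
  summarise(across(everything(), max), .groups = "drop")
```

Because each column is summarised on its own, no `merge` step is needed. This is a different operation from `slice`, which can only return rows that already exist in the data.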

Benchmark

library(microbenchmark)
library(dplyr)

microbenchmark(
  data.table = dt[, lapply(.SD, max, na.rm = TRUE), by = q],
  dplyr1 = r %>% group_by(q) %>% summarise_all(max),
  base = do.call(rbind, by(r, r$q, function(x)
    data.frame(q = unique(x$q), w = max(x$w), e = max(x$e)))),
  times = 50
)

Result

Unit: microseconds
       expr      min       lq     mean   median       uq       max neval
 data.table  810.240 1060.267 1447.979 1192.107 1332.054 14260.901    50
     dplyr1 1562.027 1686.613 1857.382 1759.574 1869.226  3617.279    50
       base 1925.973 2088.107 2448.162 2226.986 2485.760  7395.837    50

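data.table has the lowest median and mean times here, although its slowest single run (max) was the slowest of the three, and on a 28-row input all differences are on the order of a millisecond.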
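
Since memory is the concern with a ~2 GB data frame, one variation worth considering is `setDT()`, which converts a data frame to a data.table by reference instead of building a new object the way `data.table(r)` does. The sketch below is an assumption about what fits the asker's situation, not part of the original answer. It also names the columns to aggregate explicitly through `.SDcols`.

```r
library(data.table)

# Same values as the question's input, written compactly.
r <- data.frame(
  q = c(rep(503, 13), rep(510, 12), 525, 526, 526),
  w = rep(56, 28),
  e = c(rep(26, 5), 27, rep(28, 14), 29, rep(30, 4), rep(33, 3))
)

# Convert in place: after this call, r itself is a data.table.
setDT(r)

# Per-group maximum of the listed columns only.
res <- r[, lapply(.SD, max, na.rm = TRUE), by = q, .SDcols = c("w", "e")]
```

Note that `setDT()` modifies `r` itself, so any later code that expects a plain data frame would need `setDF(r)` to convert it back.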