Zac*_*ach 8 sorting row r matrix
I have a large matrix:
set.seed(1)
a <- matrix(runif(9e+07),ncol=300)
I would like to sort each row of the matrix:
> system.time(sorted <- t(apply(a,1,sort)))
user system elapsed
42.48 3.40 45.88
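I have plenty of RAM to work with, but I would like a faster way to perform this operation.
Well, I don't know of many ways to sort faster in R, and the problem is that you are only sorting 300 values at a time, but many times over. Still, you can squeeze out some extra performance by calling sort.int directly and using method='quick':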
set.seed(1)
a <- matrix(runif(9e+07),ncol=300)
# Your original code
system.time(sorted <- t(apply(a,1,sort))) # 31 secs
# sort.int with method='quick'
system.time(sorted2 <- t(apply(a,1,sort.int, method='quick'))) # 27 secs
# using a for-loop is slightly faster than apply (and avoids transpose):
system.time({sorted3 <- a; for(i in seq_len(nrow(a))) sorted3[i,] <- sort.int(a[i,], method='quick') }) # 26 secs
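Before timing the full matrix, it can be worth confirming on a small one that the three variants above agree. This is only a sketch: a_small, s1, s2 and s3 are names introduced here for illustration, and the matrix is deliberately tiny.

```r
set.seed(1)
# A small stand-in for the 300000 x 300 matrix (10 rows only)
a_small <- matrix(runif(3000), ncol = 300)

# apply() returns one column per row of the input, hence the transpose
s1 <- t(apply(a_small, 1, sort))
s2 <- t(apply(a_small, 1, sort.int, method = "quick"))

# The for-loop variant writes into a copy, so no transpose is needed
s3 <- a_small
for (i in seq_len(nrow(a_small))) {
  s3[i, ] <- sort.int(a_small[i, ], method = "quick")
}

# identical() compares two objects at a time
stopifnot(identical(s1, s2), identical(s1, s3))
```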
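A better approach, though, should be to use the parallel package to sort parts of the matrix in parallel. However, the overhead of transferring the data seems to be too large, and on my machine it starts swapping, since I "only" have 8 GB of memory: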
library(parallel)
cl <- makeCluster(4)
system.time(sorted4 <- t(parApply(cl,a,1,sort.int, method='quick'))) # Forever...
stopCluster(cl)
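One variation that might reduce the transfer cost is to split the rows into one block per worker and send each worker only its own block. The sketch below is untested on the full-size matrix, so I cannot say whether it avoids the swapping described above; sort_block and sort_rows_chunked are hypothetical helper names, and the demo matrix is small.

```r
library(parallel)

# Defined at top level: a function whose environment is the global
# environment is sent to the workers without the objects around it.
sort_block <- function(block) {
  t(apply(block, 1, sort.int, method = "quick"))
}

# Hypothetical helper: sort each row of m, one block of rows per worker.
sort_rows_chunked <- function(m, n_workers = 2L) {
  n_workers <- min(n_workers, nrow(m))      # avoid empty blocks
  cl <- makeCluster(n_workers)
  on.exit(stopCluster(cl), add = TRUE)      # always shut the cluster down
  idx <- splitIndices(nrow(m), n_workers)   # contiguous row index blocks
  chunks <- lapply(idx, function(i) m[i, , drop = FALSE])
  do.call(rbind, parLapply(cl, chunks, sort_block))
}

set.seed(1)
m <- matrix(runif(6000), ncol = 300)
res <- sort_rows_chunked(m, n_workers = 2L)
stopifnot(identical(res, t(apply(m, 1, sort))))
```

The result still has to be copied back to the main process, so memory use peaks at several copies of the matrix either way.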
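Another excellent approach, from Martin Morgan, uses no external packages. It comes from "Fastest way to select i-th highest value from row and assign to a new column":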
matrix(a[order(row(a), a)], ncol=ncol(a), byrow=TRUE)
There is also an equivalent for sorting by column in the comments under the same link.
Timing code, using the same data as Craig:
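To see why the one-liner works: row(a) gives the row index of every element, so order(row(a), a) sorts all elements by row first and by value within each row. Indexing with that permutation yields the sorted rows laid end to end, and byrow=TRUE folds them back into a matrix. A tiny sketch follows; the column-wise version using col() is my own guess at the equivalent mentioned in the comments, not a quote from them.

```r
m <- matrix(c(3, 1, 8,
              2, 7, 5), nrow = 2, byrow = TRUE)

# Row index of each element, in the same (column-major) layout as m
row(m)

# Permutation that groups elements by row, ascending within each row
ord <- order(row(m), m)

# m[ord] is row 1 sorted followed by row 2 sorted; refill row by row
sorted_rows <- matrix(m[ord], ncol = ncol(m), byrow = TRUE)
stopifnot(identical(sorted_rows, t(apply(m, 1, sort))))

# Assumed column-wise analogue: group by column index instead
sorted_cols <- matrix(m[order(col(m), m)], nrow = nrow(m))
stopifnot(identical(sorted_cols, apply(m, 2, sort)))
```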
set.seed(1)
a <- matrix(runif(9e7),ncol=300)
use_for <- function(){
  sorted3 <- a
  for(i in seq_len(nrow(a)))
    sorted3[i,] <- sort.int(a[i,], method='quick')
  sorted3
}
microbenchmark::microbenchmark(times=3L,
t(apply(a,1,sort)),
t(apply(a,1,sort.int, method='quick')),
use_for(),
Rfast::rowSort(a),
t(apply(a,1,grr::sort2)),
mmtd=matrix(a[order(row(a), a)], ncol=ncol(a), byrow=TRUE)
)
Timings:
Unit: seconds
expr min lq mean median uq max neval
t(apply(a, 1, sort)) 24.233418 24.305339 24.389650 24.377260 24.467766 24.558272 3
t(apply(a, 1, sort.int, method = "quick")) 17.024010 17.156722 17.524487 17.289433 17.774726 18.260019 3
use_for() 13.384958 13.873367 14.131813 14.361776 14.505241 14.648705 3
Rfast::rowSort(a) 3.758765 4.607609 5.136865 5.456452 5.825914 6.195377 3
t(apply(a, 1, grr::sort2)) 9.810774 9.955199 10.310328 10.099624 10.560106 11.020587 3
mmtd 6.147010 6.177769 6.302549 6.208528 6.380318 6.552108 3
To give a more complete picture, here is another test on character data (Rfast::rowSort is excluded because it cannot handle the character class):
set.seed(1)
a <- matrix(sample(letters, 9e6, TRUE),ncol=300)
microbenchmark::microbenchmark(times=1L,
t(apply(a,1,sort)),
t(apply(a,1,sort.int, method='quick')),
use_for(),
#Rfast::rowSort(a),
t(apply(a,1,grr::sort2)),
mmtd=matrix(a[order(row(a), a, method="radix")], ncol=ncol(a), byrow=TRUE)
)
Timings:
Unit: milliseconds
expr min lq mean median uq max neval
t(apply(a, 1, sort)) 14848.4356 14848.4356 14848.4356 14848.4356 14848.4356 14848.4356 1
t(apply(a, 1, sort.int, method = "quick")) 15061.0993 15061.0993 15061.0993 15061.0993 15061.0993 15061.0993 1
use_for() 14144.1264 14144.1264 14144.1264 14144.1264 14144.1264 14144.1264 1
t(apply(a, 1, grr::sort2)) 1831.1429 1831.1429 1831.1429 1831.1429 1831.1429 1831.1429 1
mmtd 440.9158 440.9158 440.9158 440.9158 440.9158 440.9158 1
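One caveat on the character case: method="radix" always sorts character vectors in the C locale, whereas the default sort() uses the locale's collation, so the two can order strings differently when the data contains mixed case or accented characters (for plain lowercase letters this usually does not matter). The sketch below therefore compares like with like, using method="radix" on both sides; chars, by_order and by_apply are illustrative names.

```r
set.seed(1)
chars <- matrix(sample(letters, 3000, TRUE), ncol = 300)

# Whole-matrix approach: one radix ordering keyed on (row, value)
by_order <- matrix(chars[order(row(chars), chars, method = "radix")],
                   ncol = ncol(chars), byrow = TRUE)

# Row-by-row reference, also using radix so both sides sort in the C locale
by_apply <- t(apply(chars, 1, sort, method = "radix"))

stopifnot(identical(by_order, by_apply))
```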
Head to head:
set.seed(1)
a <- matrix(sample(letters, 9e7, TRUE),ncol=300)
microbenchmark::microbenchmark(times=1L,
t(apply(a,1,grr::sort2)),
mmtd=matrix(a[order(row(a), a, method="radix")], ncol=ncol(a), byrow=TRUE)
)
Timings:
Unit: seconds
expr min lq mean median uq max neval
t(apply(a, 1, grr::sort2)) 19.273225 19.273225 19.273225 19.273225 19.273225 19.273225 1
mmtd 3.854117 3.854117 3.854117 3.854117 3.854117 3.854117 1
R version:
R version 4.0.3 (2020-10-10)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 10 x64 (build 18363)
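The grr package contains an alternative sort method that can speed up this particular operation (I have reduced the matrix size so that the benchmark does not take forever):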
> set.seed(1)
> a <- matrix(runif(9e+06),ncol=300)
> microbenchmark::microbenchmark(sorted <- t(apply(a,1,sort))
+ ,sorted2 <- t(apply(a,1,sort.int, method='quick'))
+ ,sorted3 <- t(apply(a,1,grr::sort2)),times=3,unit='s')
Unit: seconds
expr min lq mean median uq max neval
sorted <- t(apply(a, 1, sort)) 1.7699799 1.865829 1.961853 1.961678 2.057790 2.153902 3
sorted2 <- t(apply(a, 1, sort.int, method = "quick")) 1.6162934 1.619922 1.694914 1.623551 1.734224 1.844898 3
sorted3 <- t(apply(a, 1, grr::sort2)) 0.9316073 1.003978 1.050569 1.076348 1.110049 1.143750 3
The difference becomes dramatic when the matrix contains characters:
> set.seed(1)
> a <- matrix(sample(letters,size = 9e6,replace = TRUE),ncol=300)
> microbenchmark::microbenchmark(sorted <- t(apply(a,1,sort))
+ ,sorted2 <- t(apply(a,1,sort.int, method='quick'))
+ ,sorted3 <- t(apply(a,1,grr::sort2)),times=3)
Unit: seconds
expr min lq mean median uq max neval
sorted <- t(apply(a, 1, sort)) 15.436045 15.479742 15.552009 15.523440 15.609991 15.69654 3
sorted2 <- t(apply(a, 1, sort.int, method = "quick")) 15.099618 15.340577 15.447823 15.581536 15.621925 15.66231 3
sorted3 <- t(apply(a, 1, grr::sort2)) 1.728663 1.733756 1.780737 1.738848 1.806774 1.87470 3
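All three results are identical. (Note that identical() compares only its first two arguments, and a third positional argument is matched to num.eq, so the call below directly verifies only sorted against sorted2; sorted3 needs its own identical(sorted, sorted3) call.)
> identical(sorted,sorted2,sorted3)
[1] TRUE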
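For checking any number of results at once, a small wrapper that compares every object against the first one may be handy. all_identical is a hypothetical helper name, not part of base R, and the vectors below are toy data rather than the benchmark matrices.

```r
# Hypothetical helper: TRUE only if every argument is identical() to the first
all_identical <- function(...) {
  xs <- list(...)
  all(vapply(xs[-1], identical, logical(1), xs[[1]]))
}

x <- c(1, 2, 3)
stopifnot(all_identical(x, x + 0, c(1, 2, 3)))
stopifnot(!all_identical(x, x, c(1, 2, 4)))
```

With the objects from the benchmark above, the call would be all_identical(sorted, sorted2, sorted3).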