计算 Julia 中每对矩阵行之间的相关性的最有效方法是什么？

Question

计算 Julia 中每对矩阵行之间的相关性的最有效方法是什么？

假设我有一个名为mat：

\n

julia> mat = rand(1:10, 5, 3)\n5\xc3\x973 Matrix{Int64}:\n 2  4   3\n 5  3  10\n 5  7   5\n 9  5   7\n 4  9   6\n

Run Code Online (Sandbox Code Playgroud)\n

我想计算每对行之间的相关性mat（例如，cor(mat[1, :], mat[2, :]等等），最后得到一个相关矩阵。我为其编写了两个脚本，并且我将提供基准测试。然而，如果我能让它更快的话我会更高兴（因为我应该在大数据集上执行该过程，比如 2000x20 大小）。

\n

第一种方法

\n

一个非常简单的方法；首先，我创建一个初始化矩阵zeros，然后尝试用每对行上计算出的相关性来填充它。这不是一个好方法，因为我计算了两次必要的值（因为，例如，cor([mat[1, :], mat[3, :])等于cor([mat[3, :], mat[1, :])）：

\n

using Statistics\n\nfunction calc_corr(matrix::Matrix)\n    n::Int64 = size(matrix, 1)\n    corr_mat = zeros(Float64, n, n)\n\n    for (idx1, idx2)=Iterators.product(1:n, 1:n)\n        @inbounds corr_mat[idx1, idx2] = cor(\n            view(matrix, idx1, :),\n            view(matrix, idx2, :)\n        )\n    end\n\n    return corr_mat\nend\n

Run Code Online (Sandbox Code Playgroud)\n

第二种方法

\n

计算上三角部分，然后创建对称矩阵，以获得完整的相关矩阵：

\n

using LinearAlgebra\n\nfunction calc_corr2(matrix::Matrix)\n    n::Int64 = size(matrix, 1)\n    corr_mat = ones(Float64, n, n)\n\n    # find upper triangular indices\n    upper_triang_idx = findall(==(1), triu(ones(Int8, n, n), 1))\n\n    for (idx1, idx2)=Tuple.(upper_triang_idx)\n        @inbounds corr_mat[idx1, idx2] = cor(\n            view(matrix, idx1, :),\n            view(matrix, idx2, :)\n        )\n    end\n\n    corr_mat = Symmetric(corr_mat)\n    return corr_mat\nend\n

Run Code Online (Sandbox Code Playgroud)\n

标杆管理

\n

首先在我之前声明的一个小矩阵上（mat）：

\n

using BenchmarkTools\n\n@benchmark calc_corr($mat)\nBenchmarkTools.Trial: 10000 samples with 10 evaluations.\n Range (min \xe2\x80\xa6 max):  1.950 \xce\xbcs \xe2\x80\xa6   6.210 \xce\xbcs  \xe2\x94\x8a GC (min \xe2\x80\xa6 max): 0.00% \xe2\x80\xa6 0.00%\n Time  (median):     2.160 \xce\xbcs               \xe2\x94\x8a GC (median):    0.00%\n Time  (mean \xc2\xb1 \xcf\x83):   2.178 \xce\xbcs \xc2\xb1 289.600 ns  \xe2\x94\x8a GC (mean \xc2\xb1 \xcf\x83):  0.00% \xc2\xb1 0.00%\n\n  \xe2\x96\x87\xe2\x96\x87\xe2\x96\x82 \xe2\x96\x81 \xe2\x96\x85\xe2\x96\x88\xe2\x96\x86\xe2\x96\x82\xe2\x96\x81\xe2\x96\x81\xe2\x96\x83\xe2\x96\x86\xe2\x96\x85\xe2\x96\x83\xe2\x96\x81 \xe2\x96\x81\xe2\x96\x81\xe2\x96\x81\xe2\x96\x81                                      \xe2\x96\x82\n  \xe2\x96\x88\xe2\x96\x88\xe2\x96\x88\xe2\x96\x88\xe2\x96\x88\xe2\x96\x88\xe2\x96\x88\xe2\x96\x88\xe2\x96\x88\xe2\x96\x88\xe2\x96\x88\xe2\x96\x88\xe2\x96\x88\xe2\x96\x88\xe2\x96\x88\xe2\x96\x88\xe2\x96\x88\xe2\x96\x88\xe2\x96\x88\xe2\x96\x88\xe2\x96\x88\xe2\x96\x88\xe2\x96\x88\xe2\x96\x86\xe2\x96\x88\xe2\x96\x87\xe2\x96\x87\xe2\x96\x88\xe2\x96\x87\xe2\x96\x87\xe2\x96\x86\xe2\x96\x87\xe2\x96\x86\xe2\x96\x87\xe2\x96\x86\xe2\x96\x86\xe2\x96\x84\xe2\x96\x86\xe2\x96\x84\xe2\x96\x84\xe2\x96\x85\xe2\x96\x83\xe2\x96\x81\xe2\x96\x84\xe2\x96\x84\xe2\x96\x84\xe2\x96\x83\xe2\x96\x85\xe2\x96\x83\xe2\x96\x84\xe2\x96\x84\xe2\x96\x84\xe2\x96\x81\xe2\x96\x81\xe2\x96\x81\xe2\x96\x83\xe2\x96\x84\xe2\x96\x81\xe2\x96\x84 \xe2\x96\x88\n  1.95 \xce\xbcs      Histogram: log(frequency) by time      3.62 \xce\xbcs <\n\n Memory estimate: 256 bytes, allocs estimate: 1.\n\n# ---------------------------------------------------------------------\n\n@benchmark calc_corr2($mat)\nBenchmarkTools.Trial: 10000 samples with 10 evaluations.\n Range (min \xe2\x80\xa6 max):  1.220 \xce\xbcs \xe2\x80\xa6 773.080 \xce\xbcs  \xe2\x94\x8a GC (min \xe2\x80\xa6 max): 0.00% \xe2\x80\xa6 99.19%\n Time  (median):     1.420 \xce\xbcs               \xe2\x94\x8a GC (median):    0.00%\n Time  (mean \xc2\xb1 \xcf\x83):   1.698 \xce\xbcs \xc2\xb1   9.921 \xce\xbcs  \xe2\x94\x8a GC (mean \xc2\xb1 \xcf\x83):  8.15% \xc2\xb1  1.40%\n\n  \xe2\x96\x88          \n  \xe2\x96\x88\xe2\x96\x87\xe2\x96\x83\xe2\x96\x82\xe2\x96\x83\xe2\x96\x84\xe2\x96\x82\xe2\x96\x82\xe2\x96\x84\xe2\x96\x84\xe2\x96\x83\xe2\x96\x82\xe2\x96\x83\xe2\x96\x83\xe2\x96\x82\xe2\x96\x82\xe2\x96\x82\xe2\x96\x82\xe2\x96\x81\xe2\x96\x81\xe2\x96\x81\xe2\x96\x81\xe2\x96\x81\xe2\x96\x81\xe2\x96\x81\xe2\x96\x81\xe2\x96\x81\xe2\x96\x81\xe2\x96\x81\xe2\x96\x81\xe2\x96\x81\xe2\x96\x81\xe2\x96\x81\xe2\x96\x81\xe2\x96\x81\xe2\x96\x81\xe2\x96\x81\xe2\x96\x81\xe2\x96\x81\xe2\x96\x81\xe2\x96\x81\xe2\x96\x81\xe2\x96\x81\xe2\x96\x81\xe2\x96\x81\xe2\x96\x81\xe2\x96\x81\xe2\x96\x81\xe2\x96\x81\xe2\x96\x81\xe2\x96\x81\xe2\x96\x81\xe2\x96\x81\xe2\x96\x81\xe2\x96\x81\xe2\x96\x81\xe2\x96\x81\xe2\x96\x81\xe2\x96\x81 \xe2\x96\x81\n  1.22 \xce\xbcs         Histogram: frequency by time        3.98 \xce\xbcs <\n\n Memory estimate: 976 bytes, allocs estimate: 7.\n

Run Code Online (Sandbox Code Playgroud)\n

测试结果是否相同：

\n

julia> calc_corr(mat) == calc_corr2(mat)\ntrue\n

Run Code Online (Sandbox Code Playgroud)\n

在大矩阵上：

\n

test_mat = rand(1:10, 2_000, 20);\n\n@benchmark calc_corr($test_mat)\nBenchmarkTools.Trial: 8 samples with 1 evaluation.\n Range (min \xe2\x80\xa6 max):  632.258 ms \xe2\x80\xa6 680.094 ms  \xe2\x94\x8a GC (min \xe2\x80\xa6 max): 0.33% \xe2\x80\xa6 1.30%\n Time  (median):     646.215 ms               \xe2\x94\x8a GC (median):    0.16%\n Time  (mean \xc2\xb1 \xcf\x83):   650.096 ms \xc2\xb1  16.089 ms  \xe2\x94\x8a GC (mean \xc2\xb1 \xcf\x83):  0.49% \xc2\xb1 0.60%\n\n  \xe2\x96\x81 \xe2\x96\x81           \xe2\x96\x81  \xe2\x96\x88       \xe2\x96\x81                 \xe2\x96\x81                \xe2\x96\x81  \n  \xe2\x96\x88\xe2\x96\x81\xe2\x96\x88\xe2\x96\x81\xe2\x96\x81\xe2\x96\x81\xe2\x96\x81\xe2\x96\x81\xe2\x96\x81\xe2\x96\x81\xe2\x96\x81\xe2\x96\x81\xe2\x96\x81\xe2\x96\x81\xe2\x96\x88\xe2\x96\x81\xe2\x96\x81\xe2\x96\x88\xe2\x96\x81\xe2\x96\x81\xe2\x96\x81\xe2\x96\x81\xe2\x96\x81\xe2\x96\x81\xe2\x96\x81\xe2\x96\x88\xe2\x96\x81\xe2\x96\x81\xe2\x96\x81\xe2\x96\x81\xe2\x96\x81\xe2\x96\x81\xe2\x96\x81\xe2\x96\x81\xe2\x96\x81\xe2\x96\x81\xe2\x96\x81\xe2\x96\x81\xe2\x96\x81\xe2\x96\x81\xe2\x96\x81\xe2\x96\x81\xe2\x96\x81\xe2\x96\x88\xe2\x96\x81\xe2\x96\x81\xe2\x96\x81\xe2\x96\x81\xe2\x96\x81\xe2\x96\x81\xe2\x96\x81\xe2\x96\x81\xe2\x96\x81\xe2\x96\x81\xe2\x96\x81\xe2\x96\x81\xe2\x96\x81\xe2\x96\x81\xe2\x96\x81\xe2\x96\x81\xe2\x96\x88 \xe2\x96\x81\n  632 ms           Histogram: frequency by time          680 ms <\n\n Memory estimate: 30.52 MiB, allocs estimate: 2.\n\n# ---------------------------------------------------------------------\n\n@benchmark calc_corr2($test_mat)\nBenchmarkTools.Trial: 14 samples with 1 evaluation.\n Range (min \xe2\x80\xa6 max):  351.040 ms \xe2\x80\xa6 396.431 ms  \xe2\x94\x8a GC (min \xe2\x80\xa6 max): 2.58% \xe2\x80\xa6 1.81%\n Time  (median):     357.403 ms               \xe2\x94\x8a GC (median):    2.86%\n Time  (mean \xc2\xb1 \xcf\x83):   360.863 ms \xc2\xb1  11.661 ms  \xe2\x94\x8a GC (mean \xc2\xb1 \xcf\x83):  2.75% \xc2\xb1 0.80%\n\n    \xe2\x96\x88           \n  \xe2\x96\x87\xe2\x96\x81\xe2\x96\x88\xe2\x96\x87\xe2\x96\x81\xe2\x96\x87\xe2\x96\x87\xe2\x96\x81\xe2\x96\x81\xe2\x96\x81\xe2\x96\x87\xe2\x96\x81\xe2\x96\x81\xe2\x96\x81\xe2\x96\x87\xe2\x96\x87\xe2\x96\x81\xe2\x96\x87\xe2\x96\x81\xe2\x96\x81\xe2\x96\x87\xe2\x96\x81\xe2\x96\x87\xe2\x96\x81\xe2\x96\x81\xe2\x96\x81\xe2\x96\x81\xe2\x96\x81\xe2\x96\x81\xe2\x96\x81\xe2\x96\x81\xe2\x96\x81\xe2\x96\x81\xe2\x96\x81\xe2\x96\x81\xe2\x96\x81\xe2\x96\x81\xe2\x96\x81\xe2\x96\x81\xe2\x96\x81\xe2\x96\x81\xe2\x96\x81\xe2\x96\x81\xe2\x96\x81\xe2\x96\x81\xe2\x96\x81\xe2\x96\x81\xe2\x96\x81\xe2\x96\x81\xe2\x96\x81\xe2\x96\x81\xe2\x96\x81\xe2\x96\x81\xe2\x96\x81\xe2\x96\x81\xe2\x96\x81\xe2\x96\x81\xe2\x96\x81\xe2\x96\x81\xe2\x96\x81\xe2\x96\x87 \xe2\x96\x81\n  351 ms           Histogram: frequency by time          396 ms <\n\n Memory estimate: 99.63 MiB, allocs estimate: 14.\n

Run Code Online (Sandbox Code Playgroud)\n

目前内存不是我主要关心的问题，我正在寻找一种方法来使这个过程更快、更优化。如果你创建一个像 10_000x100 这样更大的矩阵（所以内存），速度会很烦人。因此，我正在寻找任何可以帮助我提高此过程速度的建议。

\n

Answer 1

Prz*_*fel 5

不要重新发明轮子，只要做

cor(test_mat, dims=2)

Run Code Online (Sandbox Code Playgroud)

这比您的代码快得多。

设置：

test_mat = rand(1:10, 2_000, 20)

Run Code Online (Sandbox Code Playgroud)

现在基准测试：


julia> @btime calc_corr2($test_mat);
  709.354 ms (16 allocations: 99.63 MiB)

julia> @btime cor(test_mat, dims=2);
  52.679 ms (16 allocations: 30.85 MiB)

Run Code Online (Sandbox Code Playgroud)

别担心，我们一直在学习！ (2认同)

归档时间：	3 年前
查看次数：	124 次
最近记录：	3 年前