Julia - very high-performance matrix computation -> efficient memory management

bry*_*y-J 5 math numeric matrix julia

I have a computational question about matrix multiplication in Julia. In fact, I only started learning Julia today; before that I mostly used Python, NumPy, etc.


I have a 6x6 matrix A that I want to multiply with an existing initial vector x_1, then add another vector a_1 to the product. I then use the output vector x_2 as the input for the same operation with the same matrix A but a different vector a_2, and so on.


That is:

  1. iteration: x_1
  2. iteration: x_2 = A x_1 + a_1
  3. iteration: x_3 = A x_2 + a_2
  4. iteration: ...
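In Julia, this recurrence can be sketched as follows; this is a minimal illustration with random placeholder data, not the actual matrix or offsets:

```julia
using LinearAlgebra

# Minimal sketch of the recurrence x_{k+1} = A*x_k + a_k.
# A, x0 and the offsets are random placeholders here.
function iterate_naive(A, x0, offsets)
    x = copy(x0)
    for a in offsets
        x = A * x + a   # naive version: allocates a fresh vector every step
    end
    return x
end

A  = rand(BigFloat, 6, 6)
x0 = BigFloat[0, 0, 0, 0, big"1e-21", big"1e-23"]
offsets = [zeros(BigFloat, 6) for _ in 1:10]  # stand-ins for the a_k
x  = iterate_naive(A, x0, offsets)
```

Written this way, every iteration allocates temporary vectors, which is exactly the allocation pressure discussed below.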

The objects involved are very small, but the problem is that I want to perform this operation very many times: roughly ~10^11 to ~10^13 times. Moreover, I need very high numerical precision, far higher than Float64 provides.


I was able to write a straightforward approach that does what I want and reaches the precision I want. Following good/mature Python practice, I predefined my arrays and fill them with computed values as I go. However, I have the feeling that I am bottlenecked by inefficient memory management, as the number of allocations seems to indicate.


Every now and then (say, every 1000 iterations) I want to save the vector x_i of the current iteration into another array m, which is the output of the function and my result.
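That saving scheme can be sketched like this (hypothetical names; `saveinterval` stands in for the 1000 iterations mentioned above, and the actual update of x is elided):

```julia
# Sketch: save every saveinterval-th state vector as a column of m
saveinterval = 1000
numevals = 10^4                            # small stand-in for the real iteration count
m = zeros(6, div(numevals, saveinterval))  # one column per saved state
x = rand(6)

for k in 1:numevals
    # ... the update of x would happen here ...
    if mod(k, saveinterval) == 0
        m[:, div(k, saveinterval)] .= x    # in-place column write, no allocation
    end
end
```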


I figured it would be efficient to replace the values of the vector x with the result of the i-th iteration A x + a_i. Apparently this in-place operation can be handled by the mul!(Y, A, B) -> Y function from the LinearAlgebra package, which uses BLAS. I have also looked into the StaticArrays package, but I am not sure whether it helps here.
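For reference, the in-place pattern with mul! looks like this (plain Float64 used here just for illustration):

```julia
using LinearAlgebra

A = [1.0 2.0; 3.0 4.0]
x = [1.0, 1.0]
a = [0.5, 0.5]
y = zeros(2)

mul!(y, A, x)  # writes A*x into the preallocated y; no new vector is created
y .+= a        # in-place broadcast addition of the offset
# y is now [3.5, 7.5]
```

Note that with BigFloat elements every scalar operation still allocates a fresh BigFloat, so mul! removes the vector allocations but not the per-element ones, which is likely why the allocation count stays high.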


I do not see any computational speed-up or reduction in memory allocations. Can I optimize my code somehow? Note: for simplicity, in this example I use the same vector a in every iteration.

using CSV
using DataFrames
using Arblib
using LinearAlgebra

setprecision(350) # this is required for the high accuracy

function compute() # global function wrapper, I thought this would be better
    dff = rand(BigFloat,(6,6)); # I would usually import a specific external matrix
    df2 = Matrix(dff)

    vecci = Array{BigFloat}([0,0,0,0,1e-21,1e-23]); # this will be my offset vector a
    x1 = Array{BigFloat}([0,0,0,0,1e-21,1e-23]); # this will be my input vector x0

    function fold(mat, x, a) # here it goes
        numevals = Int(1e6); # here I would put 1e11

        numsaves = 10^(Int(log10(numevals))-3) # three orders of magnitude less than numevals; the number of iterations I want to save in my array
        m = zeros(BigFloat, 6, Int(numevals/numsaves)); # my output array where I store the x_i

        lastone = zeros(BigFloat, 6);
        newone = zeros(BigFloat, 6);

        m[1:6] .= x; # first iteration
        lastone .= x; # used as input for the computation

        for k in 1:(numevals-2)
            mul!(newone, mat, lastone); # replaces current newone with the result of the matrix multiplication
            newone .+= a; # adds the offset vector
            lastone .= newone; # prepares for next iteration

            if mod(k, numsaves) == 0 # every 1e3 iterations it stores into the output array
                m[1+6*Int(k/numsaves):6+6*Int(k/numsaves)] .= newone
            end
        end
        return m
    end

    @time out2 = fold(df2, x1, vecci)
end

compute()

The output is, for example:

12.948946 seconds (204.00 M allocations: 11.399 GiB, 13.52% gc time)
6×1000 Matrix{BigFloat}:
 0.0      7.03814e+465  1.93614e+953  5.3262e+1440   1.4652e+1928   4.03066e+2415  1.10881e+2903  …  4.9002e+484980   1.34801e+485468  3.70828e+485955  1.02012e+486443  2.80629e+486930
 0.0      6.45989e+465  1.77707e+953  4.8886e+1440   1.34482e+1928  3.6995e+2415   1.01771e+2903     4.4976e+484980   1.23726e+485468  3.40361e+485955  9.3631e+486442   2.57572e+486930
 0.0      8.17893e+465  2.24997e+953  6.1895e+1440   1.70269e+1928  4.68398e+2415  1.28853e+2903     5.69445e+484980  1.5665e+485468   4.30935e+485955  1.18547e+486443  3.26115e+486930
 0.0      6.71808e+465  1.8481e+953   5.08399e+1440  1.39857e+1928  3.84737e+2415  1.05838e+2903     4.67736e+484980  1.28671e+485468  3.53965e+485955  9.73733e+486442  2.67867e+486930
 1.0e-21  7.55747e+465  2.07901e+953  5.7192e+1440   1.57331e+1928  4.32808e+2415  1.19062e+2903     5.26177e+484980  1.44748e+485468  3.98191e+485955  1.0954e+486443   3.01336e+486930
 1.0e-23  8.44558e+465  2.32332e+953  6.39129e+1440  1.7582e+1928   4.83669e+2415  1.33054e+2903  …  5.88011e+484980  1.61758e+485468  4.44984e+485955  1.22412e+486443  3.36747e+486930

Edit:


Thanks to the comments, I managed to update the code using type conversion and the MultiFloats package instead of BigFloat. This makes it about 3x faster, but also loses quite a bit of accuracy. However, the number of allocations and the required memory went down a lot, which is already a big improvement!
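Assuming the MultiFloats package is installed, converting an imported high-precision matrix might look like this (a sketch only; `A_big` stands in for the actual imported data):

```julia
using MultiFloats

setprecision(350)             # precision of the BigFloat source data
A_big = rand(BigFloat, 6, 6)  # stand-in for the imported high-precision matrix
A_mf  = Float64x5.(A_big)     # elementwise conversion to the faster Float64x5 type
```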

using CSV
using DataFrames
using MultiFloats
using LinearAlgebra
using Arblib

setprecision(350) # this is required for the high accuracy OF THE IMPORT ONLY

function compute() # global function wrapper, I thought this would be better
    dff = rand(Float64x5,(6,6)); # I would usually import a specific external matrix
    df1 = Array{Float64x5}(dff)
    df2 = Matrix(df1)

    vecci = Array{Float64x5}([0,0,0,0,1e-21,1e-23]); # this will be my offset vector a
    x1 = Array{Float64x5}([0,0,0,0,1e-21,1e-23]); # this will be my input vector x0

    function fold(mat::Matrix{MultiFloat{Float64, 5}}, x::Vector{MultiFloat{Float64, 5}}, a::Vector{MultiFloat{Float64, 5}}) # here it goes
        numevals = Int(1e6); # here I would put 1e11

        numsaves = 10^(Int(log10(numevals))-3) # three orders of magnitude less than numevals; the number of iterations I want to save in my array
        m = zeros(Float64x5, 6, Int(numevals/numsaves)); # my output array where I store the x_i

        lastone = zeros(Float64x5, 6);
        newone = zeros(Float64x5, 6);

        m[1:6] .= x; # first iteration
        lastone .= x; # used as input for the computation

        for k in 1:(numevals-2)
            newone .= mat*lastone + a;
            lastone .= newone; # prepares for next iteration

            if mod(k, numsaves) == 0 # every 1e3 iterations it stores into the output array
                #print(newone)
                m[1+6*Int(k/numsaves):6+6*Int(k/numsaves)] .= newone
            end
        end
        return m
    end

    @time out2 = fold(df2, x1, vecci)
end

compute()

The output is now

2.898294 seconds (2.00 M allocations: 580.139 MiB, 4.80% gc time)

Edit 2:


After several attempts and discussions with colleagues, I was able to reduce the number of allocations considerably by using static arrays. I also paid attention to type conversions and preallocation. However, compared to the second code example this did not lead to a speed-up; instead, the code is 25% slower. Still, since the memory consumption is much lower, I will stick with it.


Furthermore, for my use case I will probably run a convergence study to find the best trade-off between precision and speed when using high-precision floats. With regular floats, a single matrix multiplication plus vector addition takes about 140 ns, which is probably the fastest it gets in this example. With precision only a few times higher (e.g. Float64x5), the operation already takes about 4 µs, which is crazy.


I highly recommend the TimerOutputs package for finding bottlenecks in your code.
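Assuming TimerOutputs is installed, basic usage looks roughly like this (the section labels are arbitrary):

```julia
using TimerOutputs
using LinearAlgebra

const to = TimerOutput()

A = rand(6, 6); x = rand(6); y = zeros(6)
for _ in 1:1000
    @timeit to "matvec" mul!(y, A, x)   # each labeled section is timed separately
    @timeit to "offset" y .+= x
end
print_timer(to)  # prints a table of time and allocations per labeled section
```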


DNF*_*DNF 1

Your code just returns a bunch of NaNs, but if I reduce the interval between saves, I can tell that both versions return the same values before running into the NaNs.

The bottom line is that I don't think you can improve performance much as long as you use MultiFloats. The code below reduces allocations very significantly, but only improves speed slightly. I tried StaticArrays, which made it run slower, and I tried Octavian.jl/LoopVectorization.jl, which do not play well with MultiFloats. On my laptop, multiplying two Float64x5 values takes about 32 ns, and a 6x6 matrix-vector product takes roughly 6^2 times as long. Apparently there is no parallel speed-up, no SIMD effect, no multithreading to exploit. Basically, you would need a higher-precision, SIMD-friendly float, and I don't know of one.

But I made some improvements to your code. Comments:

  • dff is already a matrix. I don't know why you create df1 and df2; they look redundant.
  • Float64x5(1e-23) is inaccurate, because precision is lost before the conversion. Float64x5("1e-23") avoids that problem.
  • I made fold more generic; it now accepts different array types and element types. (Also, I don't like nested functions, so I moved it out.)
  • Use integer arithmetic instead of converting floats to Int, e.g. div(a, b) instead of Int(a/b), and write 10^6 instead of Int(1e6).
  • When indexing into the matrix m, use matrix indexing instead of awkward linear indexing.
  • Use mul! to avoid allocations.
  • The line lastone, newone = newone, lastone swaps the two vectors at zero cost, without copying any data.
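A few of these points can be checked in isolation; BigFloat is used here since the same parsing pitfall applies to any high-precision type:

```julia
# The literal 1e-23 is rounded to Float64 before the high-precision type sees it;
# parsing the string avoids that round-off.
BigFloat(1e-23) == BigFloat("1e-23")  # false

# Integer arithmetic stays exact and never leaves the integers.
div(10^6, 1000)  # 1000, same result as Int(10^6 / 1000) without the float round trip

# Swapping two bindings exchanges references only; no element is copied.
u, v = [1.0], [2.0]
u, v = v, u
# now u == [2.0] and v == [1.0]
```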

...

function compute()
    dff = rand(Float64x5, 6, 6)
    vecci = Float64x5.(["0", "0", "0", "0", "1e-21", "1e-23"])
    x1 = copy(vecci) # this will be my input vector x0
    return fold(dff, x1, vecci)
end

function fold(mat::AbstractMatrix{T}, x::AbstractVector{T}, a::AbstractVector{T}) where {T<:Number}
    numevals = 10^6
    saveinterval = 1000
    numsaves = div(numevals, saveinterval)
    m = zeros(T, length(a), numsaves)
       
    lastone = copy(x)
    newone = zero(x)
    m[:, 1] .= x
    
    for k in 1:(numevals-2)
        # >95% of the time is spent on the next two lines, though they have zero allocations.
        mul!(newone, mat, lastone)
        newone .+= a
        (d, r) = divrem(k, saveinterval)
        if r == 0
            m[:, d+1] .= newone
        end
        lastone, newone = newone, lastone # prepares for next iteration
    end
    return m
end