为什么在Julia-Lang中进行更多的运算时，简化的数学方程式的运行（略）慢于其等价方程式？

Question

为什么在Julia-Lang中进行更多的运算时，简化的数学方程式的运行（略）慢于其等价方程式？

zyc*_*zyc 3 math performance-testing simplify julia

在C ++课程中，我学会了一些技巧，例如避免重复计算，使用更多的加法而不是更多的乘法，避免幂等以提高性能。但是，当我尝试使用Julia-Lang优化代码时，我对相反的结果感到惊讶。

例如，以下是一些未进行数学优化的方程式（所有代码都是用Julia 1.1编写的，不是JuliaPro编写的）：

function OriginalFunction( a,b,c,d,E )
    # Oprations' count:
    # sqrt: 4
    # ^: 14
    # * : 14
    # / : 10
    # +: 20
    # -: 6
    # = : 0+4
    x1 = (1/(1+c^2))*(-c*d+a+c*b-sqrt(E))
    y1 = d-(c^2*d)/(1+c^2)+(c*a)/(1+c^2)+(c^2*b)/(1+c^2)-(c*sqrt(E))/(1+c^2)

    x2 = (1/(1+c^2))*(-c*d+a+c*b+sqrt(E))
    y2 = d-(c^2*d)/(1+c^2)+(c*a)/(1+c^2)+(c^2*b)/(1+c^2)+(c*sqrt(E))/(1+c^2)

    return [ [x1;y1] [x2;y2] ]
end

Run Code Online (Sandbox Code Playgroud)

我用一些技巧优化了它们，包括：

(a*b + a*c) -> a*(b+c) 因为加法比乘法快。
a^2 -> a*a 避免电源操作。
如果长时间操作至少使用了两次，请将其分配给变量以避免重复计算。例如：

x = a * (1+c^2); y = b * (1+c^2)
->
temp = 1+c^2
x = a * temp; y = b * temp

Run Code Online (Sandbox Code Playgroud)

将Int转换为Float64，以便计算机不必在运行时或编译时执行此操作。例如：

1/x -> 1.0/x

结果给出了具有更少运算的等效方程式：

function SimplifiedFunction( a,b,c,d,E )
    # Oprations' count:
    # sqrt: 1
    # ^: 0
    # *: 9
    # /: 1
    # +: 4
    # -: 6
    # = : 5+4
    temp1 = sqrt(E)
    temp2 = c*(b - d) + a
    temp3 = 1.0/(1.0+c*c)
    temp4 = d - (c*(c*(d - b) - a))*temp3
    temp5 = (c*temp1)*temp3
    x1 = temp3*(temp2-temp1)
    y1 = temp4-temp5

    x2 = temp3*(temp2+temp1)
    y2 = temp4+temp5

    return [ [x1;y1] [x2;y2] ]
end

Run Code Online (Sandbox Code Playgroud)

然后，我使用以下功能对它们进行了测试，期望具有更少操作的版本能够更快或更有趣地运行：

function Test2Functions( NumberOfTests::Real )
    local num = Int(NumberOfTests)
    # -- Generate random numbers
    local rands = Array{Float64,2}(undef, 5,num)
    for i in 1:num
        rands[:,i:i] = [rand(); rand(); rand(); rand(); rand()]
    end

    local res1 = Array{Array{Float64,2}}(undef, num)
    local res2 = Array{Array{Float64,2}}(undef, num)
    # - Test OriginalFunction
    @time for i in 1:num
        a,b,c,d,E = rands[:,i]
        res1[i] = OriginalFunction( a,b,c,d,E )
    end
    # - Test SimplifiedFunction
    @time for i in 1:num
        a,b,c,d,E = rands[:,i]
        res2[i] = SimplifiedFunction( a,b,c,d,E )
    end
    return res1, res2
end
Test2Functions( 1e6 )

Run Code Online (Sandbox Code Playgroud)

但是，事实证明这两个函数使用相同数量的内存分配，但是简化的一个函数具有更多的垃圾回收时间，并且运行速度慢了大约5％：

julia> Test2Functions( 1e6 )
  1.778731 seconds (7.00 M allocations: 503.540 MiB, 47.35% gc time)
  1.787668 seconds (7.00 M allocations: 503.540 MiB, 50.92% gc time)

julia> Test2Functions( 1e6 )
  1.969535 seconds (7.00 M allocations: 503.540 MiB, 52.05% gc time)
  2.221151 seconds (7.00 M allocations: 503.540 MiB, 56.68% gc time)

julia> Test2Functions( 1e6 )
  1.946441 seconds (7.00 M allocations: 503.540 MiB, 55.23% gc time)
  2.099875 seconds (7.00 M allocations: 503.540 MiB, 59.33% gc time)

julia> Test2Functions( 1e6 )
  1.836350 seconds (7.00 M allocations: 503.540 MiB, 53.37% gc time)
  2.011242 seconds (7.00 M allocations: 503.540 MiB, 58.43% gc time)

julia> Test2Functions( 1e6 )
  1.856081 seconds (7.00 M allocations: 503.540 MiB, 53.44% gc time)
  2.002087 seconds (7.00 M allocations: 503.540 MiB, 58.21% gc time)

julia> Test2Functions( 1e6 )
  1.833049 seconds (7.00 M allocations: 503.540 MiB, 53.55% gc time)
  1.996548 seconds (7.00 M allocations: 503.540 MiB, 58.41% gc time)

julia> Test2Functions( 1e6 )
  1.846894 seconds (7.00 M allocations: 503.540 MiB, 53.53% gc time)
  2.053529 seconds (7.00 M allocations: 503.540 MiB, 58.30% gc time)

julia> Test2Functions( 1e6 )
  1.896265 seconds (7.00 M allocations: 503.540 MiB, 54.11% gc time)
  2.083253 seconds (7.00 M allocations: 503.540 MiB, 58.10% gc time)

julia> Test2Functions( 1e6 )
  1.910244 seconds (7.00 M allocations: 503.540 MiB, 53.79% gc time)
  2.085719 seconds (7.00 M allocations: 503.540 MiB, 58.36% gc time)

Run Code Online (Sandbox Code Playgroud)

谁能告诉我为什么？即使在某些性能关键的代码中，5％的速度可能也不值得争夺，但我仍然很好奇：如何帮助Julia编译器生成更快的代码？

Answer 1

Bog*_*ski 7

原因是您在第二个循环（而不是第一个循环）中遇到了垃圾回收。如果GC.gc()在循环之前执行此操作，则会得到更多可比较的结果：

function Test2Functions( NumberOfTests::Real )
    local num = Int(NumberOfTests)
    # -- Generate random numbers
    local rands = Array{Float64,2}(undef, 5,num)
    for i in 1:num
        rands[:,i:i] = [rand(); rand(); rand(); rand(); rand()]
    end

    local res1 = Array{Array{Float64,2}}(undef, num)
    local res2 = Array{Array{Float64,2}}(undef, num)
    # - Test OriginalFunction
    GC.gc()
    @time for i in 1:num
        a,b,c,d,E = rands[:,i]
        res1[i] = OriginalFunction( a,b,c,d,E )
    end
    # - Test SimplifiedFunction
    GC.gc()
    @time for i in 1:num
        a,b,c,d,E = rands[:,i]
        res2[i] = SimplifiedFunction( a,b,c,d,E )
    end
    return res1, res2
end

# call this twice as the first time you may have precompilation issues
Test2Functions( 1e6 )
Test2Functions( 1e6 )

Run Code Online (Sandbox Code Playgroud)

但是，通常进行基准测试最好使用BenchmarkTools.jl程序包。

julia> function OriginalFunction()
           a,b,c,d,E = rand(5)
           x1 = (1/(1+c^2))*(-c*d+a+c*b-sqrt(E))
           y1 = d-(c^2*d)/(1+c^2)+(c*a)/(1+c^2)+(c^2*b)/(1+c^2)-(c*sqrt(E))/(1+c^2)
           x2 = (1/(1+c^2))*(-c*d+a+c*b+sqrt(E))
           y2 = d-(c^2*d)/(1+c^2)+(c*a)/(1+c^2)+(c^2*b)/(1+c^2)+(c*sqrt(E))/(1+c^2)
           return [ [x1;y1] [x2;y2] ]
       end
OriginalFunction (generic function with 2 methods)

julia>

julia> function SimplifiedFunction()
           a,b,c,d,E = rand(5)
           temp1 = sqrt(E)
           temp2 = c*(b - d) + a
           temp3 = 1.0/(1.0+c*c)
           temp4 = d - (c*(c*(d - b) - a))*temp3
           temp5 = (c*temp1)*temp3
           x1 = temp3*(temp2-temp1)
           y1 = temp4-temp5
           x2 = temp3*(temp2+temp1)
           y2 = temp4+temp5
           return [ [x1;y1] [x2;y2] ]
       end
SimplifiedFunction (generic function with 2 methods)

julia>

julia> using BenchmarkTools

julia> @btime OriginalFunction()
  136.211 ns (7 allocations: 528 bytes)
2×2 Array{Float64,2}:
 -0.609035  0.954271
  0.724708  0.926523

julia> @btime SimplifiedFunction()
  137.201 ns (7 allocations: 528 bytes)
2×2 Array{Float64,2}:
 0.284514  1.58639
 0.922347  0.979835

julia> @btime OriginalFunction()
  137.301 ns (7 allocations: 528 bytes)
2×2 Array{Float64,2}:
 -0.109814  0.895533
  0.365399  1.08743

julia> @btime SimplifiedFunction()
  136.429 ns (7 allocations: 528 bytes)
2×2 Array{Float64,2}:
 0.516157  1.07871
 0.219441  0.361133

Run Code Online (Sandbox Code Playgroud)

而且我们看到它们具有可比的性能。通常，您可以期望Julia和LLVM编译器会为您完成大多数此类优化（当然，并非总是可以保证，但是在这种情况下，似乎确实发生了）。

编辑

我简化了如下功能：

function OriginalFunction( a,b,c,d,E )
    x1 = (1/(1+c^2))*(-c*d+a+c*b-sqrt(E))
    y1 = d-(c^2*d)/(1+c^2)+(c*a)/(1+c^2)+(c^2*b)/(1+c^2)-(c*sqrt(E))/(1+c^2)
    x2 = (1/(1+c^2))*(-c*d+a+c*b+sqrt(E))
    y2 = d-(c^2*d)/(1+c^2)+(c*a)/(1+c^2)+(c^2*b)/(1+c^2)+(c*sqrt(E))/(1+c^2)
    x1, y1, x2, y2
end

function SimplifiedFunction( a,b,c,d,E )
    temp1 = sqrt(E)
    temp2 = c*(b - d) + a
    temp3 = 1.0/(1.0+c*c)
    temp4 = d - (c*(c*(d - b) - a))*temp3
    temp5 = (c*temp1)*temp3
    x1 = temp3*(temp2-temp1)
    y1 = temp4-temp5
    x2 = temp3*(temp2+temp1)
    y2 = temp4+temp5
    x1, y1, x2, y2
end

Run Code Online (Sandbox Code Playgroud)

仅专注于计算的核心并进行计算@code_native。这是它们（删除注释以缩短它们）。

        .text
        pushq   %rbp
        movq    %rsp, %rbp
        subq    $112, %rsp
        vmovaps %xmm10, -16(%rbp)
        vmovaps %xmm9, -32(%rbp)
        vmovaps %xmm8, -48(%rbp)
        vmovaps %xmm7, -64(%rbp)
        vmovaps %xmm6, -80(%rbp)
        vmovsd  56(%rbp), %xmm8         # xmm8 = mem[0],zero
        vxorps  %xmm4, %xmm4, %xmm4
        vucomisd        %xmm8, %xmm4
        ja      L229
        vmovsd  48(%rbp), %xmm9         # xmm9 = mem[0],zero
        vmulsd  %xmm9, %xmm3, %xmm5
        vsubsd  %xmm5, %xmm1, %xmm5
        vmulsd  %xmm3, %xmm2, %xmm6
        vaddsd  %xmm5, %xmm6, %xmm10
        vmulsd  %xmm3, %xmm3, %xmm6
        movabsq $526594656, %rax        # imm = 0x1F633260
        vmovsd  (%rax), %xmm7           # xmm7 = mem[0],zero
        vaddsd  %xmm7, %xmm6, %xmm0
        vdivsd  %xmm0, %xmm7, %xmm7
        vsqrtsd %xmm8, %xmm8, %xmm4
        vsubsd  %xmm4, %xmm10, %xmm5
        vmulsd  %xmm5, %xmm7, %xmm8
        vmulsd  %xmm9, %xmm6, %xmm5
        vdivsd  %xmm0, %xmm5, %xmm5
        vsubsd  %xmm5, %xmm9, %xmm5
        vmulsd  %xmm3, %xmm1, %xmm1
        vdivsd  %xmm0, %xmm1, %xmm1
        vaddsd  %xmm5, %xmm1, %xmm1
        vmulsd  %xmm2, %xmm6, %xmm2
        vdivsd  %xmm0, %xmm2, %xmm2
        vaddsd  %xmm1, %xmm2, %xmm1
        vmulsd  %xmm3, %xmm4, %xmm2
        vdivsd  %xmm0, %xmm2, %xmm0
        vsubsd  %xmm0, %xmm1, %xmm2
        vaddsd  %xmm10, %xmm4, %xmm3
        vmulsd  %xmm3, %xmm7, %xmm3
        vaddsd  %xmm1, %xmm0, %xmm0
        vmovsd  %xmm8, (%rcx)
        vmovsd  %xmm2, 8(%rcx)
        vmovsd  %xmm3, 16(%rcx)
        vmovsd  %xmm0, 24(%rcx)
        movq    %rcx, %rax
        vmovaps -80(%rbp), %xmm6
        vmovaps -64(%rbp), %xmm7
        vmovaps -48(%rbp), %xmm8
        vmovaps -32(%rbp), %xmm9
        vmovaps -16(%rbp), %xmm10
        addq    $112, %rsp
        popq    %rbp
        retq
L229:
        movabsq $throw_complex_domainerror, %rax
        movl    $72381680, %ecx         # imm = 0x45074F0
        vmovapd %xmm8, %xmm1
        callq   *%rax
        ud2
        ud2
        nop

Run Code Online (Sandbox Code Playgroud)

和

        .text
        pushq   %rbp
        movq    %rsp, %rbp
        subq    $64, %rsp
        vmovaps %xmm7, -16(%rbp)
        vmovaps %xmm6, -32(%rbp)
        vmovsd  56(%rbp), %xmm0         # xmm0 = mem[0],zero
        vxorps  %xmm4, %xmm4, %xmm4
        vucomisd        %xmm0, %xmm4
        ja      L178
        vmovsd  48(%rbp), %xmm4         # xmm4 = mem[0],zero
        vsqrtsd %xmm0, %xmm0, %xmm0
        vsubsd  %xmm4, %xmm2, %xmm5
        vmulsd  %xmm3, %xmm5, %xmm5
        vaddsd  %xmm1, %xmm5, %xmm5
        vmulsd  %xmm3, %xmm3, %xmm6
        movabsq $526593928, %rax        # imm = 0x1F632F88
        vmovsd  (%rax), %xmm7           # xmm7 = mem[0],zero
        vaddsd  %xmm7, %xmm6, %xmm6
        vdivsd  %xmm6, %xmm7, %xmm6
        vsubsd  %xmm2, %xmm4, %xmm2
        vmulsd  %xmm3, %xmm2, %xmm2
        vsubsd  %xmm1, %xmm2, %xmm1
        vmulsd  %xmm3, %xmm1, %xmm1
        vmulsd  %xmm1, %xmm6, %xmm1
        vsubsd  %xmm1, %xmm4, %xmm1
        vmulsd  %xmm3, %xmm0, %xmm2
        vmulsd  %xmm2, %xmm6, %xmm2
        vsubsd  %xmm0, %xmm5, %xmm3
        vmulsd  %xmm3, %xmm6, %xmm3
        vsubsd  %xmm2, %xmm1, %xmm4
        vaddsd  %xmm5, %xmm0, %xmm0
        vmulsd  %xmm0, %xmm6, %xmm0
        vaddsd  %xmm1, %xmm2, %xmm1
        vmovsd  %xmm3, (%rcx)
        vmovsd  %xmm4, 8(%rcx)
        vmovsd  %xmm0, 16(%rcx)
        vmovsd  %xmm1, 24(%rcx)
        movq    %rcx, %rax
        vmovaps -32(%rbp), %xmm6
        vmovaps -16(%rbp), %xmm7
        addq    $64, %rsp
        popq    %rbp
        retq
L178:
        movabsq $throw_complex_domainerror, %rax
        movl    $72381680, %ecx         # imm = 0x45074F0
        vmovapd %xmm0, %xmm1
        callq   *%rax
        ud2
        ud2
        nopl    (%rax,%rax)

Run Code Online (Sandbox Code Playgroud)

可能您不想详细了解它，但是您会看到简化的功能使用的指令少了一些，但只使用了一些，如果您比较原始代码，这可能会令人惊讶。例如，两个代码sqrt仅调用一次（因此sqrt，对拳头功能的多次调用已优化）。

归档时间：	6 年，5 月前
查看次数：	63 次
最近记录：	6 年，5 月前