What is wrong with my Julia loop / devectorized code?

use*_*278 4 vectorization julia

I am using Julia 1.0. Please consider the following code:

using LinearAlgebra
using Distributions

## create random data
const data = rand(Uniform(-1,2), 100000, 2)

function test_function_1(data)
    theta = [1 2]
    coefs = theta * data[:,1:2]'
    res   = coefs' .* data[:,1:2]
    return sum(res, dims = 1)'
end

function test_function_2(data)
    theta   = [1 2]
    sum_all = zeros(2)
    for i = 1:size(data)[1]
        sum_all .= sum_all + (theta * data[i,1:2])[1] *  data[i,1:2]
    end
    return sum_all
end

After the first run, I timed both functions:

julia> @time test_function_1(data)
  0.006292 seconds (16 allocations: 5.341 MiB)
2×1 Adjoint{Float64,Array{Float64,2}}:
 150958.47189289227
 225224.0374366073

julia> @time test_function_2(data)
  0.038112 seconds (500.00 k allocations: 45.777 MiB, 15.61% gc time)
2-element Array{Float64,1}:
 150958.4718928927
 225224.03743660534

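test_function_1 is significantly better in both allocations and speed, even though test_function_1 is not devectorized. I expected test_function_2 to perform better. Note that both functions do the same thing.

I have a hunch that this is because in test_function_2 I use sum_all .= sum_all + ..., but I am not sure why that would be a problem. Can I get a hint?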

Bog*_*ski 5

First, let me show how I would write your function if I wanted to use a loop:

function test_function_3(data)
    theta   = (1, 2)
    sum_all = zeros(2)
    for row in eachrow(data)
        sum_all .+= dot(theta, row) .*  row
    end
    return sum_all
end

Next, here is a benchmark comparison of the three options:

julia> @benchmark test_function_1($data)
BenchmarkTools.Trial: 
  memory estimate:  5.34 MiB
  allocs estimate:  16
  --------------
  minimum time:     1.953 ms (0.00% GC)
  median time:      1.986 ms (0.00% GC)
  mean time:        2.122 ms (2.29% GC)
  maximum time:     4.347 ms (8.00% GC)
  --------------
  samples:          2356
  evals/sample:     1

julia> @benchmark test_function_2($data)
BenchmarkTools.Trial: 
  memory estimate:  45.78 MiB
  allocs estimate:  500002
  --------------
  minimum time:     16.316 ms (7.44% GC)
  median time:      16.597 ms (7.63% GC)
  mean time:        16.845 ms (8.01% GC)
  maximum time:     34.050 ms (4.45% GC)
  --------------
  samples:          297
  evals/sample:     1

julia> @benchmark test_function_3($data)
BenchmarkTools.Trial: 
  memory estimate:  96 bytes
  allocs estimate:  1
  --------------
  minimum time:     777.204 μs (0.00% GC)
  median time:      791.458 μs (0.00% GC)
  mean time:        799.505 μs (0.00% GC)
  maximum time:     1.262 ms (0.00% GC)
  --------------
  samples:          6253
  evals/sample:     1

Next, you can be a bit faster still if you implement dot explicitly inside the loop:

julia> function test_function_4(data)
           theta   = (1, 2)
           sum_all = zeros(2)
           for row in eachrow(data)
               @inbounds sum_all .+= (theta[1]*row[1]+theta[2]*row[2]) .*  row
           end
           return sum_all
       end
test_function_4 (generic function with 1 method)

julia> @benchmark test_function_4($data)
BenchmarkTools.Trial: 
  memory estimate:  96 bytes
  allocs estimate:  1
  --------------
  minimum time:     502.367 μs (0.00% GC)
  median time:      502.547 μs (0.00% GC)
  mean time:        505.446 μs (0.00% GC)
  maximum time:     806.631 μs (0.00% GC)
  --------------
  samples:          9888
  evals/sample:     1

To understand the difference, let us look at this line of your code:

sum_all .= sum_all + (theta * data[i,1:2])[1] *  data[i,1:2]

Let us count the memory allocations you make in this expression:

sum_all .= 
    sum_all
    + # allocation of a new vector as a result of addition
    (theta
     *  # allocation of a new vector as a result of multiplication
     data[i,1:2] # allocation of a new vector via getindex
    )[1]
    * # allocation of a new vector as a result of multiplication
    data[i,1:2] # allocation of a new vector via getindex

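So you can see that you allocate five times in every iteration of the loop. Allocations are expensive. You can see this in the benchmark, which reports 500002 allocations in the process:

  • 1 allocation for sum_all
  • 1 allocation for theta
  • 500000 allocations in the loop (5 * 100000)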

In addition, indexing such as data[i,1:2] performs bounds checking, which also has a small cost (but it is negligible compared with the allocations).
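As a rough illustration of the allocation cost described above, the following sketch compares slicing a row (which copies it) with taking a view of it. The function names sum_rows_copy and sum_rows_view are hypothetical and not part of the original code, and the exact byte counts printed will depend on the Julia version.

```julia
# Hypothetical helpers; only the way the row is obtained differs.
function sum_rows_copy(data)
    acc = 0.0
    for i in axes(data, 1)
        row = data[i, 1:2]        # slicing copies the row into a new Vector
        acc += row[1] + row[2]
    end
    return acc
end

function sum_rows_view(data)
    acc = 0.0
    for i in axes(data, 1)
        row = @view data[i, 1:2]  # a view refers to the existing memory
        acc += row[1] + row[2]
    end
    return acc
end

data = rand(1000, 2)

# Call each function once first so that compilation is not measured.
sum_rows_copy(data); sum_rows_view(data)

println("copy: ", @allocated(sum_rows_copy(data)), " bytes")
println("view: ", @allocated(sum_rows_view(data)), " bytes")
```

The copying version should report an amount that grows with the number of rows, which is the same effect that produces the 500000 loop allocations counted above.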

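Now, in test_function_3 I use eachrow(data). This time I also get the rows of the data matrix, but they are returned as views (not new arrays), so no allocation happens inside the loop. Next, I use the dot function, again to avoid the allocation that was previously caused by matrix multiplication (I changed theta from a Matrix to a Tuple because dot is then slightly faster, but this is secondary). Finally, I write sum_all .+= dot(theta, row) .* row; in this case all operations are broadcast, so Julia can perform broadcast fusion (again, no allocation happens).

In test_function_4 I simply replace dot with an unrolled loop, since we know there are exactly two elements in the dot product. In fact, if you unroll everything completely and use @simd, it gets faster still: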
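The two ideas in the paragraph above, rows as views and a fused in-place update, can be seen on a tiny matrix. This is only a sketch: the matrix A and the accumulator name acc are made up for illustration, and eachrow requires Julia 1.1 or later (the question mentions Julia 1.0).

```julia
using LinearAlgebra  # for dot

A = [1.0 2.0; 3.0 4.0]

# Each element produced by eachrow is a view into A, not a copy,
# so writing through it changes A itself.
row = first(eachrow(A))
row[1] = 10.0
println(A[1, 1])          # the write through the view is visible in A

# Fused, in-place update: every operator is dotted, so the right-hand side
# is evaluated element by element and written directly into acc.
theta = (1, 2)
acc = zeros(2)
for r in eachrow(A)
    acc .+= dot(theta, r) .* r
end
println(acc)
```

Writing acc .= acc + dot(theta, r) * r instead would first build temporary vectors for the product and the sum and only then copy the result into acc, which is the pattern that makes test_function_2 allocate.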

julia> function test_function_5(data)
          theta   = (1, 2)
          s1 = 0.0
          s2 = 0.0
          @inbounds @simd for i in axes(data, 1)
               r1 = data[i, 1]
               r2 = data[i, 2]
               mul = theta[1]*r1 + theta[2]*r2
               s1 += mul * r1
               s2 += mul * r2
          end
          return [s1, s2]
       end
test_function_5 (generic function with 1 method)

julia> @benchmark test_function_5($data)
BenchmarkTools.Trial: 
  memory estimate:  96 bytes
  allocs estimate:  1
  --------------
  minimum time:     22.721 μs (0.00% GC)
  median time:      23.146 μs (0.00% GC)
  mean time:        24.306 μs (0.00% GC)
  maximum time:     100.109 μs (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     1

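So you can see that this way you are much faster than with test_function_1 (about 23 μs versus about 2 ms in the benchmarks above). Still, test_function_3 is relatively fast and it is fully generic, so normally I would write something like test_function_3 unless I really needed to be super fast and knew that the dimensions of my data were fixed and small.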
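When rewriting a function for speed, it may be worth checking that the fast version still agrees with the original. The sketch below does this for a vectorized and an unrolled variant. The names weighted_sum_vectorized and weighted_sum_unrolled are hypothetical, and the data is generated with plain rand as a stand-in for Uniform(-1,2) so that the example does not depend on Distributions.

```julia
# Stand-in data: uniform on [-1, 2), built without Distributions
# (an assumption made to keep the sketch dependency-free).
data = 3 .* rand(1000, 2) .- 1

theta = (1, 2)

# Vectorized form, the same computation as test_function_1.
function weighted_sum_vectorized(data, theta)
    coefs = data * collect(theta)         # one dot product per row
    return vec(sum(coefs .* data, dims = 1))
end

# Unrolled scalar loop, the same idea as test_function_5.
function weighted_sum_unrolled(data, theta)
    s1 = 0.0
    s2 = 0.0
    @inbounds for i in axes(data, 1)
        r1 = data[i, 1]
        r2 = data[i, 2]
        mul = theta[1] * r1 + theta[2] * r2
        s1 += mul * r1
        s2 += mul * r2
    end
    return [s1, s2]
end

a = weighted_sum_vectorized(data, theta)
b = weighted_sum_unrolled(data, theta)
println(a ≈ b)   # isapprox: equal up to floating-point rounding
```

The comparison uses ≈ rather than == because the two versions add the terms in a different order, so the last few digits can differ, as they do in the timings shown in the question.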