CUDA example in Julia does not use the GPU

Question

I am taking my first steps running Julia 1.6.5 code on a GPU. For some reason, the GPU does not seem to be used at all. These are the steps:

First, my GPU passed the test recommended in the CUDA Julia docs:

# install the package
using Pkg
Pkg.add("CUDA")
  
# smoke test (this will download the CUDA toolkit)
using CUDA
CUDA.versioninfo()

using Pkg
Pkg.test("CUDA")    # takes ~40 minutes if using 1 thread

Second, the code below takes about 8 minutes (real time) to run on my GPU. It creates two 10000 x 10000 matrices and multiplies them, repeated 10 times:

using CUDA
using Random
N = 10000

a_d = CuArray{Float32}(undef, (N, N))
b_d = CuArray{Float32}(undef, (N, N))
c_d = CuArray{Float32}(undef, (N, N))

for i in 1:10
    global a_d = randn(N, N)
    global b_d = randn(N, N)

    global c_d = a_d * b_d
end

global a_d = nothing
global b_d = nothing
global c_d = nothing
GC.gc()

The result in the terminal is as follows:

(base) ciro@ciro-G3-3500:~/projects/julia/cuda$ time julia cuda-gpu.jl

real    8m13,016s
user    50m39,146s
sys 13m16,766s

Then I ran the equivalent code on the CPU. The execution time is about the same:

using Random
N = 10000

for i in 1:10
    a = randn(N, N)
    b = randn(N, N)

    c = a * b
end

Execution:

(base) ciro@ciro-G3-3500:~/projects/julia/cuda$ time julia cuda-cpu.jl

real    8m2,689s 
user    50m9,567s 
sys 13m3,738s

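Moreover, watching the NVTOP screen, it is strange to see the GPU memory and cores being loaded and unloaded accordingly, while the script still uses the same 800% CPU (eight cores) of my regular CPU, which is the same usage the CPU version has.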

Any hint is greatly appreciated.

Answer 1

S.S*_*ace 5

Several things prevent your code from working properly and fast.

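First, by using `randn` you overwrite the allocated `CuArray`s with ordinary CPU arrays, which means the matrix multiplication runs on the CPU. You should use `CUDA.randn` instead. By using `CUDA.randn!`, you do not allocate any memory beyond what has already been allocated.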

Second, you are using global variables and the global scope, which is bad for performance.

Third, you are using `C = A * B`, which reallocates memory on every iteration. You should use the in-place version `mul!`.
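The first and third points can be checked without a GPU. The following is a minimal sketch using only the standard library, in which plain `Array{Float32}` buffers stand in for the preallocated `CuArray`s (an assumption made so the snippet runs anywhere; the small size `N = 4` and the name `A_rebound` are illustrative only). It shows that `=` binds the name to a new array, while `randn!` and `mul!` write into the existing buffers.

```julia
using Random, LinearAlgebra

N = 4  # small size, for illustration only

# Stand-ins for the preallocated device buffers (plain CPU arrays here).
A = Array{Float32}(undef, N, N)
B = Array{Float32}(undef, N, N)
C = Array{Float32}(undef, N, N)

# `=` rebinds the name to a brand-new array; the Float32 buffer is not reused.
A_rebound = A
A_rebound = randn(N, N)
rebound_is_new = A_rebound !== A       # a different object than the buffer
rebound_eltype = eltype(A_rebound)     # Float64, the default element type of randn

# `randn!` and `mul!` fill the existing buffers instead of allocating new ones.
randn!(A)
randn!(B)
result = mul!(C, A, B)
same_buffer = result === C             # mul! returns its destination array
matches = isapprox(C, A * B)           # same values as the allocating form
```

On a machine with a working GPU, evaluating `a_d isa CuArray` after the assignment in the question's loop is a quick way to confirm whether the data is still on the device.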

I would propose the following solution:

using CUDA
using LinearAlgebra
N = 10000

a_d = CuArray{Float32}(undef, (N, N))
b_d = CuArray{Float32}(undef, (N, N))
c_d = CuArray{Float32}(undef, (N, N))

# wrap your code in a function
# `!` is a convention to indicate that the arguments will be modified
function randn_mul!(A, B, C)
    CUDA.randn!(A)
    CUDA.randn!(B)
    mul!(C, A, B)
end

# use CUDA.@time to time the GPU execution time and memory usage:
for i in 1:10
    CUDA.@time randn_mul!(a_d, b_d, c_d)
end

It runs very quickly on my machine.

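Note that the first time the function is called, the execution time and memory usage are higher, because compilation time is included in the measurement whenever a function is called for the first time with a given type signature.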
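The first-call effect is not specific to the GPU and can be observed with CPU arrays as well. The sketch below is only an illustration: `wrapped_mul!` is a hypothetical helper and the 64 x 64 size is arbitrary. It times the same call twice with `@elapsed`; the first measurement includes compiling the function for these argument types, the second does not.

```julia
using LinearAlgebra

# Hypothetical helper, named only for this sketch.
function wrapped_mul!(C, A, B)
    mul!(C, A, B)
    return C
end

A = rand(Float32, 64, 64)
B = rand(Float32, 64, 64)
C = similar(A)

t_first  = @elapsed wrapped_mul!(C, A, B)   # includes compilation for these argument types
t_second = @elapsed wrapped_mul!(C, A, B)   # reuses the already compiled method

println("first call:  ", t_first, " s")
println("second call: ", t_second, " s")
```

For benchmarking, a common approach is to make one untimed warm-up call before the timed loop, so that the reported numbers reflect only the computation.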
