Optimizing simulations in CUDA.jl

Question

Wal*_*cio -2 performance cuda gpu julia

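I'm trying to write a tutorial on GPU computing in Julia. Demonstrating simple matrix operations went smoothly, with the GPU beating its single-threaded and multi-threaded equivalents.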

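Now I'm trying to come up with a more complex example that involves generating simulated data X and computing some estimates β, and this is where things get weird. No matter what I do, the GPU (Nvidia RTX 2070) simulation performs about 20 times worse than its multi-threaded (20 threads) counterpart.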

Here's some code for an MRE:

    # Meta-simulation constants =================================
    replications = 10
    n = 100
    p = 2
    μ = rand(replications)

    # Multi-threaded simulations =================================
    β_par = fill(0., (p, replications))
    function parsim()
      Threads.@threads for r in 1:replications
        X = rand(Float16, (n, p)) .* μ[r]; # Sample data
        β = sum(X .^ 2, dims = 1);   # Estimate parameters
        β_par[:, r] = β
      end
    end

    # GPU simulations =================================
    using CUDA
    β_gpu = CuArray(fill(0., (p, replications)))
    function gpusim()
      for r in 1:replications
        X = CuArray(rand(Float16, (n, p))) .* μ[r]; # Sample data
        β = sum(X .^ 2, dims = 1);   # Estimate parameters
        β_par[:, r] = β
      end
    end

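I've spent a dozen hours trying to get gpusim() to perform at least as well as parsim(). I've read the CUDA.jl and LinearAlgebra.jl documentation countless times, trying to work out whether I really need to write my own kernel; after realizing I have no idea what I'm doing, I try to find a simpler solution using linear algebra. Read some more, get more confident, try writing a kernel again. Rinse and repeat.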

ChatGPT helped me find some direction, but its convoluted solutions were no better than my naive gpusim() above.

I hope humans can do better at helping me understand how (or even whether) I can get gpusim() to outperform parsim() on the task above.

Answer 1

loo*_*ick 6

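If you want to get meaningful results from GPU acceleration, I strongly suggest starting with NVIDIA's CUDA Programming Guide to understand what actually happens when you run a computation on a GPU. There is a lot to unpack here, but let's cover just the basics.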

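Data must be transferred to GPU VRAM before a computation can run on it. This data transfer is often the main bottleneck. In your specific example, CuArray(rand(Float16, (n,p))) generates the random numbers in CPU memory and then copies them to GPU VRAM inside the CuArray constructor.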

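Similarly, μ lives in CPU memory, so something like X .* μ[r] must again copy a single scalar value from one memory region to another before performing the computation, which is very inefficient.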

There are other problems with this code, such as using untyped global variables inside functions.
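
The point about untyped globals can be illustrated with a small CPU-only sketch. The names scale_global, scale_arg and μ_global are made up for this illustration and do not appear in the question. A function that reads a non-const global cannot rely on that global's type, whereas a function that receives the same data as an argument is compiled for the concrete argument types.

```julia
# Hypothetical names, for illustration only.
μ_global = rand(10)                 # non-const, untyped global

# Reads the global directly: the compiler cannot assume the type of μ_global.
scale_global(x, r) = x .* μ_global[r]

# Receives the data as an argument: compiled for the concrete argument types.
scale_arg(x, μ, r) = x .* μ[r]

x = rand(Float32, 100, 2)

# Both versions compute the same values; only type inference differs.
same_values = scale_global(x, 3) == scale_arg(x, μ_global, 3)

# Inferred return type of the argument-passing version.
inferred = only(Base.return_types(scale_arg, (Matrix{Float32}, Vector{Float64}, Int)))
```

In an interactive session, @code_warntype scale_global(x, 3) is a convenient way to see where inference gives up. Declaring the global as const is another option when it never needs to be rebound.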

So, to fix the performance of the baseline GPU implementation:

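Use CUDA.rand and CUDA.fill instead of Base.rand and Base.fill to generate the data directly on the GPU.
Avoid global variables and pass the data to the function as arguments (see the Performance Tips in the Julia documentation).
Disallow scalar indexing on CuArrays with CUDA.allowscalar(false) (see the CUDA.jl documentation).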
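
As a minimal sketch of the scalar-indexing point (the variable names are illustrative, and running it assumes a working CUDA.jl installation with a GPU): once scalar indexing is disallowed, an accidental element-by-element access fails loudly instead of silently triggering one device-to-host copy per element.

```julia
using CUDA

CUDA.allowscalar(false)              # scalar indexing of a CuArray now throws

x = CUDA.ones(Float32, 4)            # allocated and filled on the GPU

# Writing x[1] here would raise an error rather than quietly copying
# a single element back from the device.

total = sum(x)                       # whole-array reduction runs on the GPU

# Explicit, local opt-in for the rare case where one element is really needed.
first_elem = CUDA.@allowscalar x[1]
```

Setting this at the top of a script is a cheap way to catch loops that index into a CuArray one element at a time.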

    # Meta-simulation constants =================================
    replications = 10
    n = 100
    p = 2
    μ = rand(replications)

    # GPU simulations =================================
    using CUDA
    β_gpu = CUDA.fill(0., (p, replications))
    function gpusim!(β, n, μ)
        p, replications = size(β)   # dimensions come from β, not from globals
        for r in 1:replications
            X = CUDA.rand(Float16, (n, p)) .* μ[r]; # Sample data
            @views β[:, r] .= sum(X .^ 2, dims = 1)[:];   # Estimate parameters
        end
    end

    julia> @benchmark gpusim()
    BenchmarkTools.Trial: 5600 samples with 1 evaluation.
     Range (min … max):  725.984 μs … 108.525 ms  ┊ GC (min … max): 0.00% … 25.07%
     Time  (median):     812.230 μs               ┊ GC (median):    0.00%
     Time  (mean ± σ):   888.733 μs ±   2.795 ms  ┊ GC (mean ± σ):  2.07% ±  0.65%

                         ▁▂▃▃▂▂▂▁▃▆██▆▄▃
      ▁▁▂▃▆▆▇█▇▇▇█▅▆▆▇▇█▇███████████████▇▅▃▄▄▄▄▄▃▅▄▅▄▄▅▅▄▅▄▅▄▂▂▁▁▁▁ ▄
      726 μs           Histogram: frequency by time          921 μs <

     Memory estimate: 68.62 KiB, allocs estimate: 1251.

    julia> @benchmark gpusim!(β_gpu,n,μ)
    BenchmarkTools.Trial: 7638 samples with 1 evaluation.
     Range (min … max):  351.787 μs … 115.958 ms  ┊ GC (min … max): 0.00% … 29.52%
     Time  (median):     570.030 μs               ┊ GC (median):    0.00%
     Time  (mean ± σ):   650.441 μs ±   3.159 ms  ┊ GC (mean ± σ):  4.00% ±  0.82%

                                                      ▁▁▁▃▄█▆▂
      ▂▃▃▂▂▁▁▁▁▁▁▁▁▁▁▂▂▂▁▁▁▁▂▁▁▁▂▁▂▂▂▂▄▆▆▅▆▆▅▅▅▅▅▅▆▆██████████▆▅▅▄▃ ▃
      352 μs           Histogram: frequency by time          618 μs <

     Memory estimate: 73.12 KiB, allocs estimate: 1640.

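However, it should be very clear that this is still poor performance; this particular algorithm can be written much more efficiently to make full use of GPU parallelism.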
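
One possible direction, offered as a sketch rather than as the answer's own method: replace the loop over replications with a single batched computation on a 3-D array, so that all replications are generated, scaled and reduced by a few large GPU operations. The helper name batched_estimates is hypothetical, and Float32 is used here instead of the question's Float16 as an assumption. No timings are claimed; at sizes as small as n = 100 and 10 replications, launch overhead may still dominate.

```julia
using CUDA

# Hypothetical helper: all replications in one batched computation.
# X has size (n, p, replications); μ has one entry per replication.
# Written against AbstractArray so the same code runs on Array and CuArray.
function batched_estimates(X::AbstractArray{T,3}, μ::AbstractVector) where {T}
    scale = reshape(μ, 1, 1, :)            # μ[r] broadcasts over slice r
    β = sum(abs2, X .* scale; dims = 1)    # size (1, p, replications)
    return dropdims(β; dims = 1)           # size (p, replications)
end

n, p, replications = 100, 2, 10

if CUDA.functional()
    μ_gpu = CUDA.rand(Float32, replications)        # μ lives on the GPU
    X_gpu = CUDA.rand(Float32, n, p, replications)  # all samples generated on the GPU
    β_gpu = batched_estimates(X_gpu, μ_gpu)         # result stays on the GPU
    β_host = Array(β_gpu)                           # a single copy back to the CPU
end
```

Because the helper is generic, it can be checked on the CPU with ordinary arrays before running it on the GPU, and the only host transfer left is the final copy of the small result matrix.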

Asked:	2 years, 7 months ago
Viewed:	404 times
Last active:	2 years, 6 months ago