Asked by Chr*_*ris · Tags: python, benchmarking, cuda, numpy, numba
I am discovering numba's CUDA extension and looked at the example implementation of matrix multiplication on CUDA. The code is on numba's website.
Then I benchmarked it against what I assumed would be a less-than-ideal implementation: numpy's dot function, multiplying two 1024x1024 matrices (generated with randn(1024,1024)).
The result:
If numpy's algorithm were the naive matrix multiplication, it would need 1024^3 ≈ 1e9 multiplications and additions. That is an average throughput of one operation every 5 ms / 1e9 = 5 picoseconds. My CPU runs at about 3.4 GHz, so one cycle takes about 300 picoseconds.
So my question is: how can numpy's matrix multiplication be 60 times faster than naive matrix multiplication?
I have heard of Strassen's algorithm, which is roughly O(N^2.8); that would still mean about 20 picoseconds per multiplication-addition, i.e. about 15 times faster than one operation per CPU cycle.
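For concreteness, the back-of-the-envelope arithmetic above can be written out as a small script (the 5 ms timing and the 3.4 GHz clock are the figures quoted above):

# Back-of-the-envelope check of the figures quoted above.
N = 1024
t_numpy = 5e-3      # measured numpy time, seconds
clock_hz = 3.4e9    # assumed CPU clock

naive_ops = N ** 3          # multiply-adds for the schoolbook algorithm
strassen_ops = N ** 2.8     # rough count for Strassen-like scaling

print("cycle time          : %.0f ps" % (1e12 / clock_hz))               # ~294 ps
print("ps per op, naive    : %.1f" % (t_numpy / naive_ops * 1e12))       # ~4.7 ps
print("ps per op, N^2.8    : %.1f" % (t_numpy / strassen_ops * 1e12))    # ~18.6 ps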
Edit:
from numba import cuda, float32
import math
import numpy as np
from numpy import zeros
from numpy.random import randn

TPB = 16

@cuda.jit()
def fast_matmul(A, B, C):
    # Define an array in the shared memory
    # The size and type of the arrays must be known at compile time
    sA = cuda.shared.array(shape=(TPB, TPB), dtype=float32)
    sB = cuda.shared.array(shape=(TPB, TPB), dtype=float32)

    x, y = cuda.grid(2)

    tx = cuda.threadIdx.x
    ty = cuda.threadIdx.y
    bpg = cuda.gridDim.x    # blocks per grid

    if x >= C.shape[0] and y >= C.shape[1]:
        # Quit if (x, y) is outside of valid C boundary
        return

    # Each thread computes one element in the result matrix.
    # The dot product is chunked into dot products of TPB-long vectors.
    tmp = 0.
    for i in range(bpg):
        # Preload data into shared memory
        sA[tx, ty] = A[x, ty + i * TPB]
        sB[tx, ty] = B[tx + i * TPB, y]

        # Wait until all threads finish preloading
        cuda.syncthreads()

        # Computes partial product on the shared memory
        for j in range(TPB):
            tmp += sA[tx, j] * sB[j, ty]

        # Wait until all threads finish computing
        cuda.syncthreads()

    C[x, y] = tmp

N = 1024
a = randn(N, N).astype(np.float32)
b = a.T.copy()
c = zeros((N, N), dtype=np.float32)

threadsperblock = (16, 16)
blockspergrid_x = math.ceil(a.shape[0] / threadsperblock[0])
blockspergrid_y = math.ceil(a.shape[1] / threadsperblock[1])
blockspergrid = (blockspergrid_x, blockspergrid_y)

fast_matmul[blockspergrid, threadsperblock](a, b, c)  # takes 40ms

Meanwhile, with numpy:
c = a.dot(b)  # takes 5ms

I don't think data throughput is the bottleneck here, since the matrices hold a few million elements while the computation needs billions of cycles.
As one commenter asked: the arrays are 32-bit floats.
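A quick sanity check of that point, using those 4-byte floats:

# Data volume vs. arithmetic work for N = 1024, float32.
N = 1024
data_bytes = 3 * N * N * 4      # A, B and C, 4 bytes per element
flops = 2 * N ** 3              # one multiply + one add per inner-loop step

print("data      : %.1f MB" % (data_bytes / 1e6))      # ~12.6 MB
print("flops     : %.2e" % flops)                      # ~2.1e9
print("flops/byte: %.0f" % (flops / data_bytes))       # ~170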
Edit 2:

I know that a 3 GHz CPU cannot perform any single operation in 5 ps, so obviously I am talking about average throughput. Since that throughput is 30 times better than an estimate assuming one multiplication and one addition per cycle, the implementation must be heavily optimized, presumably using vector instructions and multiple cores.
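To see why that is plausible, here is a rough peak-throughput estimate; the vector width, FMA rate and core count below are generic assumptions, not known values for the machine in question:

# Hypothetical peak for a CPU with 256-bit FMA units (assumed, not measured).
clock_hz = 3.4e9
lanes = 8            # float32 lanes per 256-bit register (assumption)
fma_per_cycle = 2    # FMA instructions issued per cycle (assumption)
cores = 4            # number of cores (assumption)

peak = clock_hz * lanes * fma_per_cycle * 2 * cores / 1e9   # each FMA = 2 flops
measured = 2 * 1024 ** 3 / 5e-3 / 1e9                       # the ~5 ms numpy run

print("assumed peak : %.0f GFLOP/s" % peak)       # ~435
print("measured     : %.0f GFLOP/s" % measured)   # ~430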
For CUDA, the base type I tested with is single-precision float, on the prior assumption that single (or half) precision is the sweet spot for GPUs.

The most important takeaway for me, though, is that code compiled for the CPU can now be optimized extremely aggressively (vector operations, multithreading), and in some cases it is hard to beat it with a GPU, even for computations with a very regular pattern. Even if you expect on the order of 1000 operations per input sample, the transfers to and from the device can be very expensive.
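To put a number on that transfer cost, the copies can be timed on their own; a minimal sketch with Numba (it needs a CUDA-capable GPU):

# Time host->device and device->host copies separately from any kernel.
import time
import numpy as np
from numba import cuda

N = 1024
a = np.random.randn(N, N).astype(np.float32)

t0 = time.time()
a_d = cuda.to_device(a)          # host -> device
cuda.synchronize()
t1 = time.time()
a_h = a_d.copy_to_host()         # device -> host
cuda.synchronize()
t2 = time.time()

print("H->D %.2f ms, D->H %.2f ms for %.1f MB"
      % ((t1 - t0) * 1e3, (t2 - t1) * 1e3, a.nbytes / 1e6))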
Last but not least, I would like to thank @user2640045 for the benchmark and the log-log interpolation, which seems to show that the computation is O(N^3), so numpy appears to use the plain matrix-product algorithm.
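For reference, the exponent can be estimated with exactly this kind of log-log fit; the sketch below is my own illustration of the idea, with arbitrary sizes, not @user2640045's code:

# Estimate the complexity exponent of a.dot(b) from timings at several sizes.
import time
import numpy as np

sizes = [256, 512, 1024, 2048]
times = []
for n in sizes:
    a = np.random.randn(n, n).astype(np.float32)
    b = np.random.randn(n, n).astype(np.float32)
    a.dot(b)                          # warm-up
    t0 = time.perf_counter()
    a.dot(b)
    times.append(time.perf_counter() - t0)

slope, _ = np.polyfit(np.log(sizes), np.log(times), 1)
print("fitted exponent: %.2f" % slope)   # close to 3 for a classical GEMM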
Why is CUDA GPU matrix multiplication slower than numpy?

The really short answer is that it probably isn't.

Why is numpy so fast?
Because it typically uses a state-of-the-art, platform-optimized (and usually multithreaded) BLAS implementation behind the scenes for GEMM operations.
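You can check which BLAS a given numpy build is linked against, for example:

# Shows the BLAS/LAPACK libraries this numpy build was linked against
# (e.g. OpenBLAS or MKL); the output format varies between numpy versions.
import numpy as np
np.show_config()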
As far as I can tell, your confusion about why you measure what you measure comes from a few misunderstandings of how Numba works, how numpy works, and how GPUs and CPUs work. Let's modify your code slightly so that it measures the time and converts it to GFLOP/s for the GEMM problem you are studying. It runs four cases, each 10 times: the Numba kernel with host source and destination arrays, the Numba kernel with device arrays, numpy's dot on the CPU, and cupy's dot on the GPU.
The code looks like this:
from numba import cuda, float32
import numpy as np
import time
import math
import cupy

TPB = 16

@cuda.jit()
def fast_matmul(A, B, C):
    # Define an array in the shared memory
    # The size and type of the arrays must be known at compile time
    sA = cuda.shared.array(shape=(TPB, TPB), dtype=float32)
    sB = cuda.shared.array(shape=(TPB, TPB), dtype=float32)

    x, y = cuda.grid(2)

    tx = cuda.threadIdx.x
    ty = cuda.threadIdx.y
    bpg = cuda.gridDim.x    # blocks per grid

    if x >= C.shape[0] and y >= C.shape[1]:
        # Quit if (x, y) is outside of valid C boundary
        return

    # Each thread computes one element in the result matrix.
    # The dot product is chunked into dot products of TPB-long vectors.
    tmp = 0.
    for i in range(bpg):
        # Preload data into shared memory
        sA[tx, ty] = A[x, ty + i * TPB]
        sB[tx, ty] = B[tx + i * TPB, y]

        # Wait until all threads finish preloading
        cuda.syncthreads()

        # Computes partial product on the shared memory
        for j in range(TPB):
            tmp += sA[tx, j] * sB[j, ty]

        # Wait until all threads finish computing
        cuda.syncthreads()

    C[x, y] = tmp

N = 1024
a = np.random.randn(N, N).astype(np.float32)
b = a.T.copy()
c = np.zeros((N, N), dtype=np.float32)

threadsperblock = (16, 16)
blockspergrid_x = math.ceil(a.shape[0] / threadsperblock[0])
blockspergrid_y = math.ceil(a.shape[1] / threadsperblock[1])
blockspergrid = (blockspergrid_x, blockspergrid_y)

a_d = cuda.to_device(a)
b_d = cuda.to_device(b)
c_d = cuda.to_device(c)

gfcalc = lambda t0, t1: 1e-9 * (N * N * (2 * N)) / (t1 - t0)

for i in range(0, 10):
    t0 = time.time()
    fast_matmul[blockspergrid, threadsperblock](a, b, c)
    cuda.synchronize()
    t1 = time.time()
    print("Numba, host source & dest", gfcalc(t0, t1))

for i in range(0, 10):
    t0 = time.time()
    fast_matmul[blockspergrid, threadsperblock](a_d, b_d, c_d)
    cuda.synchronize()
    t1 = time.time()
    print("Numba, device source & dest", gfcalc(t0, t1))

for i in range(0, 10):
    t0 = time.time()
    c_h = a.dot(b)
    t1 = time.time()
    print("Numpy", gfcalc(t0, t1))

with cupy.cuda.Device() as dev:
    a_c = cupy.asarray(a)
    b_c = cupy.asarray(b)
    c_c = cupy.asarray(c)
    for i in range(0, 10):
        t0 = time.time()
        a_c.dot(b_c, out=c_c)
        dev.synchronize()
        t1 = time.time()
        print("cupy device source & dest", gfcalc(t0, t1))
The result of each run is printed individually in GFLOP/s, so that warm-up effects are visible. Run on an anonymous Google Colab session:
Numba, host source & dest 7.157921582274139
Numba, host source & dest 35.41623777048565
Numba, host source & dest 41.53596793561073
Numba, host source & dest 40.067434107237034
Numba, host source & dest 41.853653713591996
Numba, host source & dest 49.797096688049365
Numba, host source & dest 73.13115945878288
Numba, host source & dest 70.30009174431994
Numba, host source & dest 74.86845532463607
Numba, host source & dest 76.87160119090733
Numba, device source & dest 105.97453061087833
Numba, device source & dest 105.56095086831826
Numba, device source & dest 105.78037879907214
Numba, device source & dest 106.06063296721804
Numba, device source & dest 106.87105343720403
Numba, device source & dest 104.89221337519061
Numba, device source & dest 106.34112058583715
Numba, device source & dest 103.33148924767107
Numba, device source & dest 106.94211047481143
Numba, device source & dest 104.29586223965393
Numpy 81.5965580615561
Numpy 70.34621140682276
Numpy 58.2601841797442
Numpy 79.16676998234227
Numpy 81.37246257365994
Numpy 83.9362524903643
Numpy 76.91820953485447
Numpy 68.72629315606706
Numpy 56.53278637482029
Numpy 76.50855577892254
cupy device source & dest 3298.132279290001
cupy device source & dest 2730.281677702635
cupy device source & dest 4824.423810787891
cupy device source & dest 4698.591160532599
cupy device source & dest 3971.4282428311253
cupy device source & dest 4943.578076147636
cupy device source & dest 3046.0599441126114
cupy device source & dest 2642.1822395837467
cupy device source & dest 3298.132279290001
cupy device source & dest 5144.0315561056495
What do we see:

- The Numba kernel called with host source and destination arrays starts at about 7 GFLOP/s and only climbs to roughly 75 GFLOP/s, because each call also pays for implicit host-to-device and device-to-host copies (plus JIT warm-up on the first run).
- The same kernel with the arrays already resident on the device runs at a steady ~105 GFLOP/s.
- numpy's dot reaches roughly 55-85 GFLOP/s on the Colab CPU, i.e. the same ballpark as the Numba kernel.
- cupy's dot, which dispatches to the GPU's BLAS, delivers roughly 2600-5100 GFLOP/s, about 30-60 times faster than either of the others.
So, to summarize: your assumptions about what is going on are incorrect, and they lead you to some misleading conclusions. Numpy's dot is a high-performance, close-to-optimal implementation, but it is still much slower than a dedicated compute GPU running a state-of-the-art BLAS implementation. Benchmarking correctly is a science in itself, and it is easy to draw the wrong conclusions unless you understand exactly what you are doing.
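As one illustration of the care required, a common CPU-side pattern is to warm up first and then report the best of several repeats; this is just a sketch, not the methodology used above:

# Warm up, repeat, and report the best run for numpy's dot.
import timeit
import numpy as np

N = 1024
a = np.random.randn(N, N).astype(np.float32)
b = a.T.copy()

a.dot(b)                                                    # warm-up
best = min(timeit.repeat(lambda: a.dot(b), number=1, repeat=10))
print("best of 10: %.2f ms, %.0f GFLOP/s"
      % (best * 1e3, 2 * N ** 3 / best / 1e9))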