How can torch multiply two 10000 x 10000 matrices in almost zero time? Why does the time drop so sharply, from 349 ms to 999 µs?

Question

use*_*264 10 python performance jupyter-notebook pytorch

Here is an excerpt from Jupyter:

In [1]:

import torch, numpy as np, datetime
cuda = torch.device('cuda')

In [2]:

ac = torch.randn(10000, 10000).to(cuda)
bc = torch.randn(10000, 10000).to(cuda)
%time cc = torch.matmul(ac, bc)
print(cc[0, 0], torch.sum(ac[0, :] * bc[:, 0]))

Wall time: 349 ms

tensor(17.0374, device='cuda:0') tensor(17.0376, device='cuda:0')

That is fast but still plausible (0.35 s for 1e12 multiplications).

But if we repeat the same thing:

ac = torch.randn(10000, 10000).to(cuda)
bc = torch.randn(10000, 10000).to(cuda)
%time cc = torch.matmul(ac, bc)
print(cc[0, 0], torch.sum(ac[0, :] * bc[:, 0]))

Wall time: 999 µs

tensor(-78.7172, device='cuda:0') tensor(-78.7173, device='cuda:0')

1e12 multiplications in 1 ms?!

Why did the time drop from 349 ms to 1 ms?
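A quick back-of-the-envelope check shows why the second figure cannot be a real compute time. The sketch below converts each wall time into the throughput it would imply, counting one multiply and one add per inner-product term (2 * n^3 floating-point operations). The helper name `implied_tflops` is made up for illustration, and the 7.5 TFLOP/s peak is only an approximate FP32 figure for an RTX 2070, assumed here rather than taken from the post.

```python
def implied_tflops(n, seconds):
    """Throughput (TFLOP/s) implied by an n x n matmul finishing in `seconds`."""
    flops = 2 * n ** 3          # one multiply + one add per inner-product term
    return flops / seconds / 1e12

# Assumption: approximate FP32 peak of an RTX 2070, not stated in the post.
ASSUMED_PEAK_TFLOPS = 7.5

first = implied_tflops(10000, 0.349)     # first %time result
second = implied_tflops(10000, 999e-6)   # second %time result

for label, value in (("349 ms", first), ("999 us", second)):
    verdict = "plausible" if value <= ASSUMED_PEAK_TFLOPS else "not physically possible"
    print(f"{label}: {value:.2f} TFLOP/s implied ({verdict})")
```

The first timing implies roughly 5.7 TFLOP/s, which fits under the assumed peak; the second implies roughly 2000 TFLOP/s, so the 999 µs cannot cover the whole multiplication.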

Info:

Tested on a GeForce RTX 2070;
can be reproduced on Google Colab.

Answer 1

Ber*_*iel 11

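This has already been discussed on the PyTorch discussion forum: Measuring GPU tensor operation speed.

I'd like to highlight two comments from that thread:

From @apaszke:

[...] the GPU executes all operations asynchronously, so you need to insert proper barriers for your benchmarks to be correct

From @ngimel:

I believe cublas handles are allocated lazily now, which means that the first operation requiring cublas will have the overhead of creating the cublas handle, and that includes some internal allocations. So there's no way to avoid it other than calling some function that requires cublas before the timing loop.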
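To make the two comments concrete, here is a minimal sketch that separates the time needed to launch a matmul from the time needed to finish it. The function name `launch_vs_total` and the default size are illustrative choices, not part of the thread. It does one warm-up call first (so lazy cuBLAS setup is excluded) and falls back to the CPU when no GPU is present, in which case the two numbers will be nearly identical because CPU execution is synchronous.

```python
import time
import torch

def launch_vs_total(n=2000):
    """Return (launch_seconds, total_seconds) for one n x n matmul."""
    use_cuda = torch.cuda.is_available()
    device = torch.device("cuda" if use_cuda else "cpu")
    a = torch.randn(n, n, device=device)
    b = torch.randn(n, n, device=device)

    # Warm-up: the first cuBLAS call pays one-time setup costs.
    torch.matmul(a, b)
    if use_cuda:
        torch.cuda.synchronize()

    start = time.perf_counter()
    c = torch.matmul(a, b)            # on CUDA this only queues the kernel
    launch = time.perf_counter() - start
    if use_cuda:
        torch.cuda.synchronize()      # barrier: wait until the kernel is done
    total = time.perf_counter() - start
    return launch, total

if __name__ == "__main__":
    launch, total = launch_vs_total()
    print(f"launch: {launch * 1e3:.3f} ms, launch + compute: {total * 1e3:.3f} ms")
```

On a GPU, the first number is what an unsynchronized `%time` reports, which would explain a reading in the microsecond range for a job that takes hundreds of milliseconds.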

Basically, you have to call synchronize() to get a proper measurement:

import torch

x = torch.randn(10000, 10000).to("cuda")
w = torch.randn(10000, 10000).to("cuda")
# ensure that context initialization finishes before you start measuring time
torch.cuda.synchronize()

%time y = x.mm(w.t()); torch.cuda.synchronize()

CPU times: user 288 ms, sys: 191 ms, total: 479 ms

Wall time: 492 ms

x = torch.randn(10000, 10000).to("cuda")
w = torch.randn(10000, 10000).to("cuda")
# ensure that context initialization finishes before you start measuring time
torch.cuda.synchronize()

%time y = x.mm(w.t()); torch.cuda.synchronize()

CPU times: user 237 ms, sys: 231 ms, total: 468 ms

Wall time: 469 ms
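As an alternative to wrapping `%time` around an explicit `synchronize()`, the timing can be taken on the GPU itself with CUDA events. The following is a sketch under a few assumptions: the function name `time_matmul_ms`, the warm-up call, and the averaging over `repeats` are illustrative additions, not something the answer prescribes. It uses `torch.cuda.Event(enable_timing=True)` with `record()` and `elapsed_time()`, and falls back to a host timer when CUDA is unavailable.

```python
import time
import torch

def time_matmul_ms(n=10000, repeats=5):
    """Average time in milliseconds of x.mm(w.t()) for n x n float32 tensors."""
    use_cuda = torch.cuda.is_available()
    device = torch.device("cuda" if use_cuda else "cpu")
    x = torch.randn(n, n, device=device)
    w = torch.randn(n, n, device=device)

    x.mm(w.t())                        # warm-up, excludes one-time cuBLAS setup
    if use_cuda:
        torch.cuda.synchronize()       # make sure setup and warm-up are finished
        start = torch.cuda.Event(enable_timing=True)
        end = torch.cuda.Event(enable_timing=True)
        start.record()
        for _ in range(repeats):
            x.mm(w.t())
        end.record()
        torch.cuda.synchronize()       # events must complete before reading them
        return start.elapsed_time(end) / repeats

    # CPU fallback: execution is synchronous, so a host timer is enough.
    t0 = time.perf_counter()
    for _ in range(repeats):
        x.mm(w.t())
    return (time.perf_counter() - t0) * 1000.0 / repeats

if __name__ == "__main__":
    # Smaller size on CPU so the demonstration stays quick (arbitrary choice).
    size = 10000 if torch.cuda.is_available() else 1000
    print(f"{size} x {size} matmul: {time_matmul_ms(size):.1f} ms on average")
```

Averaging over several repetitions after a warm-up should give steadier numbers than a single `%time` call, though the exact values will depend on the GPU and on the PyTorch build.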
