Why does PyTorch matmul give different results when executed on CPU and GPU?

Question

I am trying to pin down the rounding differences between numpy/pytorch, gpu/cpu, and float16/float32 numbers, and what I am finding confuses me.

The basic version is:

import torch

a = torch.rand(3, 4, dtype=torch.float32)
b = torch.rand(4, 5, dtype=torch.float32)
print(a.numpy()@b.numpy() - a@b)

The result is all zeros, as expected. However,

print((a.cuda()@b.cuda()).cpu() - a@b)

gives a non-zero result. Why does PyTorch float32 matmul behave differently on GPU and CPU?

An even more confusing experiment involves float16, as follows:


a = torch.rand(3, 4, dtype=torch.float16)
b = torch.rand(4, 5, dtype=torch.float16)
print(a.numpy()@b.numpy() - a@b)
print((a.cuda()@b.cuda()).cpu() - a@b)

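Both results are non-zero. Why do numpy and torch handle float16 numbers differently? I know that the CPU can only do float32 arithmetic and that numpy converts float16 to float32 before computing, but the torch calculation is also executed on the CPU.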

And guess what, print((a.cuda()@b.cuda()).cpu() - a.numpy()@b.numpy()) gives an all-zero result! This is pure fantasy to me...

The environment is as follows:

python: 3.8.5
torch: 1.7.0
numpy: 1.21.2
cuda: 11.1
gpu: GeForce RTX 3090

Following the suggestion of some commenters, I added the following equality tests:

(a.numpy()@b.numpy() - (a@b).numpy()).any()
((a.cuda()@b.cuda()).cpu() - a@b).numpy().any()
(a.numpy()@b.numpy() - (a@b).numpy()).any()
((a.cuda()@b.cuda()).cpu() - a@b).numpy().any()
((a.cuda()@b.cuda()).cpu().numpy() - a.numpy()@b.numpy()).any()

Running each of the five expressions above in turn prints:

False
True
True
True
False

For the last one, I have tried it several times, and I think I can rule out luck.

Answer 1

hkc*_*rex 1

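As @talonmies mentioned, the differences are mostly numerical. The CPU and the GPU, and their respective BLAS libraries, are implemented differently and use different operations and orders of operations, hence the numerical differences.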

One possible cause is sequential operation versus reduction (https://discuss.pytorch.org/t/why-different-results-when-multiplying-in-cpu-than-in-gpu/1356/3): for example, (((a+b)+c)+d) has different numerical properties from ((a+b)+(c+d)).
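To make the ordering effect concrete, here is a minimal sketch in plain Python (double precision, no torch needed). The helper names `sum_sequential` and `sum_pairwise` and the input values are made up for illustration; this is not how any particular BLAS library orders its additions, only a demonstration that floating-point addition is not associative.

```python
def sum_sequential(xs):
    """Left-to-right accumulation: ((a + b) + c) + d."""
    total = 0.0
    for x in xs:
        total += x
    return total

def sum_pairwise(xs):
    """Tree-style reduction: (a + b) + (c + d)."""
    if not xs:
        return 0.0
    if len(xs) == 1:
        return xs[0]
    mid = len(xs) // 2
    return sum_pairwise(xs[:mid]) + sum_pairwise(xs[mid:])

# Illustrative values: the exact sum is 2.0, but 1.0 is below the
# rounding granularity of a double near 1e16, so it can get lost.
values = [1e16, 1.0, -1e16, 1.0]

print(sum_sequential(values))  # 1.0
print(sum_pairwise(values))    # 0.0
```

Neither order recovers the exact sum of 2.0, and the two orders disagree with each other. The same thing happens at much smaller magnitudes in float32 and float16, which have far fewer significand bits.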

This question also discusses fused operations (multiply-add), which can cause numerical differences.
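As a rough illustration of why a fused multiply-add can differ from a separate multiply and add, the sketch below emulates the fused version with exact rational arithmetic. The function names and input values are hypothetical, and real hardware does this in a single instruction; the only point is that fusing rounds once where the separate version rounds twice.

```python
from fractions import Fraction

def multiply_add_separate(a, b, c):
    """a*b is rounded to double first, then the sum is rounded again."""
    return a * b + c

def multiply_add_fused(a, b, c):
    """Emulated fused multiply-add: exact a*b + c, rounded to double once."""
    return float(Fraction(a) * Fraction(b) + Fraction(c))

# Illustrative values: the exact product is 1 - 2**-54, which rounds to 1.0
# when stored as a double on its own.
a = 1.0 + 2.0 ** -27
b = 1.0 - 2.0 ** -27
c = -1.0

print(multiply_add_separate(a, b, c))  # 0.0
print(multiply_add_fused(a, b, c))     # -2**-54, about -5.55e-17
```

The separate version loses the small term when the product is rounded, while the fused version keeps it.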

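I ran some tests and found that the GPU's float16 output can be reproduced if we promote the inputs to float32 before the computation and cast the result back down to float16 afterwards. This may be caused by internal intermediate casting, or by the better numerical stability of fused operations (toggling torch.backends.cudnn.enabled makes no difference). This does not explain the float32 case, though.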

import torch

def test(L, M, N):
    # test (L*M) @ (M*N)
    for _ in range(5000):
        a = torch.rand(L, M, dtype=torch.float16)
        b = torch.rand(M, N, dtype=torch.float16)

        cpu_result = a@b
        gpu_result = (a.cuda()@b.cuda()).cpu()
        if (cpu_result-gpu_result).any():
            print(f'({L}x{M}) @ ({M}x{N}) failed')
            return
    else:
        print(f'({L}x{M}) @ ({M}x{N}) passed')


test(1, 1, 1)
test(1, 2, 1)
test(4, 1, 4)
test(4, 4, 4)

def test2():
    for _ in range(5000):
        a = torch.rand(1, 2, dtype=torch.float16)
        b = torch.rand(2, 1, dtype=torch.float16)

        cpu_result = a@b
        gpu_result = (a.cuda()@b.cuda()).cpu()

        half_result = a[0,0]*b[0,0] + a[0,1]*b[1,0]
        convert_result = (a[0,0].float()*b[0,0].float() + a[0,1].float()*b[1,0].float()).half()

        if ((cpu_result-half_result).any()):
            print('CPU != half')
            return
        if (gpu_result-convert_result).any():
            print('GPU != convert')
            return
    else:
        print('All passed')

test2()

Output:

(1x1) @ (1x1) passed
(1x2) @ (2x1) failed
(4x1) @ (1x4) passed
(4x4) @ (4x4) failed
All passed

You can see that when the inner dimension is 1, the check passes (no multiply-add/reduction is needed).
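The two rounding strategies compared in test2 can also be emulated without a GPU. The sketch below uses the standard library's `struct` half-precision format (`'e'`) to round to float16; the helper names `to_half`, `dot_half_steps` and `dot_wide_then_round` and the input values are assumptions made for illustration. It does not claim to reproduce what the CPU or cuBLAS kernels actually do, only the "round after every step" versus "compute wide, round once" distinction.

```python
import struct

def to_half(x):
    """Round a Python float to the nearest IEEE 754 half-precision value."""
    return struct.unpack('<e', struct.pack('<e', x))[0]

def dot_half_steps(a, b):
    """Dot product that rounds to float16 after every multiply and add."""
    acc = 0.0
    for x, y in zip(a, b):
        acc = to_half(acc + to_half(x * y))
    return acc

def dot_wide_then_round(a, b):
    """Dot product accumulated in double precision, rounded to float16 once."""
    return to_half(sum(x * y for x, y in zip(a, b)))

# Illustrative inputs, all exactly representable in float16.
a = [1.0009765625, 1.0]   # 1 + 2**-10 and 1.0
b = [1.25, 1.0]

print(dot_half_steps(a, b))       # 2.25
print(dot_wide_then_round(a, b))  # 2.251953125
```

With a single product (inner dimension 1) there is only one rounding step either way, so both functions return the same value, which is consistent with the passing 1x1 and 4x1 cases above.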
