Why runtime does not scale with FLOPs - Pointwise Multiplication vs 2D Convolutions


Background

Based on the Fourier convolution theorem, a convolution in the spatial domain is equivalent to a pointwise multiplication in the Fourier domain (and vice versa). I implemented the torch.nn.Conv2d "operation" in the Fourier domain in PyTorch by performing a pointwise multiplication instead of a convolution, with the kernel transformed to the input size (as described here: https://arxiv.org/pdf/1312.5851.pdf).
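
As a sanity check of that equivalence, a circular 2D convolution computed with torch.fft matches a pointwise multiplication of the transforms. Below is a minimal single-channel sketch of that check (not the actual layer implementation; it assumes PyTorch >= 1.8 for the torch.fft module):

import torch
import torch.nn.functional as F

H, W = 32, 60
x = torch.randn(1, 1, H, W)
k = torch.randn(1, 1, 3, 3)

# Zero-pad the 3x3 kernel to the full input size, as described above.
k_pad = F.pad(k, (0, W - 3, 0, H - 3))

# Fourier domain: FFT -> pointwise multiply -> inverse FFT (circular convolution).
y_freq = torch.fft.irfft2(torch.fft.rfft2(x) * torch.fft.rfft2(k_pad), s=(H, W))

# Spatial domain: the same circular convolution via conv2d on a circularly
# padded input, with a flipped kernel (conv2d actually computes cross-correlation).
x_wrap = F.pad(x, (2, 0, 2, 0), mode='circular')
y_spatial = F.conv2d(x_wrap, k.flip(-1, -2))

print(torch.allclose(y_freq, y_spatial, atol=1e-3))  # True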

Expectation and Results

I found that it performs poorly, similar to: Keras/Tensorflow - Fourier pointwise multiplication implementation of conv2d runs 4x slower than spatial convolution.

After extensive benchmarking, the pointwise multiplication appears to be the main bottleneck of the operation. During benchmarking I excluded the FFT step in order to isolate the layer's operation (and used a stored kernel of the appropriate size).

This is confusing when considering the number of FLOPs required for a 2D convolution (stride = 1) versus an elementwise multiplication:

  • Conv2d FLOPs: Kernel_H * Kernel_W * C_in * C_out * H * W
  • Pointwise FLOPs: C_in * C_out * H * W

For example, given H = 32, W = 60, C_in = 64, C_out = 256:

  • Conv2d FLOPs (k = 16): 16 * 16 * 32 * 60 * 64 * 256 = 8053 MFLOPs
  • Pointwise FLOPs: 64 * 256 * 32 * 60 = 31.46 MFLOPs
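
As a quick check of the arithmetic (plain Python, using the symbols from the formulas above):

# Sanity check of the FLOP estimates above.
H, W, C_in, C_out, K = 32, 60, 64, 256, 16

conv_mflops = K * K * C_in * C_out * H * W / 1e6   # 8053.06368 MFLOPs
pointwise_mflops = C_in * C_out * H * W / 1e6      # 31.45728 MFLOPs
print(conv_mflops / pointwise_mflops)              # 256.0, i.e. a factor of K * K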

Considering the huge difference in FLOPs, I expected the 2D convolution to take much longer to run (I have read that GPUs are well optimized for dot products).

I created a simple script to benchmark the pointwise multiplication of a torch.Tensor against torch.nn.Conv2d, since the elementwise multiplication appeared to perform comparably to, or slower than, the 2D convolution.

Below is an overview of two such benchmark results on CPU and GPU (i9900k, with torch.set_num_threads(1)).

Results - CPU (i9900k)

(# Kernel Size = 16)

Benchmark Overview (device = cpu):
    Number of test iterations: 100
    Number of warm-up iterations: 5
    Pointwise: [1, 256, 32, 60] * [64, 256, 32, 60]
    Conv2d(in_ch=256, out_ch=64, kernel_size=16): Conv2d([1, 256, 32, 60])
    FLOP Estimation:
        Conv2d:      8053.06368 MFlops
        Pointwise:   31.45728 MFlops


Benchmark Results (device = cpu)
    Pointwise:   16.139 +/- 0.786 ms
    Conv2d:      12.947 +/- 0.784 ms

-------------------------
(# Kernel Size = 5)

Benchmark Overview (device = cpu):
    Number of test iterations: 100
    Number of warm-up iterations: 5
    Pointwise: [1, 256, 32, 60] * [64, 256, 32, 60]
    Conv2d(in_ch=256, out_ch=64, kernel_size=5): Conv2d([1, 256, 32, 60])
    FLOP Estimation:
        Conv2d:      786.432 MFlops
        Pointwise:   31.45728 MFlops

Benchmark Results (device = cpu)
    Pointwise:   36.085 +/- 3.668 ms
    Conv2d:      9.344 +/- 0.952 ms


Results - GPU (RTX Titan)

(# Kernel Size = 16)

Benchmark Overview (device = cuda:1):
    Number of test iterations: 1000
    Number of warm-up iterations: 5
    Pointwise: [1, 256, 32, 60] * [64, 256, 32, 60]
    Conv2d(in_ch=256, out_ch=64, kernel_size=16): Conv2d([1, 256, 32, 60])
    FLOP Estimation:
        Conv2d:      8053.06368 MFlops
        Pointwise:   31.45728 MFlops

Benchmark Results (device = cuda:1)
    Pointwise:   0.698 +/- 0.031 ms
    Conv2d:      2.916 +/- 0.161 ms

------------------------------------

(# Kernel size = 3)

Benchmark Overview (device = cuda:1):
    Number of test iterations: 100
    Number of warm-up iterations: 5
    Pointwise: [1, 256, 32, 60] * [64, 256, 32, 60]
    Conv2d(in_ch=256, out_ch=64, kernel_size=3): Conv2d([1, 256, 32, 60])
    FLOP Estimation:
        Conv2d:      283.11552 MFlops
        Pointwise:   31.45728 MFlops
        FreqConv:    62.91456 MFlops

Benchmark Results (device = cuda:1)
    Pointwise:   0.681 +/- 0.011 ms
    Conv2d:      0.126 +/- 0.034 ms



The results do not change significantly if I vary H, W, or the channel counts. However, for smaller kernels, the pointwise multiplication comes out noticeably slower by comparison.

Can anyone suggest why the pointwise multiplication appears to be so slow when it requires at least two orders of magnitude fewer FLOPs, or whether there is an error in my reasoning or code?

Benchmark Implementation

import torch
import numpy as np
from torch import nn
from time import time

torch.set_num_threads(1)

in_ch = 256
out_ch = 64
height = 32
width = 60
kernel_size = 16

warmup = 5
iters = 100

flops_pointwise = (out_ch * in_ch * height * width)
m_flops_conv = (flops_pointwise * kernel_size ** 2) / 1e6
m_flops_pw = flops_pointwise / 1e6
# FreqConv: frequency-domain (complex) multiply, counted as 2x the real pointwise FLOPs
m_flops_freq_conv = (2 * flops_pointwise) / 1e6

# Device to run benchmark on, e.g. 'cpu' or 'cuda:X'
device = 'cpu'

print(f'Benchmark Overview (device = {device}):')
print(f'\tNumber of test iterations: {iters}')
print(f'\tNumber of warm-up iterations: {warmup}')
print(f'\tPointwise: [1, {in_ch}, {height}, {width}] * [{out_ch}, {in_ch}, {height}, {width}]') 
print(f'\tConv2d(in_ch={in_ch}, out_ch={out_ch}, kernel_size={kernel_size}): Conv2d([1, {in_ch}, {height}, {width}])')

print('\tFLOP Estimation:')
print(f'\t\tConv2d:\t\t {m_flops_conv} MFlops')
print(f'\t\tPointwise:\t {m_flops_pw} MFlops')
print(f'\t\tFreqConv:\t {m_flops_freq_conv} MFlops')

print()

def benchmark(input_gen, operation, warmup=5, iters=1000):
    duration = []
    for i in range(iters + warmup):

        input = input_gen()

        start = time() # start timer
        with torch.no_grad():
            operation(input)

        # Sync if using cuda
        if device[:4] == 'cuda':
            torch.cuda.synchronize(device)
        end = time() # end timer

        if i < warmup:
            continue

        duration.append((end - start) * 1e3) # ms

    return np.array(duration)


def pointwise(input):
    # Broadcasted elementwise multiply: [1, C_in, H, W] * [C_out, C_in, H, W];
    # the result is discarded, only the multiply itself is timed.
    x, y = input
    x * y

# Helper methods to generate new data
# for every iteration inside of the benchmark method

def _gen_pw_input(in_ch, out_ch, height, width):
    x = torch.rand(1, in_ch, height, width).to(device)
    k = torch.randn(out_ch, in_ch, height, width).to(device)
    return x, k

gen_pw_input = lambda : _gen_pw_input(in_ch, out_ch, height, width)

def _gen_conv_input(in_ch, out_ch, height, width):
    x = torch.rand(1, in_ch, height, width).to(device)
    return x

gen_conv_input = lambda : _gen_conv_input(in_ch, out_ch, height, width)



conv2d = nn.Conv2d(in_ch, out_ch, kernel_size=kernel_size).to(device)

pw_res = benchmark(gen_pw_input, pointwise, warmup=warmup, iters=iters)
conv_res = benchmark(gen_conv_input, conv2d, warmup=warmup, iters=iters)

print(f'Benchmark Results (device = {device})')
print('\tPointwise:\t {:.3f} +/- {:.3f} ms'.format(pw_res.mean(), pw_res.std()))
print('\tConv2d:\t\t {:.3f} +/- {:.3f} ms'.format(conv_res.mean(), conv_res.std()))
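
For reference, the same comparison can also be run with torch.utils.benchmark (PyTorch >= 1.6), which takes care of warm-up and CUDA synchronization internally. This is a minimal sketch, not part of the original script above, and it keeps the inputs fixed rather than regenerating them each iteration:

import torch
import torch.utils.benchmark as benchmark
from torch import nn

device = 'cpu'  # or e.g. 'cuda:0'

x = torch.rand(1, 256, 32, 60, device=device)
k = torch.randn(64, 256, 32, 60, device=device)

conv2d = nn.Conv2d(256, 64, kernel_size=16).to(device)
for p in conv2d.parameters():
    p.requires_grad_(False)  # drop autograd bookkeeping for a fair forward-only timing

# Timer handles warm-up and CUDA synchronization; printing the Measurement
# shows the per-run time.
t_pw = benchmark.Timer(stmt='x * k', globals={'x': x, 'k': k})
t_conv = benchmark.Timer(stmt='conv2d(x)', globals={'conv2d': conv2d, 'x': x})

print(t_pw.timeit(100))
print(t_conv.timeit(100))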

Eigen

I also implemented a basic benchmark in Eigen (C++) to compare the elementwise multiplication, and it showed results similar to (slightly slower than) what I observed in PyTorch; the BLAS backend used by PyTorch appears to be well optimized.