ONNX Runtime inference is much slower than PyTorch on GPU

Asked by sn7*_*710 · Tags: gpu, machine-learning, pytorch, onnx, onnxruntime

I am comparing inference times for the same input with PyTorch and ONNX Runtime, and I find that ONNX Runtime is actually slower on the GPU, while it is noticeably faster on the CPU.

I tried this on Windows 10.

  • ONNX Runtime installed from source - ONNX Runtime version: 1.11.0 (onnx version 1.10.1)
  • Python version - 3.8.12
  • CUDA/cuDNN version - CUDA 11.5, cuDNN 8.2
  • GPU model and memory - Quadro M2000M, 4 GB

Relevant code:

import torch
from torchvision import models
import onnxruntime    # to inference ONNX models, we use the ONNX Runtime
import onnx
import os
import time

batch_size = 1
total_samples = 1000
device = torch.device('cuda:0' if torch.cuda.is_available() else 'cpu')
    
def convert_to_onnx(resnet):
   resnet.eval()
   dummy_input = torch.randn(batch_size, 3, 224, 224, device=device)
   input_names = [ 'input' ]
   output_names = [ 'output' ]
   torch.onnx.export(resnet, 
               dummy_input,
               "resnet18.onnx",
               verbose=True,
               opset_version=13,
               input_names=input_names,
               output_names=output_names,
               export_params=True,
               do_constant_folding=True,
               dynamic_axes={
                  'input': {0: 'batch_size'},  # variable length axes
                  'output': {0: 'batch_size'}}        
               )
                  
def infer_pytorch(resnet):
   print('Pytorch Inference')
   print('==========================')
   print()

   x = torch.randn((batch_size, 3, 224, 224))
   x = x.to(device=device)

   latency = []
   for i in range(total_samples):
      t0 = time.time()
      resnet.eval()
      with torch.no_grad():
         out = resnet(x)
      latency.append(time.time() - t0)

   print('Number of runs:', len(latency))
   print("Average PyTorch {} Inference time = {} ms".format(device.type, format(sum(latency) * 1000 / len(latency), '.2f')))  

def to_numpy(tensor):
   return tensor.detach().cpu().numpy() if tensor.requires_grad else tensor.cpu().numpy()

def infer_onnxruntime():
   print('Onnxruntime Inference')
   print('==========================')
   print()

   onnx_model = onnx.load("resnet18.onnx")
   onnx.checker.check_model(onnx_model)

   # Input
   x = torch.randn((batch_size, 3, 224, 224))
   x = x.to(device=device)
   x = to_numpy(x)

   so = onnxruntime.SessionOptions()
   so.execution_mode = onnxruntime.ExecutionMode.ORT_SEQUENTIAL
   so.graph_optimization_level = onnxruntime.GraphOptimizationLevel.ORT_ENABLE_ALL
   
   exproviders = ['CUDAExecutionProvider', 'CPUExecutionProvider']

   model_onnx_path = os.path.join(".", "resnet18.onnx")
   ort_session = onnxruntime.InferenceSession(model_onnx_path, so, providers=exproviders)

   options = ort_session.get_provider_options()
   cuda_options = options['CUDAExecutionProvider']
   cuda_options['cudnn_conv_use_max_workspace'] = '1'
   ort_session.set_providers(['CUDAExecutionProvider'], [cuda_options])

   #IOBinding
   input_names = ort_session.get_inputs()[0].name
   output_names = ort_session.get_outputs()[0].name
   io_binding = ort_session.io_binding()

   io_binding.bind_cpu_input(input_names, x)
   # bind_output expects a device type string such as 'cuda', not a torch.device object
   io_binding.bind_output(output_names, device.type)
   
   #warm up run
   ort_session.run_with_iobinding(io_binding)
   ort_outs = io_binding.copy_outputs_to_cpu()

   latency = []

   for i in range(total_samples):
      t0 = time.time()
      ort_session.run_with_iobinding(io_binding)
      latency.append(time.time() - t0)
      ort_outs = io_binding.copy_outputs_to_cpu()
   print('Number of runs:', len(latency))
   print("Average onnxruntime {} Inference time = {} ms".format(device.type, format(sum(latency) * 1000 / len(latency), '.2f')))   

if __name__ == '__main__':
   torch.cuda.empty_cache()
   resnet = (models.resnet18(pretrained=True)).to(device=device)
   convert_to_onnx(resnet)
   infer_onnxruntime()
   infer_pytorch(resnet)

Output

If run on the CPU:

Average onnxruntime cpu Inference time = 18.48 ms
Average PyTorch cpu Inference time = 51.74 ms

However, if run on the GPU, I get:

Average onnxruntime cuda Inference time = 47.89 ms
Average PyTorch cuda Inference time = 8.94 ms

If I change the graph optimization level to onnxruntime.GraphOptimizationLevel.ORT_DISABLE_ALL, I see some improvement in the inference time on the GPU, but it is still slower than PyTorch.
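
For reference, this is the kind of change meant here: the optimization level is set on the SessionOptions object before the session is created (a minimal sketch reusing the names from the code above):

so = onnxruntime.SessionOptions()
# ORT_DISABLE_ALL turns off all graph optimizations; ORT_ENABLE_ALL (used above) applies all of them
so.graph_optimization_level = onnxruntime.GraphOptimizationLevel.ORT_DISABLE_ALL
ort_session = onnxruntime.InferenceSession("resnet18.onnx", so,
                                           providers=['CUDAExecutionProvider', 'CPUExecutionProvider'])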

I am using IO binding for the input tensor (a numpy array), and the model's nodes are placed on the GPU.

Also, during the ONNX Runtime run I print the device usage statistics and see this:

Using device: cuda:0
GPU Device name: Quadro M2000M
Memory Usage:
Allocated: 0.1 GB
Cached:    0.1 GB

So the GPU device is indeed being used.
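
The snippet that prints these statistics is not included above; a minimal sketch of how they can be produced with PyTorch's CUDA memory APIs (an assumption, not necessarily the author's exact code):

print('Using device:', device)
if device.type == 'cuda':
   print('GPU Device name:', torch.cuda.get_device_name(0))
   print('Memory Usage:')
   # memory_allocated/memory_reserved report PyTorch's own allocations, not ONNX Runtime's
   print('Allocated:', round(torch.cuda.memory_allocated(0) / 1024 ** 3, 1), 'GB')
   print('Cached:   ', round(torch.cuda.memory_reserved(0) / 1024 ** 3, 1), 'GB')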

I also used the resnet18.onnx model from the ONNX Model Zoo to check whether this was a conversion issue, but I got the same results.

What am I doing wrong or missing here?

Answer by Igo*_*gor:

When measuring inference time, move everything that only needs to run once, such as resnet.eval(), out of the timed loop.
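
A minimal sketch of the adjusted benchmark, showing only the timing part of infer_pytorch with the one-time call hoisted out:

resnet.eval()                 # one-time setup, no longer inside the timed loop
latency = []
with torch.no_grad():         # also hoisted out, it does not need to be re-entered per run
   for i in range(total_samples):
      t0 = time.time()
      out = resnet(x)
      latency.append(time.time() - t0)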

Please include the imports in your example:

import torch
from torchvision import models
import onnxruntime    # to inference ONNX models, we use the ONNX Runtime
import onnx
import os
import time

Running your example on the GPU only, I see roughly a 2x difference in timings, so the remaining speed gap is likely due to framework characteristics. For more details, look into ONNX conversion optimization.

Onnxruntime Inference
==========================

Number of runs: 1000
Average onnxruntime cuda Inference time = 4.76 ms
Pytorch Inference
==========================

Number of runs: 1000
Average PyTorch cuda Inference time = 2.27 ms