Tensorflow crashes with CUDNN_STATUS_ALLOC_FAILED

Question

Gno*_*ske 1 python neural-network python-3.x tensorflow

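I have been searching the web for hours without any result, so I thought I would ask here.

I am trying to build a self-driving car by following Sentdex's tutorial, but when I run the model I get a number of fatal errors. I have searched all over the internet for a solution, and many people seem to have the same problem. However, none of the solutions I found (including this Stack-post) worked for me.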

Here is my software:

Tensorflow: 1.5, GPU version
CUDA: 9.0, with patches
CUDnn: 7
Windows 10 Pro
Python 3.6

Hardware:

Nvidia 1070ti, with the latest drivers
Intel i5 7600K

Here is the crash log:

2018-02-04 16:29:33.606903: E C:\tf_jenkins\workspace\rel-win\M\windows-gpu\PY\36\tensorflow\stream_executor\cuda\cuda_blas.cc:444] failed to create cublas handle: CUBLAS_STATUS_ALLOC_FAILED
2018-02-04 16:29:33.608872: E C:\tf_jenkins\workspace\rel-win\M\windows-gpu\PY\36\tensorflow\stream_executor\cuda\cuda_blas.cc:444] failed to create cublas handle: CUBLAS_STATUS_ALLOC_FAILED
2018-02-04 16:29:33.609308: E C:\tf_jenkins\workspace\rel-win\M\windows-gpu\PY\36\tensorflow\stream_executor\cuda\cuda_blas.cc:444] failed to create cublas handle: CUBLAS_STATUS_ALLOC_FAILED
2018-02-04 16:29:35.145249: E C:\tf_jenkins\workspace\rel-win\M\windows-gpu\PY\36\tensorflow\stream_executor\cuda\cuda_dnn.cc:385] could not create cudnn handle: CUDNN_STATUS_ALLOC_FAILED
2018-02-04 16:29:35.145563: E C:\tf_jenkins\workspace\rel-win\M\windows-gpu\PY\36\tensorflow\stream_executor\cuda\cuda_dnn.cc:352] could not destroy cudnn handle: CUDNN_STATUS_BAD_PARAM
2018-02-04 16:29:35.149896: F C:\tf_jenkins\workspace\rel-win\M\windows-gpu\PY\36\tensorflow\core\kernels\conv_ops.cc:717] Check failed: stream->parent()->GetConvolveAlgorithms( conv_parameters.ShouldIncludeWinogradNonfusedAlgo<T>(), &algorithms)
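
To see at a glance which library calls failed and with which status, the status names can be counted from the log text. This is only a small sketch: the helper name `count_statuses` is made up for illustration, and the sample is a shortened excerpt of the log above with timestamps and file paths dropped.

```python
import re
from collections import Counter

# Matches status names such as CUBLAS_STATUS_ALLOC_FAILED or CUDNN_STATUS_BAD_PARAM.
STATUS_RE = re.compile(r"\b(?:CUBLAS|CUDNN)_STATUS_[A-Z_]+\b")


def count_statuses(log_text):
    """Count how often each cuBLAS/cuDNN status name appears in log_text."""
    return Counter(match.group(0) for match in STATUS_RE.finditer(log_text))


# Shortened excerpt of the crash log (timestamps and paths removed).
sample = """
cuda_blas.cc:444] failed to create cublas handle: CUBLAS_STATUS_ALLOC_FAILED
cuda_blas.cc:444] failed to create cublas handle: CUBLAS_STATUS_ALLOC_FAILED
cuda_blas.cc:444] failed to create cublas handle: CUBLAS_STATUS_ALLOC_FAILED
cuda_dnn.cc:385] could not create cudnn handle: CUDNN_STATUS_ALLOC_FAILED
cuda_dnn.cc:352] could not destroy cudnn handle: CUDNN_STATUS_BAD_PARAM
"""

counts = count_statuses(sample)
for status, n in counts.most_common():
    print(status, n)
```

Both cuBLAS and cuDNN report ALLOC_FAILED while creating their handles, so the failure happens at allocation time rather than inside the convolution itself.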

Here is my code:

    import tensorflow as tf
    import numpy as np
    import cv2
    import time
    from PIL import ImageGrab
    from getkeys import key_check
    from alexnet import alexnet
    import os
    from sendKeys import PressKey, ReleaseKey, W,A,S,D,Sp

    import random

    WIDTH = 80
    HEIGHT = 60
    LR = 1e-3
    EPOCHS = 10
    MODEL_NAME = 'DiRT-AI-Driver-{}-{}-{}-epochs.model'.format(LR, 'alexnetv2', EPOCHS)

    def straight():
        PressKey(W)
        ReleaseKey(A)
        ReleaseKey(S)
        ReleaseKey(D)
        ReleaseKey(Sp)
    def left():
        PressKey(A)
        ReleaseKey(W)
        ReleaseKey(S)
        ReleaseKey(D)
        ReleaseKey(Sp)
    def right():
        PressKey(D)
        ReleaseKey(A)
        ReleaseKey(S)
        ReleaseKey(W)
        ReleaseKey(Sp)
    def brake():
        PressKey(S)
        ReleaseKey(A)
        ReleaseKey(W)
        ReleaseKey(D)
        ReleaseKey(Sp)
    def handbrake():
        PressKey(Sp)
        ReleaseKey(A)
        ReleaseKey(S)
        ReleaseKey(D)
        ReleaseKey(W)

    model = alexnet(WIDTH, HEIGHT, LR)
    model.load(MODEL_NAME)


    def main():
        last_time = time.time()
        # Short countdown before the loop starts.
        for i in list(range(4))[::-1]:
            print(i+1)
            time.sleep(1)

        paused = False
        while True:
            if not paused:
                # Grab the screen region, convert it to grayscale and
                # shrink it to the network's input size.
                screen = np.array(ImageGrab.grab(bbox=(0,40,1024,768)))
                screen = cv2.cvtColor(screen,cv2.COLOR_BGR2GRAY)
                screen = cv2.resize(screen,(80,60))
                print('Loop took {} seconds'.format(time.time()-last_time))
                last_time = time.time()
                print('took time')
                prediction = model.predict([screen.reshape(WIDTH,HEIGHT,1)])[0]
                print('predicted')
                moves = list(np.around(prediction))
                print('got moves')
                print(moves,prediction)

                if moves == [1,0,0,0,0]:
                    straight()
                elif moves == [0,1,0,0,0]:
                    left()
                elif moves == [0,0,1,0,0]:
                    brake()
                elif moves == [0,0,0,1,0]:
                    right()
                elif moves == [0,0,0,0,1]:
                    handbrake()

            keys = key_check()

            # 'T' toggles pause; all keys are released when pausing.
            if 'T' in keys:
                if paused:
                    paused = False
                    time.sleep(1)
                else:
                    paused = True
                    ReleaseKey(W)
                    ReleaseKey(A)
                    ReleaseKey(S)
                    ReleaseKey(D)
                    ReleaseKey(Sp)
                    time.sleep(1)


    main()

I found that the line that crashes Python and produces the first three errors is this one:

prediction = model.predict([screen.reshape(WIDTH,HEIGHT,1)])[0]

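While the code is running, CPU usage goes up to 100%, which suggests that something is seriously wrong. GPU usage is around 40-50%.

I have already tried Tensorflow 1.2 and 1.3 as well as CUDA 8, with no luck. When installing CUDA I do not install the bundled drivers, because they are too old for my GPU. I have also tried different CUDnn versions, with no luck.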

Answer 1

sta*_*iet 8

Your GPU has probably run out of memory.

If you are using TensorFlow 1.x:

Option 1) Set allow_growth to true.

import tensorflow as tf
config = tf.ConfigProto()
config.gpu_options.allow_growth=True
sess = tf.Session(config=config)

Option 2) Set the memory fraction.

# change the memory fraction as you want

import tensorflow as tf
gpu_options = tf.GPUOptions(per_process_gpu_memory_fraction=0.3)
sess = tf.Session(config=tf.ConfigProto(gpu_options=gpu_options))
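
To pick a value for per_process_gpu_memory_fraction, it can help to convert between a fraction and a memory budget. The sketch below is plain arithmetic and does not touch TensorFlow; the helper `memory_fraction` and the 8 GB total are assumptions for illustration, so substitute the amount of memory your own card reports.

```python
def memory_fraction(budget_mb, total_mb):
    """Return the fraction of total GPU memory that equals budget_mb."""
    if total_mb <= 0:
        raise ValueError("total_mb must be positive")
    if not 0 < budget_mb <= total_mb:
        raise ValueError("budget_mb must be in (0, total_mb]")
    return budget_mb / total_mb


# Assumption for illustration: a card with 8 GB (8192 MB) of memory.
TOTAL_MB = 8192

# The 0.3 used above would reserve roughly this many MB on such a card.
reserved_mb = round(0.3 * TOTAL_MB)

# Going the other way: the fraction that corresponds to a 2 GB budget.
fraction = memory_fraction(2048, TOTAL_MB)

print(reserved_mb, fraction)
```

The resulting fraction is what would be passed as per_process_gpu_memory_fraction in the snippet above.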

If you are using TensorFlow 2.x:

Option 1) Enable memory growth with set_memory_growth.

# Currently the ‘memory growth’ option should be the same for all GPUs.
# You should set the ‘memory growth’ option before initializing GPUs.

import tensorflow as tf
gpus = tf.config.experimental.list_physical_devices('GPU')
if gpus:
  try:
    for gpu in gpus:
      tf.config.experimental.set_memory_growth(gpu, True)
  except RuntimeError as e:
    print(e)
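
The TensorFlow GPU guide also describes an environment variable, TF_FORCE_GPU_ALLOW_GROWTH, that turns on the same memory-growth behaviour without changing the model code. This is not part of the answer above. A minimal sketch, assuming the variable is set before TensorFlow is imported in the same process:

```python
import os

# Must be in place before TensorFlow initializes the GPU, so set it
# before the first "import tensorflow" in this process.
os.environ["TF_FORCE_GPU_ALLOW_GROWTH"] = "true"

# import tensorflow as tf  # import only after the variable is set

print(os.environ["TF_FORCE_GPU_ALLOW_GROWTH"])
```

Setting the variable in the shell or in the system environment settings before starting Python has the same effect.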

Option 2) Set memory_limit as needed. Just change the index into gpus and the value of memory_limit in the code below.

import tensorflow as tf
gpus = tf.config.experimental.list_physical_devices('GPU')
if gpus:
  try:
    tf.config.experimental.set_virtual_device_configuration(gpus[0], [tf.config.experimental.VirtualDeviceConfiguration(memory_limit=1024)])
  except RuntimeError as e:
    print(e)

Answer 2

Axe*_*uig 7

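In my case, the problem occurred because another Python console that had imported tensorflow was still running. Closing it solved the problem.

I am on Windows 10, and the main errors were:

failed to create cublas handle: CUBLAS_STATUS_ALLOC_FAILED

could not create cudnn handle: CUDNN_STATUS_ALLOC_FAILED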
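
One way to check for such a leftover process is to ask nvidia-smi which processes currently hold the GPU. The sketch below assumes nvidia-smi is on the PATH and Python 3.7 or newer; the function names are made up for illustration. Depending on the driver mode, the memory column may not be reported, so the value is kept as a raw string.

```python
import shutil
import subprocess


def parse_compute_apps(csv_text):
    """Parse 'pid, process_name, used_memory' rows into dictionaries."""
    apps = []
    for line in csv_text.splitlines():
        if line.count(",") < 2:
            continue  # not a data row
        # pid is the first field and memory the last, so a process
        # path that itself contains commas is still handled.
        pid, rest = line.split(",", 1)
        name, used = rest.rsplit(",", 1)
        if not pid.strip().isdigit():
            continue
        apps.append({
            "pid": int(pid.strip()),
            "name": name.strip(),
            "used_memory": used.strip(),
        })
    return apps


def list_gpu_processes():
    """Return the processes nvidia-smi reports, or [] if it is unavailable."""
    if shutil.which("nvidia-smi") is None:
        return []
    result = subprocess.run(
        ["nvidia-smi",
         "--query-compute-apps=pid,process_name,used_memory",
         "--format=csv,noheader"],
        capture_output=True, text=True, check=False,
    )
    return parse_compute_apps(result.stdout)


if __name__ == "__main__":
    for app in list_gpu_processes():
        print(app["pid"], app["name"], app["used_memory"])
```

If a second python.exe shows up in the list, closing that console frees its GPU memory.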

Answer 3

小智 1

Try adding the CUDA path to your environment variables. The problem seems to be with CUDA.

Set the CUDA path in ~/.bashrc (edit it with nano):

# CUDA Nvidia path
export LD_LIBRARY_PATH="$LD_LIBRARY_PATH:/usr/local/cuda/lib64"
export CUDA_HOME=/usr/local/cuda

Archived: 7 years, 11 months ago
Viewed: 6548 times
Last activity: 6 years, 8 months ago