TensorFlow: Failed to create session on the server

Pre*_*rko 2 python tensorflow

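I developed a model in Keras and trained it many times. Once, I forcibly stopped the training, and ever since I have been getting the following error: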

Traceback (most recent call last):
  File "inception_resnet.py", line 246, in <module>
    callbacks=[checkpoint, saveEpochNumber])   ##
  File "/home/eh0/E27890/anaconda3/lib/python3.5/site-packages/keras/legacy/interfaces.py", line 87, in wrapper
    return func(*args, **kwargs)
  File "/home/eh0/E27890/anaconda3/lib/python3.5/site-packages/keras/engine/training.py", line 2042, in fit_generator
    class_weight=class_weight)
  File "/home/eh0/E27890/anaconda3/lib/python3.5/site-packages/keras/engine/training.py", line 1762, in train_on_batch
    outputs = self.train_function(ins)
  File "/home/eh0/E27890/anaconda3/lib/python3.5/site-packages/keras/backend/tensorflow_backend.py", line 2270, in __call__
    session = get_session()
  File "/home/eh0/E27890/anaconda3/lib/python3.5/site-packages/keras/backend/tensorflow_backend.py", line 163, in get_session
    _SESSION = tf.Session(config=config)
  File "/home/eh0/E27890/anaconda3/lib/python3.5/site-packages/tensorflow/python/client/session.py", line 1486, in __init__
    super(Session, self).__init__(target, graph, config=config)
  File "/home/eh0/E27890/anaconda3/lib/python3.5/site-packages/tensorflow/python/client/session.py", line 621, in __init__
    self._session = tf_session.TF_NewDeprecatedSession(opts, status)
  File "/home/eh0/E27890/anaconda3/lib/python3.5/contextlib.py", line 66, in __exit__
    next(self.gen)
  File "/home/eh0/E27890/anaconda3/lib/python3.5/site-packages/tensorflow/python/framework/errors_impl.py", line 466, in raise_exception_on_not_ok_status
    pywrap_tensorflow.TF_GetCode(status))
tensorflow.python.framework.errors_impl.InternalError: Failed to create session.

So the actual error is:

tensorflow.python.framework.errors_impl.InternalError: Failed to create session.

Most likely, the GPU memory is still occupied. I cannot even create a simple TensorFlow session.

I have seen the answer here, but when I run the following command in the terminal

export CUDA_VISIBLE_DEVICES=''

the model starts training, but without GPU acceleration.
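To make the effect of that variable concrete, here is a minimal Python sketch of the same setting applied from inside the script. The helper name `select_gpus` is hypothetical and not part of TensorFlow or Keras. An empty value hides every GPU from CUDA, which is why training falls back to the CPU; a comma-separated list of indices exposes only those devices. The variable has to be set before TensorFlow initializes CUDA, so in practice before `import tensorflow`.

```python
import os


def select_gpus(indices):
    """Expose only the given GPU indices to CUDA.

    Hypothetical helper for illustration. An empty list hides every GPU,
    which matches `export CUDA_VISIBLE_DEVICES=''` in the shell.
    """
    value = ",".join(str(i) for i in indices)
    os.environ["CUDA_VISIBLE_DEVICES"] = value
    return value


# Hide all GPUs: TensorFlow will then run on the CPU only.
select_gpus([])

# Expose only GPU 1, for example when GPU 0 is still held by a stale process.
# (Assumes the machine has more than one GPU.)
select_gpus([1])

# Import tensorflow / keras only after this point so the setting takes effect.
```

This only changes which devices the process can see. It does not free memory that a leftover process is still holding.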

Also, I train my model on a server where I do not have root access, so I can neither restart the server nor clear the GPU memory with root privileges. What is the solution now?

Pre*_*rko 6

I found the solution in a comment on this question.

nvidia-smi -q

This lists all processes (with their PIDs) that are occupying GPU memory. I killed them one by one with

kill -9 PID

Now everything runs fine.
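The manual steps can also be scripted. The following is only a sketch, not what I actually ran: it assumes a Linux server with `/proc`, and an `nvidia-smi` build that supports the `--query-compute-apps` option (this may depend on the driver version). All function names are hypothetical. It only touches processes owned by the current user, which is all you can kill without root anyway.

```python
import os
import signal
import subprocess


def parse_compute_apps(csv_text):
    """Parse 'pid, process_name, used_memory' rows into (pid, name, mem) tuples."""
    rows = []
    for line in csv_text.strip().splitlines():
        try:
            pid, rest = line.split(",", 1)
            name, mem = rest.rsplit(",", 1)
        except ValueError:
            continue  # not a 'pid, name, memory' row
        pid = pid.strip()
        if pid.isdigit():
            rows.append((int(pid), name.strip(), mem.strip()))
    return rows


def query_gpu_processes():
    """Ask nvidia-smi for the compute processes currently holding GPU memory."""
    out = subprocess.check_output(
        ["nvidia-smi",
         "--query-compute-apps=pid,process_name,used_memory",
         "--format=csv,noheader,nounits"],
        universal_newlines=True,
    )
    return parse_compute_apps(out)


def owned_by_me(pid):
    """True if the process belongs to the current user (relies on Linux /proc)."""
    try:
        return os.stat("/proc/%d" % pid).st_uid == os.getuid()
    except OSError:
        return False


def kill_my_gpu_processes(sig=signal.SIGTERM):
    """Signal every GPU process owned by the current user, except this one."""
    for pid, name, mem in query_gpu_processes():
        if pid != os.getpid() and owned_by_me(pid):
            print("sending signal %d to %d (%s, %s MiB)" % (sig, pid, name, mem))
            os.kill(pid, sig)


# Usage on the server (not executed here):
#   kill_my_gpu_processes()                 # polite SIGTERM first
#   kill_my_gpu_processes(signal.SIGKILL)   # equivalent of `kill -9 PID`
```

Trying `SIGTERM` before `SIGKILL` gives a process the chance to shut down cleanly. `kill -9` is the fallback when it does not react.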