Memory leak when training a Keras model

Sly*_*ark 6 python memory checkpoint keras tensorflow

I'm new to Keras, TensorFlow, and Python, and I'm trying to build a model for personal use / future learning. I only just started with Python, and I came up with this code (with the help of videos and tutorials). My problem is that Python's memory usage slowly increases with every epoch, and even after building a new model. Once memory reaches 100%, training just stops with no error or warning. I don't know much yet, but the problem should be somewhere inside the loop (if I'm not mistaken). I know about

K.clear_session()

but either it doesn't remove the problem, or I don't know how to integrate it into my code. I have: Python 3.6.4, TensorFlow 2.0.0rc1 (CPU version), Keras 2.3.0.

Here is my code:

import pandas as pd
import os
import time
import tensorflow as tf
import numpy as np
import keras
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Dropout, LSTM, BatchNormalization
from tensorflow.keras.callbacks import TensorBoard, ModelCheckpoint

EPOCHS = 25
BATCH_SIZE = 32           

df = pd.read_csv("EntryData.csv", names=['1SH5', '1SHA', '1SA5', '1SAA', '1WH5', '1WHA',
                                         '2SA5', '2SAA', '2SH5', '2SHA', '2WA5', '2WAA',
                                         '3R1', '3R2', '3R3', '3R4', '3R5', '3R6',
                                         'Target'])

df_val = 14554 

validation_df = df[df.index > df_val]
df = df[df.index <= df_val]

train_x = df.drop(columns=['Target'])
train_y = df[['Target']]
validation_x = validation_df.drop(columns=['Target'])
validation_y = validation_df[['Target']]

train_x = np.asarray(train_x)
train_y = np.asarray(train_y)
validation_x = np.asarray(validation_x)
validation_y = np.asarray(validation_y)

train_x = train_x.reshape(train_x.shape[0], 1, train_x.shape[1])
validation_x = validation_x.reshape(validation_x.shape[0], 1, validation_x.shape[1])

dense_layers = [0, 1, 2]
layer_sizes = [32, 64, 128]
conv_layers = [1, 2, 3]

for dense_layer in dense_layers:
    for layer_size in layer_sizes:
        for conv_layer in conv_layers:
            NAME = "{}-conv-{}-nodes-{}-dense-{}".format(conv_layer, layer_size, 
                    dense_layer, int(time.time()))
            tensorboard = TensorBoard(log_dir="logs\{}".format(NAME))
            print(NAME)

            model = Sequential()
            model.add(LSTM(layer_size, input_shape=(train_x.shape[1:]), 
                                       return_sequences=True))
            model.add(Dropout(0.2))
            model.add(BatchNormalization())

            for l in range(conv_layer-1):
                model.add(LSTM(layer_size, return_sequences=True))
                model.add(Dropout(0.1))
                model.add(BatchNormalization())

            for l in range(dense_layer):
                model.add(Dense(layer_size, activation='relu'))
                model.add(Dropout(0.2))

            model.add(Dense(2, activation='softmax'))

            opt = tf.keras.optimizers.Adam(lr=0.001, decay=1e-6)

            # Compile model
            model.compile(loss='sparse_categorical_crossentropy',
                          optimizer=opt,
                          metrics=['accuracy'])

            # unique file name that will include the epoch 
            # and the validation acc for that epoch
            filepath = "RNN_Final.{epoch:02d}-{val_accuracy:.3f}"  
            checkpoint = ModelCheckpoint("models\{}.model".format(filepath, 
                         monitor='val_acc', verbose=0, save_best_only=True, 
                         mode='max')) # saves only the best ones

            # Train model
            history = model.fit(
                train_x, train_y,
                batch_size=BATCH_SIZE,
                epochs=EPOCHS,
                validation_data=(validation_x, validation_y),
                callbacks=[tensorboard, checkpoint])

# Score model
score = model.evaluate(validation_x, validation_y, verbose=2)
print('Test loss:', score[0])
print('Test accuracy:', score[1])
# Save model
model.save("models\{}".format(NAME))

Also, I don't know whether it's acceptable to ask two questions in one post (I don't want to spam the site with questions that anyone with any Python experience could answer in a minute), but I also have a problem with checkpoint saving. I only want to save the best-performing model (one model per NN spec - number of nodes/layers), but at the moment it saves after every epoch. If that's not appropriate here, I can open a separate question for it.
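A likely reason the checkpoint saves every epoch: in the posted code, the keyword arguments (`monitor`, `save_best_only`, `mode`, ...) ended up inside the `str.format()` call rather than the `ModelCheckpoint` call, and `str.format()` silently ignores keyword arguments it doesn't use. So `ModelCheckpoint` receives only the filename and keeps its default `save_best_only=False`. This can be demonstrated with plain Python, no Keras required:

```python
# The filename template from the question
filepath = "RNN_Final.{epoch:02d}-{val_accuracy:.3f}"

# Mis-parenthesized as in the question: the keywords go to format(),
# not to ModelCheckpoint, and format() silently swallows unused ones.
broken = "models/{}.model".format(filepath, monitor='val_acc',
                                  verbose=0, save_best_only=True, mode='max')

# format() only substituted the positional argument:
print(broken)  # models/RNN_Final.{epoch:02d}-{val_accuracy:.3f}.model

# The fix is to close format() before the ModelCheckpoint keywords
# (the ModelCheckpoint call is shown as a comment to keep this snippet
# runnable without Keras installed):
# checkpoint = ModelCheckpoint("models/{}.model".format(filepath),
#                              monitor='val_accuracy', verbose=0,
#                              save_best_only=True, mode='max')
```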

Thank you very much for any help.

Ove*_*gon 5

One cause of this problem is that a new loop iteration's model = Sequential() does not delete the previous model; it is still built within its TensorFlow graph scope, and every new model = Sequential() adds another lingering structure that eventually overflows memory. To ensure a model is completely destroyed, run the following once you're done with it:

import gc
import tensorflow as tf
from tensorflow.keras import backend as K

del model
gc.collect()
K.clear_session()
tf.compat.v1.reset_default_graph()  # the TF graph isn't the same as the Keras graph

gc is Python's garbage-collection module, used to clear the residual traces of model after del. K.clear_session() is the main call here: it clears the TensorFlow graph.
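The cleanup belongs at the end of each iteration of the hyperparameter loop, after model.fit(...) finishes. A minimal sketch of the pattern (the Keras/TF calls are shown as comments so the snippet runs without TensorFlow; a plain stand-in object and a weakref make the reclamation visible):

```python
import gc
import weakref

class FakeModel:           # stand-in for the Sequential model
    pass

for trial in range(3):
    model = FakeModel()    # model = Sequential(); ...; model.fit(...)
    ref = weakref.ref(model)

    # --- cleanup block from the answer, run once training is done ---
    del model              # drop the Python reference
    gc.collect()           # force collection of anything left behind
    # K.clear_session()                      # clears the Keras/TF graph
    # tf.compat.v1.reset_default_graph()     # TF graph != Keras graph

    assert ref() is None   # the object really was reclaimed
```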

Also, while your idea for model checkpointing, logging, and hyperparameter search is sound, the execution is flawed; in effect you will be testing only one hyperparameter combination for the entire nested loop you set up there. But that should go in a separate question.
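One way to make the search easier to get right is to flatten the nested loops with itertools.product, so it's obvious that building, fitting, evaluating, and saving all happen once per combination. A sketch (not the author's code), reusing the grids from the question:

```python
from itertools import product

dense_layers = [0, 1, 2]
layer_sizes = [32, 64, 128]
conv_layers = [1, 2, 3]

# One flat list of every (dense, size, conv) combination
combos = list(product(dense_layers, layer_sizes, conv_layers))
print(len(combos))  # 27 combinations in total

for dense_layer, layer_size, conv_layer in combos:
    name = "{}-conv-{}-nodes-{}-dense".format(conv_layer, layer_size, dense_layer)
    # build / fit / evaluate / save the model for this combination here,
    # then run the cleanup block (del model; gc.collect(); ...)
    # before the next iteration
```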


UPDATE: I just ran into the same problem in a fully correctly configured environment; the most likely conclusion is that it's a bug, and a prime suspect is eager execution. To work around it, use

tf.compat.v1.disable_eager_execution() # right after `import tensorflow as tf`

to switch to graph mode, which can also speed things up significantly. Also see the updated clearing code above.