Why does a neural network predict incorrectly on its own training data?

sir*_*jay 26 python machine-learning neural-network keras tensorflow

I made an LSTM (RNN) neural network with supervised learning for stock data prediction. The problem is: why does it predict incorrectly on its own training data? (Note: reproducible example below.)

I created a simple model to predict the stock price for the next 5 days:

model = Sequential()
model.add(LSTM(32, activation='sigmoid', input_shape=(x_train.shape[1], x_train.shape[2])))
model.add(Dense(y_train.shape[1]))
model.compile(optimizer='adam', loss='mse')

es = EarlyStopping(monitor='val_loss', patience=3, restore_best_weights=True)
model.fit(x_train, y_train, batch_size=64, epochs=25, validation_data=(x_test, y_test), callbacks=[es])

The correct results are in y_test (5 values), so the model trains, looking back over the previous 90 days, and then restores the weights from the best result (val_loss=0.0030) with patience=3:

Train on 396 samples, validate on 1 samples
Epoch 1/25
396/396 [==============================] - 1s 2ms/step - loss: 0.1322 - val_loss: 0.0299
Epoch 2/25
396/396 [==============================] - 0s 402us/step - loss: 0.0478 - val_loss: 0.0129
Epoch 3/25
396/396 [==============================] - 0s 397us/step - loss: 0.0385 - val_loss: 0.0178
Epoch 4/25
396/396 [==============================] - 0s 399us/step - loss: 0.0398 - val_loss: 0.0078
Epoch 5/25
396/396 [==============================] - 0s 391us/step - loss: 0.0343 - val_loss: 0.0030
Epoch 6/25
396/396 [==============================] - 0s 391us/step - loss: 0.0318 - val_loss: 0.0047
Epoch 7/25
396/396 [==============================] - 0s 389us/step - loss: 0.0308 - val_loss: 0.0043
Epoch 8/25
396/396 [==============================] - 0s 393us/step - loss: 0.0292 - val_loss: 0.0056

The prediction results are great, aren't they?

[chart: predicted vs. real closing prices for April 17–23]

That is because the algorithm restored the best weights from epoch #5. OK, now let's save this model to an .h5 file, move 10 days back and predict the last 5 days (in the first example we built and validated the model on April 17–23, weekend days off included; now let's test on April 2–8). Result:

[chart: predicted vs. real closing prices for April 2–8]

It shows an absolutely wrong direction. As we can see, that is because the model was trained, and epoch #5 was the best for the April 17–23 validation set, but not for April 2–8. If I train more and play with which epoch to pick, whatever I do, there are always many time intervals in the past where the prediction is wrong.
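For reference, the back-test above can be sketched roughly as follows. It relies on the variables and helpers defined in the reproducible example further down (df_normalized, look_back, num_prediction, transform_predicted); the full script achieves the same thing through the new_s_h5 flag and df.drop(df.tail(N).index).

from keras.models import load_model
import numpy as np

model = load_model('s.h5')                   # weights saved after the first training run
shift = 10                                   # move the evaluation window 10 days back
dataset_past = df_normalized[:-shift]        # cut off the most recent 10 days
x_past = np.expand_dims(dataset_past[-look_back:], axis=0)     # shape (1, 90, n_features)
predicted_past = transform_predicted(model.predict(x_past))    # 5 predicted closing prices
real_past = df.iloc[-shift:-shift + num_prediction, 0].values  # true values to compare against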

Why does the model show wrong results on its own training data? I trained on this data, so it should remember how to predict on this part of the set, yet it predicts incorrectly. I have also tried:

  • Using a large dataset with 50k+ rows and 20 years of stock prices, adding more or fewer features
  • Creating different types of models, e.g. adding more hidden layers, different batch_sizes, different layer activations, dropout, batchnormalization
  • Creating a custom EarlyStopping callback that takes the average val_loss over many validation datasets and picks the best one

Maybe I am missing something? What can I improve?

Here is a very simple and reproducible example. yfinance downloads the S&P 500 stock data.

"""python 3.7.7
tensorflow 2.1.0
keras 2.3.1"""


import numpy as np
import pandas as pd
from keras.callbacks import EarlyStopping, Callback
from keras.models import Model, Sequential, load_model
from keras.layers import Dense, Dropout, LSTM, BatchNormalization
from sklearn.preprocessing import MinMaxScaler
import plotly.graph_objects as go
import yfinance as yf
np.random.seed(4)


num_prediction = 5
look_back = 90
new_s_h5 = True # change it to False when you created model and want test on other past dates


df = yf.download(tickers="^GSPC", start='2018-05-06', end='2020-04-24', interval="1d")
df = df.filter(['Close', 'High', 'Low', 'Volume'])  # column 0 ('Close') is the prediction target

# drop the last N days to validate the saved model on the past
# (on the second run set tail(10) here and new_s_h5 = False above)
df.drop(df.tail(0).index, inplace=True)
print(df)


class EarlyStoppingCust(Callback):
    def __init__(self, patience=0, verbose=0, validation_sets=None, restore_best_weights=False):
        super(EarlyStoppingCust, self).__init__()
        self.patience = patience
        self.verbose = verbose
        self.wait = 0
        self.stopped_epoch = 0
        self.restore_best_weights = restore_best_weights
        self.best_weights = None
        self.validation_sets = validation_sets

    def on_train_begin(self, logs=None):
        self.wait = 0
        self.stopped_epoch = 0
        self.best_avg_loss = (np.Inf, 0)

    def on_epoch_end(self, epoch, logs=None):
        loss_ = 0
        for i, validation_set in enumerate(self.validation_sets):
            predicted = self.model.predict(validation_set[0])
            loss = self.model.evaluate(validation_set[0], validation_set[1], verbose = 0)
            loss_ += loss
            if self.verbose > 0:
                print('val' + str(i + 1) + '_loss: %.5f' % loss)

        avg_loss = loss_ / len(self.validation_sets)
        print('avg_loss: %.5f' % avg_loss)

        if self.best_avg_loss[0] > avg_loss:
            self.best_avg_loss = (avg_loss, epoch + 1)
            self.wait = 0
            if self.restore_best_weights:
                print('new best epoch = %d' % (epoch + 1))
                self.best_weights = self.model.get_weights()
        else:
            self.wait += 1
            if self.wait >= self.patience or self.params['epochs'] == epoch + 1:
                self.stopped_epoch = epoch
                self.model.stop_training = True
                if self.restore_best_weights:
                    if self.verbose > 0:
                        print('Restoring model weights from the end of the best epoch')
                    self.model.set_weights(self.best_weights)

    def on_train_end(self, logs=None):
        print('best_avg_loss: %.5f (#%d)' % (self.best_avg_loss[0], self.best_avg_loss[1]))


def multivariate_data(dataset, target, start_index, end_index, history_size, target_size, step, single_step=False):
    # slide a history_size window over the dataset and collect the next target_size target values as labels
    data = []
    labels = []
    start_index = start_index + history_size
    if end_index is None:
        end_index = len(dataset) - target_size
    for i in range(start_index, end_index):
        indices = range(i-history_size, i, step)
        data.append(dataset[indices])
        if single_step:
            labels.append(target[i+target_size])
        else:
            labels.append(target[i:i+target_size])
    return np.array(data), np.array(labels)


def transform_predicted(pr):
    # pad the predicted 'Close' column with zeros for the remaining features,
    # invert the MinMaxScaler, then keep only column 0 (the price)
    pr = pr.reshape(pr.shape[1], -1)
    z = np.zeros((pr.shape[0], x_train.shape[2] - 1), dtype=pr.dtype)
    pr = np.append(pr, z, axis=1)
    pr = scaler.inverse_transform(pr)
    pr = pr[:, 0]
    return pr


step = 1

# creating datasets with look back
scaler = MinMaxScaler()
df_normalized = scaler.fit_transform(df.values)
dataset = df_normalized[:-num_prediction]
x_train, y_train = multivariate_data(dataset, dataset[:, 0], 0,len(dataset) - num_prediction + 1, look_back, num_prediction, step)
indices = range(len(dataset)-look_back, len(dataset), step)
x_test = np.array(dataset[indices])
x_test = np.expand_dims(x_test, axis=0)
y_test = np.expand_dims(df_normalized[-num_prediction:, 0], axis=0)

# creating past datasets to validate with EarlyStoppingCust
number_validates = 50
step_past = 5
validation_sets = [(x_test, y_test)]
for i in range(1, number_validates * step_past + 1, step_past):
    indices = range(len(dataset)-look_back-i, len(dataset)-i, step)
    x_t = np.array(dataset[indices])
    x_t = np.expand_dims(x_t, axis=0)
    y_t = np.expand_dims(df_normalized[-num_prediction-i:len(df_normalized)-i, 0], axis=0)
    validation_sets.append((x_t, y_t))


if new_s_h5:
    model = Sequential()
    model.add(LSTM(32, return_sequences=False, activation = 'sigmoid', input_shape=(x_train.shape[1], x_train.shape[2])))
    # model.add(Dropout(0.2))
    # model.add(BatchNormalization())
    # model.add(LSTM(units = 16))
    model.add(Dense(y_train.shape[1]))
    model.compile(optimizer = 'adam', loss = 'mse')

    # EarlyStoppingCust is custom callback to validate each validation_sets and get average
    # it takes epoch with best "best_avg" value
    # es = EarlyStoppingCust(patience = 3, restore_best_weights = True, validation_sets = validation_sets, verbose = 1)

    # or there is keras extension with built-in EarlyStopping, but it validates only 1 set that you pass through fit()
    es = EarlyStopping(monitor = 'val_loss', patience = 3, restore_best_weights = True)

    model.fit(x_train, y_train, batch_size = 64, epochs = 25, shuffle = True, validation_data = (x_test, y_test), callbacks = [es])
    model.save('s.h5')
else:
    model = load_model('s.h5')



predicted = model.predict(x_test)
predicted = transform_predicted(predicted)
print('predicted', predicted)
print('real', df.iloc[-num_prediction:, 0].values)
print('val_loss: %.5f' % (model.evaluate(x_test, y_test, verbose=0)))


fig = go.Figure()
fig.add_trace(go.Scatter(
    x = df.index[-60:],
    y = df.iloc[-60:,0],
    mode='lines+markers',
    name='real',
    line=dict(color='#ff9800', width=1)
))
fig.add_trace(go.Scatter(
    x = df.index[-num_prediction:],
    y = predicted,
    mode='lines+markers',
    name='predict',
    line=dict(color='#2196f3', width=1)
))
fig.update_layout(template='plotly_dark', hovermode='x', spikedistance=-1, hoverlabel=dict(font_size=16))
fig.update_xaxes(showspikes=True)
fig.update_yaxes(showspikes=True)
fig.show()

Ach*_*age 12

The OP makes an interesting observation. Let me simplify the original question as follows.

If a model is trained on a particular time series, why can't it reconstruct the time-series data it was already trained on?

Well, the answer is embedded in the training progress itself. Since EarlyStopping is used here to avoid overfitting, the best model is saved at epoch=5, where val_loss=0.0030, as the OP mentions. At that point the training loss equals 0.0343, i.e. the training RMSE is 0.185. Since the dataset is scaled with MinMaxScaler, we need to undo the scaling of the RMSE to understand what is going on.

The minimum and maximum values of the time series turn out to be 2290 and 3380. Therefore, a training RMSE of 0.185 means that, even for the training set, the predicted values may on average differ from the true values by approximately 0.185 * (3380 - 2290), i.e. ~200 units.
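A quick back-of-the-envelope check of that number (the loss value comes from the training log above, and 2290/3380 are the approximate minimum and maximum of the series in this date window):

import numpy as np

train_mse = 0.0343                       # training loss at the restored epoch #5
rmse_scaled = np.sqrt(train_mse)         # ~0.185 in MinMaxScaler units
low, high = 2290, 3380                   # approximate min/max of the close prices
rmse_price = rmse_scaled * (high - low)  # undo the min-max scaling
print(round(rmse_price))                 # ~202 index points of average error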

This explains why there is such a big difference even when predicting the training data itself at an earlier time step.

What should I do to model the training data perfectly?

I asked this question myself. The simple answer is: get the training loss close to 0, in other words, overfit the model.

After some training, I realized that a model with only 1 LSTM layer of 32 units is not complex enough to reconstruct the training data. So I added another LSTM layer as follows.

model = Sequential()
model.add(LSTM(32, return_sequences=True, activation = 'sigmoid', input_shape=(x_train.shape[1], x_train.shape[2])))
# model.add(Dropout(0.2))
# model.add(BatchNormalization())
model.add(LSTM(units = 64, return_sequences=False))
model.add(Dense(y_train.shape[1]))
model.compile(optimizer = 'adam', loss = 'mse')

And the model is trained for 1000 epochs without EarlyStopping:

model.fit(x_train, y_train, batch_size = 64, epochs = 1000, shuffle = True, validation_data = (x_test, y_test))

At the end of the 1000th epoch we have a training loss of 0.00047, much lower than the training loss in your case. So we would expect the model to reconstruct the training data better. Here is the prediction plot for April 2–8.

[chart: predicted vs. real closing prices for April 2–8 after overfitting]

A final note:

Training on a particular dataset does not necessarily mean that the model should be able to reconstruct the training data perfectly. Especially when methods such as early stopping, regularization and dropout are introduced to avoid overfitting, the model tends to be more generalizable rather than memorizing the training data.