How to train an LSTM using multiple datasets?

Question

Mat*_*son 8 machine-learning python-3.x tensorflow

Despite my best efforts, I have not been able to find an answer to this question.

I'm looking to train an LSTM network using Python 3.6 and TensorFlow on multiple .csv files/datasets, for example the historical stock data of several companies.

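The reason is that I want the model to generalize across a variety of price ranges, rather than training a separate model on each dataset. How would I go about doing this?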

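I can't just append one dataset to another to create one big dataset, because during the train/test split the prices might jump from $2 to $200, depending on the stock data and where the datasets are stitched together.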

What is the best practice for doing something like this?

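1. Simply loop over the .csv files and call the .fit function on each one in turn for a set number of epochs (updating the weights), using early stopping once the best loss has been found? (I understand how to do this one now.)
2. Is there a way to create a generator that yields a different x_train and y_train tuple from each .csv, fits the model on each tuple, and then has a training checkpoint once one tuple has been sampled from every .csv file? My thinking is that the model should get a chance to sample a portion of every dataset before it completes an epoch.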
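
Option 1 could be sketched roughly as follows. This is a minimal illustration under stated assumptions, not a tested recipe: `make_windows` and `fit_on_each_dataset` are hypothetical helper names, each dataset is assumed to be a 2-D NumPy array whose last column is the target, and `model` is assumed to be any object with a Keras-style `fit(x, y, epochs=..., verbose=...)` method, such as a compiled `tf.keras` model.

```python
import numpy as np


def make_windows(data, window):
    """Slice one dataset into (samples, window, n_features) inputs and next-step targets.

    Assumes `data` is a 2-D array whose last column is the value to predict.
    """
    xs, ys = [], []
    for i in range(len(data) - window):
        xs.append(data[i:i + window, :-1])  # `window` consecutive rows of features
        ys.append(data[i + window, -1])     # target at the step right after the window
    return np.array(xs), np.array(ys)


def fit_on_each_dataset(model, datasets, window=20, epochs_per_dataset=5, rounds=3):
    """Visit every dataset in turn, several times, so no single file dominates."""
    for _ in range(rounds):
        for data in datasets:
            # Windows are built per file, so none spans two companies.
            x, y = make_windows(data, window)
            model.fit(x, y, epochs=epochs_per_dataset, verbose=0)
    return model
```

With `tf.keras`, an `EarlyStopping` callback could also be passed to `fit`; it is left out here to keep the sketch free of a TensorFlow dependency.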

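Example: let's say I want to use a 20-period look-back/window size to predict t+1, and I have 5 .csv files to train on. The generator would (ideally) load all the datasets into memory, then randomly sample 20 rows from the first .csv file and fit them to the model, then sample another 20 rows from the second .csv file and fit those, and so on. Once all 5 files have been sampled, it would hit a checkpoint to evaluate the loss, then move on to the next epoch and do it all over again.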
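
The generator described in this example might look something like the sketch below. It is only an illustration: `round_robin_windows` is a hypothetical name, and each dataset is assumed to be a 2-D NumPy array, already loaded into memory, with more than `window` rows and the target in its last column.

```python
import numpy as np


def round_robin_windows(datasets, window=20, seed=None):
    """Yield (x, y) pairs, one random window from each dataset per cycle."""
    rng = np.random.default_rng(seed)
    while True:
        for data in datasets:
            # The upper bound is exclusive, so start + window is always a valid row.
            start = rng.integers(0, len(data) - window)
            x = data[start:start + window, :-1]  # `window` rows of features
            y = data[start + window, -1]         # target one step after the window
            # Add a leading batch axis: x is (1, window, n_features), y is (1,)
            yield x[np.newaxis, ...], np.array([y])
```

One full cycle is `len(datasets)` yields. With Keras-style training, passing the generator to `fit` with `steps_per_epoch` set to `len(datasets)` (or a multiple of it) would make each epoch end only after every file has been sampled, which is the point where a checkpoint callback would evaluate the loss.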

This may be overkill, but I want to be thorough. If option 1 would accomplish the same thing, that's fine with me too; I just haven't found the answer.

Thanks!

Update

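Since asking this question, one approach I worked out (for my particular application) is the code below. Basically, if I pull the last 5 years of price data for several different stocks, I append one dataset on top of another into one large dataset, then iterate over all the rows after assigning a "look-back" period, i.e. how many days of features the LSTM should look back on. The function then checks the date column: as long as the dates of each group of 10 feature rows are in ascending order, it batches those rows together for the LSTM. However, if the dates go from 2020-09-01 to 2015-09-01, that part of the dataset is where a new stock's data begins, so it just keeps moving down the file until it finds 10 rows that all belong to one stock. This ensures that each 3D feature sample for the LSTM comes from one specific stock only.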

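Hopefully that makes some sense. I commented the function thoroughly, so it should be easy to see how it works, and then defined a GRU model to show how to put it into practice: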

    import numpy as np
    import pandas as pd
    from tensorflow.keras.layers import GRU, Dense
    from tensorflow.keras.models import Sequential
    from tensorflow.keras.optimizers import Adam


    # A function to get a set of X's and Y's for training an LSTM,
    # so long as the dates are in ascending order, so you're not
    # stitching together X features from two different datasets
    def create_batched_dataset(x, y, time_steps=1):  # the calls below pass time_steps=10
        x = x.reset_index()  # move the date index into a column so the dates can be compared
        x['Date'] = pd.to_datetime(x['Audit_Date'])  # make the dates datetime objects
        x = x.drop(columns=['Audit_Date'])  # keep the raw date column out of the features

        xs, ys = [], []  # lists for our features/labels for the LSTM

        for i in range(len(x) - time_steps):  # range 0 to 430 in my dataset
            v = x.iloc[i:(i + time_steps), :]  # the next `time_steps` rows of X

            # Only batch from one training dataset, not where they stitch together:
            # the last date of the window must come after its first date.
            if v['Date'].iloc[-1] <= v['Date'].iloc[0]:
                continue

            v = v.set_index(['Date'])  # set the index again
            xs.append(v.iloc[:, :-1].to_numpy())  # the window's rows, without the target label
            ys.append(y.iloc[i + time_steps])  # the label that follows the window

        return np.array(xs), np.array(ys)


    # Get our reshaped features/labels (to [samples, time_steps, n_features]).
    # train_scaled and test_scaled are the scaled DataFrames (not shown here).
    x_train, y_train = create_batched_dataset(train_scaled, train_scaled.iloc[:, -1], 10)
    x_test, y_test = create_batched_dataset(test_scaled, test_scaled.iloc[:, -1], 10)


    # Define some type of LSTM model
    model = Sequential()
    model.add(GRU(11, input_shape=(x_train.shape[1], x_train.shape[2])))
    model.add(Dense(11, activation="relu"))
    model.add(Dense(1))
    model.compile(loss='mae', optimizer=Adam(1 / 1000))
    model.summary()


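Update 2

Here is another solution, using lists. Basically, for each ticker we have a dataset: import it as a df, add its price data to individual lists, then append those lists to one master list. Then, when you are ready to train, randomly pull one list of prices from the master list to feed into your neural network. Note that you have to define open = prices[0], high = prices[1], etc. inside your NN function. Hope that helps: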

    import random

    import pandas as pd

    # list_of_tickers, interval, iterations, agent and initial_money are defined elsewhere (not shown)
    prices_library = []
    for ticker in list_of_tickers:  # used for multiple tickers
        print(ticker)

        df = pd.read_csv('./' + ticker + '_' + interval + 'm.csv')

        open_ = df['Open'].values.tolist()  # trailing underscore avoids shadowing the built-in open()
        high = df['High'].values.tolist()
        low = df['Low'].values.tolist()
        close = df['Close'].values.tolist()
        volume = df['Volume'].values.tolist()

        # One entry per ticker, ordered so that prices[0] is open, prices[1] is high, etc.
        prices_library.append([open_, high, low, close, volume])

    total = len(prices_library) * iterations
    for i in range(total):
        print('Iteration: ' + str(i + 1) + ' of ' + str(total))
        # Train on the price lists of one randomly chosen ticker
        agent.train(iterations=1, checkpoint=1, initial_money=initial_money,
                    prices=random.choice(prices_library))


Answer 1

小智 1

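Merge all the CSVs into one file and give it enough steps to cover all of them. If you pre-process, create the sequences in one training file, with each sequence taking up one row and containing the previous 20 or so periods from a single CSV. That way, when the sequences are fed into the model in random order, each one still corresponds to the correct stock.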
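
One way to read this answer is sketched below, as an illustration only. `build_pooled_sequences` is a hypothetical name, each dataset is assumed to be a 2-D NumPy array with the target in its last column, and the pooled sequences are kept in memory rather than written out to a single training file.

```python
import numpy as np


def build_pooled_sequences(datasets, window=20, seed=None):
    """Window each dataset separately, then pool and shuffle the sequences."""
    xs, ys = [], []
    for data in datasets:
        # Windows never cross a file boundary because each file is sliced on its own.
        for i in range(len(data) - window):
            xs.append(data[i:i + window, :-1])  # previous `window` periods of features
            ys.append(data[i + window, -1])     # value to predict for that sequence
    x, y = np.array(xs), np.array(ys)
    # Shuffle inputs and targets with the same permutation so the pairs stay aligned.
    order = np.random.default_rng(seed).permutation(len(x))
    return x[order], y[order]
```

Because every sequence is cut from a single file before shuffling, a jump from one stock's price range to another can only happen between sequences, never inside one.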

Archived:	5 years, 10 months ago
Viewed:	4071 times
Last activity:	4 years, 10 months ago