lol*_*pie 9 python keras tensorflow
I am new to TensorFlow/Keras. My file structure consists of 3000 folders, each containing 200 images, all of which need to be loaded as data. I know that keras.preprocessing.image_dataset_from_directory lets me load the data and split it into training/validation sets, like so:
import tensorflow as tf

val_data = tf.keras.preprocessing.image_dataset_from_directory(
    'etlcdb/ETL9G_IMG/',
    image_size=(128, 127),
    validation_split=0.3,
    subset="validation",
    seed=1,
    color_mode='grayscale',
    shuffle=True)
Found 607200 files belonging to 3036 classes. Using 182160 files for validation.
But then I am not sure how to further split my validation set into validation and test splits while keeping the classes intact. As far as I can tell (from the GitHub source code), the take method simply takes the first x elements of the dataset, and skip does the same. I am not sure whether this keeps the data stratified, and I am not quite sure how to get the labels back out of the dataset to check.
Any help would be appreciated.
Son*_*Das 11
You almost got the answer. The key is to use .take() and .skip() to further split the validation set into two datasets -- one for validation and the other for testing. Using your example, suppose you need 70% for the training set, 10% for the validation set, and 20% for the test set. For the sake of completeness, I am also including the step that generates the training set. Let's first assign a few basic variables that must be the same when initially splitting the entire dataset into training and validation sets.
seed_train_validation = 1 # Must be the same for train_ds and val_ds
shuffle_value = True
validation_split = 0.3
train_ds = tf.keras.utils.image_dataset_from_directory(
    directory='etlcdb/ETL9G_IMG/',
    image_size=(128, 127),
    validation_split=validation_split,
    subset="training",
    seed=seed_train_validation,
    color_mode='grayscale',
    shuffle=shuffle_value)

val_ds = tf.keras.utils.image_dataset_from_directory(
    directory='etlcdb/ETL9G_IMG/',
    image_size=(128, 127),
    validation_split=validation_split,
    subset="validation",
    seed=seed_train_validation,
    color_mode='grayscale',
    shuffle=shuffle_value)
Next, determine how many batches of data are available in the validation set using tf.data.experimental.cardinality, and then move two-thirds of them (2/3 of 30% = 20%) into a test set, as follows. Note that batch_size defaults to 32 (see the documentation).
val_batches = tf.data.experimental.cardinality(val_ds)
test_ds = val_ds.take((2*val_batches) // 3)
val_ds = val_ds.skip((2*val_batches) // 3)
All three datasets (train_ds, val_ds, and test_ds) yield batches of images together with labels inferred from the directory structure, so you are good to go from here.
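As a quick sanity check (a minimal sketch assuming the variables above and that numpy is available), you can count the batches in each split and pull the labels back out of one of them -- which is also how you get the labels out of a dataset in general:

import numpy as np

# How many batches ended up in each split (batch_size defaults to 32).
print(tf.data.experimental.cardinality(train_ds).numpy())
print(tf.data.experimental.cardinality(val_ds).numpy())
print(tf.data.experimental.cardinality(test_ds).numpy())

# Recover the integer class labels from the test split, e.g. to inspect
# how the classes are distributed after the take/skip split.
test_labels = np.concatenate([labels.numpy() for _, labels in test_ds], axis=0)
print(np.unique(test_labels, return_counts=True))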
I cannot find supporting documentation, but I believe image_dataset_from_directory takes the end portion of the dataset as the validation split. shuffle now defaults to True, so the dataset is shuffled before training to avoid a validation split made up of only certain classes. The split done by image_dataset_from_directory only concerns the training process. If you need a test split (strongly recommended), you should split your data into train and test beforehand, and then let image_dataset_from_directory split your training data into training and validation. A sketch of that up-front split follows below.
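A minimal sketch of that up-front split (the test directory path here is hypothetical; it moves a random 20% of each class folder into a separate test directory, so the per-class structure is preserved):

import random
import shutil
from pathlib import Path

random.seed(1)
src = Path('etlcdb/ETL9G_IMG')       # original data, one sub-folder per class
dst = Path('etlcdb/ETL9G_IMG_test')  # hypothetical held-out test directory

for class_dir in src.iterdir():
    if not class_dir.is_dir():
        continue
    files = sorted(class_dir.glob('*'))
    random.shuffle(files)
    n_test = int(0.2 * len(files))   # 20% of each class goes to the test set
    (dst / class_dir.name).mkdir(parents=True, exist_ok=True)
    for f in files[:n_test]:
        shutil.move(str(f), str(dst / class_dir.name / f.name))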
I usually take a smaller percentage (10%) for the in-training validation and split the original dataset into 80% training and 20% test. With these values, the final splits (relative to the initial dataset size) are: 72% training, 8% validation, and 20% test.
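With that layout, the loading calls would look roughly like this (a sketch assuming the train/test directories from the snippet above):

# 90% of the remaining 80% -> 72% of the original dataset for training,
# 10% of the remaining 80% -> 8% for validation (use subset="validation").
train_ds = tf.keras.utils.image_dataset_from_directory(
    directory='etlcdb/ETL9G_IMG/',       # what is left after moving 20% out
    image_size=(128, 127),
    validation_split=0.1,
    subset="training",
    seed=1,
    color_mode='grayscale')

test_ds = tf.keras.utils.image_dataset_from_directory(
    directory='etlcdb/ETL9G_IMG_test/',  # hypothetical test directory
    image_size=(128, 127),
    color_mode='grayscale')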
There is more information on how to split data across directories in this question: Keras split train test set when using ImageDataGenerator.