What is the index argument of the __getitem__(...) method in tf.keras.utils.Sequence?

Tay*_*wte 5 python conv-neural-network keras tensorflow

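TL;DR: I wrote a custom tf.keras.utils.Sequence [1] to load batches of data into keras.Model.fit(...). Even with identical hyperparameters, model, and data structure, the generator performs much worse than calling the model with data loaded from memory. The model overfits, so I wonder whether the index argument that model.fit(...) [2] passes to the generator's __getitem__(..., index) method could cause the same images to be fed to the model multiple times. How is the index argument chosen? Is it ordered? Is the maximum index controlled by __len__(...)?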

References

  1. tf.keras.utils.Sequence
  2. keras.Model.fit

I am using a subclass of tf.keras.utils.Sequence [1] to feed batches of data to the model.fit(...) method, as shown below.

import numpy as np
from tensorflow.keras.utils import Sequence


class Generator(Sequence):

    def __init__(self, df, x, y, file_type, req_dim, directory, batch_size):
        # data info
        self.df = df
        self.x = self.df[x]  # path list to images being loaded
        self.y = self.df[y]  # corresponding target values
        self.index = self.df.index.to_list()
        self.directory = directory  # directory where feature images are stored
        self.file_type = file_type  # dictates which type of image to load
        self.req_dim = req_dim  # required image dimensions, used in __getitem__
        # for batches
        self.batch_size = batch_size

    def __len__(self):
        """
        :returns number of batches per epoch
        """
        return int(np.floor(len(self.df) / self.batch_size))

    def __getitem__(self, index):
        """
        receives call from keras (index) and grabs corresponding data batch
        :param index: batch number, from 0 to len(self) - 1
        :return: tuple (x, y) holding one batch of images and targets
        """
        # instantiate output array
        x = np.empty((self.batch_size, *self.req_dim))

        # batches
        batch_x = self.x[index*self.batch_size:(index+1)*self.batch_size].to_numpy()
        batch_y = self.y[index*self.batch_size:(index+1)*self.batch_size].to_numpy(dtype=float)

        # i is the position inside the batch, path is the image path
        for i, path in enumerate(batch_x):

            # logic to load images + perform operations on them
            im = load(...)
            im = operations(im)

            x[i, ] = im  # makes batches of ims

        return x, batch_y.reshape(-1, 1)


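Traditionally, I have loaded the data directly into memory, but I need the Sequence subclass above (it resembles a generator and is called the generator from here on) to handle larger file sizes. To test whether the generator works, I fed it data that is small enough to be loaded directly into memory. The results with data loaded into memory are consistent with previous experiments, whereas using the generator causes the model to overfit the training data.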

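Since the model is overfitting, I wonder whether the index argument that keras passes to __getitem__(self, index) to retrieve a given batch number is ordered, or whether it could cause a single image to be read multiple times?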
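One way to narrow this down without involving Keras is to replay the slicing from __getitem__ for every batch index in range(len(generator)) and count how often each row position is used. The sketch below is only an illustration: visited_rows is a hypothetical helper, and 23 samples with a batch size of 5 are made-up numbers. It assumes every index from 0 to __len__() - 1 is requested exactly once per epoch.

```python
import math
from collections import Counter


def visited_rows(n_samples, batch_size):
    """Replay the slicing used in __getitem__ for each batch index.

    Hypothetical helper: returns the number of batches and a Counter
    mapping each row position to how many times it was selected.
    """
    # Same rounding rule as __len__ in the Generator class (floor)
    n_batches = math.floor(n_samples / batch_size)
    counts = Counter()
    for index in range(n_batches):
        start = index * batch_size
        stop = (index + 1) * batch_size
        counts.update(range(start, min(stop, n_samples)))
    return n_batches, counts


n_batches, counts = visited_rows(n_samples=23, batch_size=5)
print(n_batches)             # 4 batches, so indices 0..3
print(max(counts.values()))  # 1: no row is selected more than once
print(23 - len(counts))      # 3 rows are never used because of floor()
```

Under that assumption the slicing itself does not repeat rows within an epoch. The only side effect of using floor in __len__ is that the trailing rows that do not fill a complete batch are skipped.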

The generator is used in the following pseudocode:

# load data
data = load_data(...)

# split data according to batch size used later so that each split has an equal number of samples when 
# divided into batches
train, test, val = train_test_split(data) 

scaler = Scaler()
train['target'] = scaler.fit_transform(train['target'])
test['target'] = scaler.transform(test['target'])
val['target'] = scaler.transform(val['target'])

# instantiate generator
train_gen = Generator(train, x='feature_name', y='target', file_type=file_type, req_dim=dims, directory=directory, batch_size=5)

# load validation images and targets directly to memory
val_x = load(...)
val_y = val['target'].to_numpy(dtype=float)


model = model_1(*dims)  # Convolutional neural network that takes in height, width, depth args

model.compile(optimizer=Adam(lr=1e-5, decay=1e-5/400), loss=LogCosh())

history = model.fit(train_gen, validation_data=(val_x, val_y))

pred = model.predict(test)


小智 2

I think the index passed to __getitem__ depends on the total number of samples and the batch_size you set.

For example, I am currently working with the fer2013plus dataset, and I have 3944 images for testing. My test_generator is created like this:

test_generator = ImageDataGenerator().flow_from_directory(test_dir,
                                                  target_size=(48,48),
                                                  color_mode='grayscale',
                                                  batch_size=32,
                                                  class_mode='categorical')

When I call test_generator.__getitem__(), the valid indices run from 0 to 123. Any other index raises an error: ValueError: Asked to retrieve element 124, but the Sequence has length 124
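The numbers above are consistent with the batch count being the sample count divided by the batch size, rounded up. A quick arithmetic check, using only the figures quoted in this answer (the variable names are illustrative):

```python
import math

n_images = 3944   # test images in the example above
batch_size = 32   # batch_size passed to flow_from_directory

# Rounding up keeps the last, partially filled batch
n_batches = math.ceil(n_images / batch_size)
valid_indices = range(n_batches)

# Number of images left over for the final batch
last_batch_size = n_images - (n_batches - 1) * batch_size

print(n_batches)                             # 124
print(valid_indices[0], valid_indices[-1])   # 0 123
print(last_batch_size)                       # 8
```

So the index is a batch number, not a sample number: 124 batches give indices 0 through 123, and the last batch holds the remaining 8 images.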