What is the index argument of the __getitem__(...) method in tf.keras.utils.Sequence?

Question


Tay*_*wte 5 python conv-neural-network keras tensorflow

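TL;DR: I made a custom tf.keras.utils.Sequence [1] to load batches of data into keras.model.fit(...). Despite identical hyperparameters, model, and data structure, the model performs much worse with the generator than when it is called with data loaded from memory. The model overfits, so I wonder whether the index argument passed from the model.fit(...) [2] method to the generator's __getitem__(..., index) method could cause the same images to be fed to the model multiple times. How is the index argument chosen? Is it ordered? Is the maximum index controlled by __len__(...)?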

Details

I am using a subclass of tf.keras.utils.Sequence [1] to feed batches of data to the model.fit(...) method, as shown below.

import numpy as np
from tensorflow.keras.utils import Sequence


class Generator(Sequence):

    def __init__(self, df, x, y, file_type, req_dim, directory, batch_size):
        # data info
        self.df = df
        self.x = self.df[x]  # path list to images being loaded
        self.y = self.df[y]  # corresponding target values
        self.index = self.df.index.to_list()
        self.directory = directory  # directory where feature images are stored
        self.file_type = file_type  # dictates which type of image to load
        self.req_dim = req_dim  # image dimensions; used in __getitem__, so it must be stored
        # for batches
        self.batch_size = batch_size

    def __len__(self):
        """
        :returns number of batches per epoch
        """
        return int(np.floor(len(self.df) / self.batch_size))

    def __getitem__(self, index):
        """
        receives call from keras (index) and grabs corresponding data batch
        :param index: batch number, 0 <= index < len(self)
        :return: tuple (x, y) holding the images and targets of that batch
        """
        # instantiate output array
        x = np.empty((self.batch_size, *self.req_dim))

        # batches (.iloc makes the positional slicing explicit)
        start, stop = index * self.batch_size, (index + 1) * self.batch_size
        batch_x = self.x.iloc[start:stop].to_numpy()
        batch_y = self.y.iloc[start:stop].to_numpy(dtype=float)

        # enumerate: i is the row in the batch; the path itself cannot index x
        for i, path in enumerate(batch_x):

            # logic to load images + perform operations on them
            im = load(...)
            im = operations(im)

            x[i, ] = im  # makes batches of ims

        return x, batch_y.reshape(-1, 1)


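Traditionally I have loaded my data directly into memory, but I need the Sequence subclass above (similar to a generator, and referred to as the generator from here on) to handle larger file sizes. To test that the generator works, I used data that can be loaded both directly into memory and through the generator. The results with data loaded into memory are consistent with previous experiments, whereas using the generator causes the model to overfit the training data.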
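One way to make that comparison concrete is to walk every batch the generator can return and compare the result against the in-memory data. The sketch below is only an illustration: `ListSequence` is a hypothetical stand-in with the same `__len__`/`__getitem__` contract as the class above, and `collect_batches` is a helper name introduced here, not part of Keras.

```python
class ListSequence:
    """Hypothetical stand-in with the same __len__/__getitem__ contract."""

    def __init__(self, xs, ys, batch_size):
        self.xs, self.ys, self.batch_size = xs, ys, batch_size

    def __len__(self):
        # floor division, mirroring __len__ in the question
        return len(self.xs) // self.batch_size

    def __getitem__(self, index):
        s = slice(index * self.batch_size, (index + 1) * self.batch_size)
        return self.xs[s], self.ys[s]


def collect_batches(seq):
    """Concatenate every batch of a Sequence-like object, in index order."""
    xs, ys = [], []
    for index in range(len(seq)):
        batch_x, batch_y = seq[index]
        xs.extend(batch_x)
        ys.extend(batch_y)
    return xs, ys


# In-memory reference data (made-up values for illustration only)
xs = [f"img_{i}" for i in range(12)]
ys = [float(i) for i in range(12)]

got_x, got_y = collect_batches(ListSequence(xs, ys, batch_size=5))
covered = len(got_x)
print(got_x == xs[:covered], got_y == ys[:covered])  # True True
print(covered)  # 10, the last 2 samples are dropped by the floor
```

If the same check fails on the real generator, the problem is likely in how each batch is filled rather than in which index Keras asks for.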

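Since the model is overfitting, I wonder whether the index argument that Keras passes to __getitem__(self, index) to retrieve a given batch number is ordered, or whether it could cause individual images to be read multiple times?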
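The slicing arithmetic itself can be checked without Keras. This is a minimal sketch assuming the index is a batch number in `range(len(sequence))` and that each batch covers `index*batch_size` to `(index+1)*batch_size`, as in the generator above; `batch_bounds` and `sample_visits` are names made up for the illustration. As far as I know, `fit` may visit the batch indices in shuffled order, but each index still maps to one fixed slice, so the order does not change the counts.

```python
from collections import Counter


def batch_bounds(index, batch_size):
    """Positional slice [start, stop) covered by batch number `index`."""
    return index * batch_size, (index + 1) * batch_size


def sample_visits(n_samples, batch_size):
    """Count how often each sample position is read in one pass over all batches."""
    n_batches = n_samples // batch_size  # same floor as __len__ in the question
    visits = Counter()
    for index in range(n_batches):  # every index a Sequence of this length can serve
        start, stop = batch_bounds(index, batch_size)
        visits.update(range(start, min(stop, n_samples)))
    return visits


visits = sample_visits(n_samples=23, batch_size=5)
print(max(visits.values()))  # 1, no sample is read more than once per pass
print(23 - len(visits))      # 3, samples never read because of the floor
```

Under these assumptions each sample appears at most once per epoch, so repeated images would have to come from the batch-filling loop, not from the index.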

The generator is used in the following pseudocode:

# load data
data = load_data(...)

# split data according to batch size used later so that each split has an equal number of samples when
# divided into batches
train, test, val = train_test_split(data)

scaler = Scaler()
train['target'] = scaler.fit_transform(train['target'])
test['target'] = scaler.transform(test['target'])
val['target'] = scaler.transform(val['target'])

# instantiate generator (the Generator class defined above; keyword arguments throughout,
# since positional arguments may not follow keyword arguments)
train_gen = Generator(train, x='feature_name', y='target', file_type=file_type,
                      req_dim=dims, directory=directory, batch_size=5)

# load validation images and targets directly to memory
val_x = load(...)
val_y = val['target'].to_numpy(dtype=float)


model = model_1(*dims)  # Convolutional neural network that takes in height, width, depth args

model.compile(optimizer=Adam(lr=1e-5, decay=1e-5/400), loss=LogCosh())

history = model.fit(train_gen, validation_data=(val_x, val_y))

pred = model.predict(test)


Answer 1

小智 2

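I think the index passed to __getitem__ depends on the total number of samples and the batch_size you assign.

For example, I am currently working with the fer2013plus dataset, and I have 3944 images for testing. My test_generator is created like this: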

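from tensorflow.keras.preprocessing.image import ImageDataGenerator

# test_dir is the directory that holds the test images, one subfolder per class
test_generator = ImageDataGenerator().flow_from_directory(test_dir,
                                                          target_size=(48, 48),
                                                          color_mode='grayscale',
                                                          batch_size=32,
                                                          class_mode='categorical')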


When I call test_generator.__getitem__(), the index can be anything from 0 to 123. Any larger value raises ValueError: Asked to retrieve element 124, but the Sequence has length 124
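The numbers above are consistent with the length being the sample count divided by the batch size, rounded up. The sketch below reproduces that arithmetic with a toy class; `ToySequence` is a hypothetical stand-in that only mimics the bounds check described here and is not the Keras source.

```python
import math


class ToySequence:
    """Stand-in that mimics the bounds check described above (not Keras code)."""

    def __init__(self, n_samples, batch_size):
        self.n_samples = n_samples
        self.batch_size = batch_size

    def __len__(self):
        # rounded up, so a final partial batch is still served
        return math.ceil(self.n_samples / self.batch_size)

    def __getitem__(self, index):
        if index >= len(self):
            raise ValueError(
                f"Asked to retrieve element {index}, "
                f"but the Sequence has length {len(self)}"
            )
        start = index * self.batch_size
        stop = min(start + self.batch_size, self.n_samples)
        return list(range(start, stop))  # sample positions in this batch


seq = ToySequence(n_samples=3944, batch_size=32)
print(len(seq))       # 124, so valid indices are 0 to 123
print(len(seq[123]))  # 8, the last batch holds the 3944 - 123*32 leftover images
```

Note the contrast with the generator in the question, which rounds down in `__len__` and therefore would serve 123 full batches for the same data and skip the last 8 images.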

Archived:	5 years ago
Views:	2391
Last activity:	4 years ago
