Tay*_*wte 5 python conv-neural-network keras tensorflow
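TL;DR: I made a custom tf.keras.utils.Sequence [1] to load data in batches into keras.model.fit(...). Although the hyperparameters, model, and data structure are the same, the model performs much worse with the generator than when it is called with data loaded from memory. The model overfits, so I am wondering whether the index argument passed from the model.fit(...) [2] method to the generator's __getitem__(..., index) method could cause the same images to be fed to the model multiple times. How is the index argument chosen? Is it ordered? Is the maximum index controlled by __len__(...)?
Reference
I am using a subclass of tf.keras.utils.Sequence [1] to feed batches of data to the model.fit(...) method, as shown below.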
import numpy as np
from tensorflow.keras.utils import Sequence


class Generator(Sequence):
    def __init__(self, df, x, y, file_type, req_dim, directory, batch_size):
        # data info
        self.df = df
        self.x = self.df[x]  # path list to images being loaded
        self.y = self.df[y]  # corresponding target values
        self.index = self.df.index.to_list()
        self.directory = directory  # directory where feature images are stored
        self.file_type = file_type  # dictates which type of image to load
        self.req_dim = req_dim  # image dimensions; used in __getitem__ but never stored in the posted code
        # for batches
        self.batch_size = batch_size

    def __len__(self):
        """
        :returns: number of batches per epoch (a trailing partial batch is dropped)
        """
        return int(np.floor(len(self.df) / self.batch_size))

    def __getitem__(self, index):
        """
        receives call from keras (index) and grabs corresponding data batch
        :param index: batch number
        :return: tuple of (images, targets) for that batch
        """
        # instantiate output array
        x = np.empty((self.batch_size, *self.req_dim))
        # batches (positional slices of the path and target columns)
        batch_x = self.x.iloc[index*self.batch_size:(index+1)*self.batch_size].to_numpy()
        batch_y = self.y.iloc[index*self.batch_size:(index+1)*self.batch_size].to_numpy(dtype=float)
        # enumerate gives the row position i; the posted loop used the path itself as the array index
        for i, path in enumerate(batch_x):
            # logic to load images + perform operations on them
            im = load(...)
            im = operations(im)
            x[i, ] = im  # makes batches of ims
        return x, batch_y.reshape(-1, 1)
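Traditionally I have loaded the data directly into memory, but I need the Sequence subclass above (it behaves like a generator, and is called a generator from here on) to handle larger file sizes. To test whether the generator works, I used data that can be loaded both directly into memory and through the generator. Results with the in-memory data are consistent with previous experiments, whereas using the generator causes the model to overfit the training data.
Because the model overfits, I am wondering whether the index argument that keras sends to __getitem__(self, index) to retrieve a given batch number is ordered, or whether it could cause a single image to be read multiple times?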
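One way to reason about the "read multiple times" worry is to count how often each sample position is touched in one epoch. The sketch below is plain Python and does not call Keras. It assumes, as I understand the behaviour of fit with a Sequence, that every batch index in range(len(generator)) is requested once per epoch, in shuffled order when shuffle=True. The helper batch_positions and the sample counts are made up for illustration; the slicing mirrors the __getitem__ above.

```python
import random
from collections import Counter


def batch_positions(index, batch_size, n_samples):
    """Sample positions covered by one batch, mirroring the slicing in __getitem__."""
    start = index * batch_size
    stop = min((index + 1) * batch_size, n_samples)
    return range(start, stop)


n_samples, batch_size = 23, 5          # illustrative numbers, not from the post
n_batches = n_samples // batch_size    # floor, as in __len__ above

# Stand-in for the order in which batch indices are requested during one epoch.
order = list(range(n_batches))
random.shuffle(order)

seen = Counter()
for index in order:
    seen.update(batch_positions(index, batch_size, n_samples))

duplicates = [pos for pos, count in seen.items() if count > 1]
skipped = [pos for pos in range(n_samples) if pos not in seen]
print("read more than once:", duplicates)
print("never read:", skipped)
```

Under that assumption no position is read twice, whatever the order. The floor in __len__ does mean the last few samples are never read when the sample count is not a multiple of the batch size.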
The generator is used in the following pseudocode:
# load data
data = load_data(...)
# split data according to the batch size used later so that each split has an equal number of samples
# when divided into batches
train, test, val = train_test_split(data)
scaler = Scaler()
train['target'] = scaler.fit_transform(train['target'])
test['target'] = scaler.transform(test['target'])
val['target'] = scaler.transform(val['target'])
# instantiate generator (all keywords, since positional arguments cannot follow keyword arguments)
train_gen = Generator(train, x='feature_name', y='target', file_type=file_type, req_dim=dims,
                      directory=directory, batch_size=5)
# load validation images and targets directly to memory
val_x = load(...)
val_y = val['target'].to_numpy(dtype=float)
model = model_1(*dims)  # Convolutional neural network that takes in height, width, depth args
# learning_rate was spelled lr in the original post; decay is an older optimizer argument
# that newer Keras releases may reject
model.compile(optimizer=Adam(learning_rate=1e-5, decay=1e-5/400), loss=LogCosh())
history = model.fit(train_gen, validation_data=(val_x, val_y))
pred = model.predict(test)
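Since the same data can be loaded both ways, a direct check is to walk the generator batch by batch and compare what it yields with the in-memory arrays. The following is a minimal sketch in plain Python. ToySequence and the toy data are hypothetical stand-ins that only follow the Sequence protocol (__len__ and __getitem__), not the real Generator class.

```python
class ToySequence:
    """Hypothetical stand-in following the Sequence protocol (__len__, __getitem__)."""

    def __init__(self, samples, targets, batch_size):
        self.samples = samples
        self.targets = targets
        self.batch_size = batch_size

    def __len__(self):
        # Number of full batches, matching the floor used in the question.
        return len(self.samples) // self.batch_size

    def __getitem__(self, index):
        lo = index * self.batch_size
        hi = lo + self.batch_size
        return self.samples[lo:hi], self.targets[lo:hi]


# Toy "in-memory" data: ten two-value samples and ten targets.
samples = [[float(i), float(i) + 0.5] for i in range(10)]
targets = [float(i) * 2 for i in range(10)]

gen = ToySequence(samples, targets, batch_size=5)

# Collect everything the generator yields, in index order.
gen_x, gen_y = [], []
for index in range(len(gen)):
    batch_x, batch_y = gen[index]
    gen_x.extend(batch_x)
    gen_y.extend(batch_y)

# Compare with the in-memory data over the samples the generator covers.
same_inputs = gen_x == samples[:len(gen_x)]
same_targets = gen_y == targets[:len(gen_y)]
print(same_inputs, same_targets)
```

With real image arrays the same loop applies, with the comparison done by something like numpy.allclose. A mismatch in values, shape, or ordering between the two paths would point at the loading logic rather than at how the index is chosen.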
小智 2
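I think the index passed to __getitem__ depends on the total number of samples and the batch_size you assign.
For example, I am currently working with the fer2013plus dataset, and I have 3944 images for testing. My test_generator is created like this: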
from tensorflow.keras.preprocessing.image import ImageDataGenerator

test_generator = ImageDataGenerator().flow_from_directory(test_dir,
                                                          target_size=(48, 48),
                                                          color_mode='grayscale',
                                                          batch_size=32,
                                                          class_mode='categorical')
When I call test_generator.__getitem__(), the valid indices are 0 to 123. Any other index raises an error such as ValueError: Asked to retrieve element 124, but the Sequence has length 124
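The range 0 to 123 follows from the batch arithmetic. A minimal sketch, assuming the iterator's length is the sample count divided by the batch size and rounded up, which is consistent with the length of 124 in the error message; the variable names are only illustrative.

```python
import math

n_samples = 3944   # test images mentioned above
batch_size = 32

# Rounding up keeps the final, smaller batch.
n_batches_ceil = math.ceil(n_samples / batch_size)
last_valid_index = n_batches_ceil - 1
last_batch_size = n_samples - last_valid_index * batch_size

# Rounding down, as the __len__ in the question does, drops that remainder.
n_batches_floor = n_samples // batch_size

print(n_batches_ceil, last_valid_index, last_batch_size, n_batches_floor)
```

So the largest index is set by the value __len__ returns, and whether the final partial batch is served depends on whether that value is rounded up or down.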