Massive overfitting during ResNet50 transfer learning

Question

mor*_*nor 6 python conv-neural-network keras tensorflow resnet

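This is my first attempt at doing anything with CNNs, so I am probably doing something very stupid, but I cannot figure out where I am going wrong...

The model seems to be learning fine, but the validation accuracy does not improve (ever, not even after the first epoch), and the validation loss actually increases over time. It does not look like ordinary overfitting (after 1 epoch?), so something else must be off.

I am training a CNN: I have ~100k images of various plants (1000 classes) and want to fine-tune ResNet50 to create a multiclass classifier. The images come in various sizes, and I load them like this: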

import numpy as np
from keras.preprocessing import image

def path_to_tensor(img_path):
    # loads RGB image as PIL.Image.Image type, resized to (IMG_HEIGHT, IMG_HEIGHT)
    img = image.load_img(img_path, target_size=(IMG_HEIGHT, IMG_HEIGHT))
    # convert PIL.Image.Image type to 3D tensor with shape (IMG_HEIGHT, IMG_HEIGHT, 3)
    x = image.img_to_array(img)
    # convert 3D tensor to 4D tensor with shape (1, IMG_HEIGHT, IMG_HEIGHT, 3) and return 4D tensor
    return np.expand_dims(x, axis=0)

def paths_to_tensor(img_paths):
    # stack the per-image 4D tensors into one batch of shape (N, IMG_HEIGHT, IMG_HEIGHT, 3)
    list_of_tensors = [path_to_tensor(img_path) for img_path in img_paths]  # can use tqdm(img_paths) for a progress bar
    return np.vstack(list_of_tensors)

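The dataset is large (it does not fit into memory), so I had to write my own generator that both reads from disk and augments. (I know Keras has .flow_from_directory(), but my data is not structured that way: it is just a dump of 100k images mixed with 100k metadata files.) I probably should have written a script to structure the files better instead of writing my own generator, but the problem is likely somewhere else.

The generator version below does not do any augmentation for now, only rescaling: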

from keras.preprocessing.image import ImageDataGenerator

def generate_batches_from_train_folder(images_to_read, labels, batchsize = BATCH_SIZE):

    #Generator that returns batches of images ('xs') and labels ('ys') from the train folder
    #:param string filepath: Full filepath of files to read - this needs to be a list of image files
    #:param np.array: list of all labels for the images_to_read - those need to be one-hot-encoded
    #:param int batchsize: Size of the batches that should be generated.
    #:return: (ndarray, ndarray) (xs, ys): Yields a tuple which contains a full batch of images and labels. 

    dimensions = (BATCH_SIZE, IMG_HEIGHT, IMG_HEIGHT, 3)

    train_datagen = ImageDataGenerator(
        rescale=1./255,
        #rotation_range=20,
        #zoom_range=0.2, 
        #fill_mode='nearest',
        #horizontal_flip=True
    )

    # needs to be in an infinite loop for the generator to work
    while 1:
        filesize = len(images_to_read)

        # count how many entries we have read
        n_entries = 0
        # as long as we haven't read all entries from the file: keep reading
        while n_entries < (filesize - batchsize):

            # start the next batch at index 0
            # create numpy arrays of input data (features) 
            # - this is already shaped as a tensor (output of the support function paths_to_tensor)
            xs = paths_to_tensor(images_to_read[n_entries : n_entries + batchsize])

            # and label info. Contains 1000 labels in my case for each possible plant species
            ys = labels[n_entries : n_entries + batchsize]

            # we have read one more batch from this file
            n_entries += batchsize

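            # perform online augmentation on the xs and ys
            augmented_generator = train_datagen.flow(xs, ys, batch_size = batchsize)

            # yield inside the inner loop so that every batch is emitted,
            # not only the last batch of each pass over the data
            yield next(augmented_generator)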

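A minimal sketch of the batching logic on its own, assuming the intent is to emit every batch and to reshuffle once per epoch. Keras documents the `shuffle` argument of `fit_generator` as applying only to `Sequence` instances, so a plain generator has to do its own shuffling. `batch_indices` is a hypothetical helper, not part of Keras; it only produces index arrays, which could then be used to pick entries from `images_to_read` and `labels`.

```python
import numpy as np

def batch_indices(n_samples, batch_size, seed=0):
    """Yield index arrays forever, one batch at a time, reshuffled every epoch.

    Hypothetical helper for illustration; the last, smaller batch is kept.
    """
    rng = np.random.default_rng(seed)
    while True:
        # new random order at the start of each pass over the data
        order = rng.permutation(n_samples)
        for start in range(0, n_samples, batch_size):
            yield order[start:start + batch_size]

# One epoch over 10 samples with batch size 4 gives batches of 4, 4 and 2.
gen = batch_indices(n_samples=10, batch_size=4)
epoch = [next(gen) for _ in range(3)]
sizes = [len(b) for b in epoch]
seen = sorted(int(i) for b in epoch for i in b)
```

With this scheme the number of steps per epoch is `ceil(n_samples / batch_size)` rather than `n_samples // batch_size`, because the final partial batch is kept.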

This is how I define my model:

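from keras import optimizers
from keras.applications.resnet50 import ResNet50
from keras.layers import Dense, Dropout, Flatten
from keras.models import Model

def get_model():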

    # define the model
    base_net = ResNet50(input_shape=DIMENSIONS, weights='imagenet', include_top=False)

    # Freeze the layers which you don't want to train. Here I am freezing all of them
    for layer in base_net.layers:
        layer.trainable = False

    x = base_net.output

    #for resnet50
    x = Flatten()(x)
    x = Dense(512, activation="relu")(x)
    x = Dropout(0.5)(x)
    x = Dense(1000, activation='softmax', name='predictions')(x)

    model = Model(inputs=base_net.input, outputs=x)

    # compile the model 
    model.compile(
        loss='categorical_crossentropy',
        optimizer=optimizers.Adam(1e-3),
        metrics=['acc'])

    return model

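As a result, I have 1,562,088 trainable parameters for roughly 70k images.

I then use 5-fold cross-validation, but the model does not work on any of the folds, so I will not include the full code here. The relevant part is this: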
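As a rough sanity check on that figure (a sketch, not part of the original code): the count matches the two Dense layers of the head if Flatten receives 2048 values per image. That input size is an assumption inferred from the number itself, since the post does not state IMG_HEIGHT or DIMENSIONS, and `dense_params` is a hypothetical helper.

```python
def dense_params(n_in, n_out):
    """Weights plus biases of a fully connected layer."""
    return n_in * n_out + n_out

# Assumption: the frozen base hands a 2048-dimensional vector to the head.
flattened_size = 2048
hidden = dense_params(flattened_size, 512)   # Dense(512): 1,049,088
output = dense_params(512, 1000)             # Dense(1000): 513,000
total = hidden + output
print(total)  # 1562088
```

Dropout adds no parameters and the frozen base contributes no trainable ones, so only these two layers are counted.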

trial_fold = temp_model.fit_generator(
                train_generator,
                steps_per_epoch = len(X_train_path) // BATCH_SIZE,
                epochs = 50,
                verbose = 1,
                validation_data = (xs_v,ys_v),#valid_generator,
                #validation_steps= len(X_valid_path) // BATCH_SIZE,
                callbacks = callbacks,
                shuffle=True)

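I have tried all sorts of things: I made sure my generator actually works, played with the last few layers of the network by reducing the size of the fully connected layer, and tried augmentation. Nothing helps...

I do not think the number of parameters in the network is too large. I know other people have done pretty much the same thing and reached an accuracy close to 0.5, but my model seems to overfit like crazy. Any ideas on how to tackle this would be much appreciated!

Update 1:

I have decided to stop reinventing the wheel and sorted the files so that I can use the .flow_from_directory() procedure. To make sure the images are imported in the right format (prompted by Ioannis Nasios' comment below), I made sure to use preprocess_input() from Keras's resnet50 application.

I also decided to check whether the model actually produces something useful: I computed the bottleneck features for my dataset and then used a random forest to predict the classes. It did work, and I got an accuracy of around 0.4.

So I guess I definitely had a problem with the input format of my images. As a next step, I will fine-tune the model (with a new top layer) to see whether the problem remains...

Update 2:

I think the problem was with the image preprocessing. In the end I did not fine-tune at all; I just extracted the bottleneck layer and trained linear_SVC(), which gave an accuracy of about 60% on the training set and about 45% on the test set.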

Answer 1

Ses*_*ism 6

You need to use the preprocessing_function argument of ImageDataGenerator.

 train_datagen = ImageDataGenerator(preprocessing_function=keras.applications.resnet50.preprocess_input)

This ensures your images are preprocessed the way the pretrained network you are using expects.
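To see why `rescale=1./255` and `preprocess_input` are not interchangeable, here is a NumPy sketch of the "caffe"-style preprocessing that, to my knowledge, `keras.applications.resnet50.preprocess_input` applies: flip RGB to BGR and subtract the ImageNet channel means, with no scaling. `caffe_style_preprocess` is an illustrative stand-in, not the Keras function itself, and the random images are toy data.

```python
import numpy as np

# ImageNet channel means in BGR order, as used by "caffe"-style preprocessing.
IMAGENET_BGR_MEAN = np.array([103.939, 116.779, 123.68], dtype=np.float32)

def caffe_style_preprocess(batch_rgb):
    """Illustrative stand-in: flip RGB to BGR, then subtract per-channel means.

    Expects a float array of shape (N, H, W, 3) with values in [0, 255].
    """
    bgr = batch_rgb[..., ::-1].astype(np.float32)
    return bgr - IMAGENET_BGR_MEAN

rng = np.random.default_rng(0)
images = rng.integers(0, 256, size=(2, 4, 4, 3)).astype(np.float32)

rescaled = images / 255.0                      # what rescale=1./255 produces
preprocessed = caffe_style_preprocess(images)  # zero-centred, unscaled BGR

print(rescaled.min(), rescaled.max())          # stays inside [0, 1]
print(preprocessed.min(), preprocessed.max())  # bounded by about [-124, 152]
```

The two inputs differ in both scale and channel order, so frozen ImageNet weights see a very different distribution when fed the rescaled version.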

Answer 2

grz*_*700 0

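The problem is that the dataset is too small per class: 100k examples / 1000 classes = ~100 examples per class. That is too few. Your network can memorize all the examples in its weight matrices, but to generalize it needs many more examples. Try using only the most common classes and see what happens.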
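A minimal sketch of restricting the data to the most common classes, assuming the labels are one-hot encoded as in the question. `keep_most_common_classes` is a hypothetical helper and the six-sample label array is toy data.

```python
import numpy as np

def keep_most_common_classes(labels_one_hot, top_k):
    """Return a boolean mask over samples and the ids of the top_k largest classes.

    Hypothetical helper; labels_one_hot has shape (n_samples, n_classes).
    """
    class_ids = labels_one_hot.argmax(axis=1)
    counts = np.bincount(class_ids, minlength=labels_one_hot.shape[1])
    top_classes = np.argsort(counts)[::-1][:top_k]
    mask = np.isin(class_ids, top_classes)
    return mask, top_classes

# Toy data: 6 samples over 4 classes; class 2 appears three times, class 0 twice.
class_ids = np.array([2, 0, 2, 1, 0, 2])
labels = np.eye(4)[class_ids]
mask, top_classes = keep_most_common_classes(labels, top_k=2)
```

The mask can be applied to both the list of image paths and the label array before building the generator; the output layer would then need `top_k` units and relabelled targets.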
