How to remove duplicates when training a CNN?

Question

Yid*_*dne 3 python image-processing deep-learning conv-neural-network keras

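I am using a CNN for an image classification problem. My image dataset contains duplicate images, and when I train the CNN on this data it overfits. So I need to remove those duplicates.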

Answer 1

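What we loosely call duplicates can be hard for an algorithm to identify. Your duplicates can be:

1. Exact duplicates
2. Near-exact duplicates (minor edits to the image, etc.)
3. Perceptual duplicates (same content, but a different view, camera, etc.)

Nos. 1 and 2 are easier to solve. No. 3 is very subjective and still a research topic. I can offer a solution for Nos. 1 and 2. Both solutions use the excellent image hashing library: https://github.com/JohannesBuchner/imagehash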

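Exact duplicates: these can be found using a perceptual hashing measure. The phash library is quite good at this, and I routinely use it to clean training data. Usage (from the GitHub site) is very simple: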

from PIL import Image
import imagehash

# image_fns: list of training image file paths (defined elsewhere)
img_hashes = {}  # maps each hash to the first file seen with that hash

for img_fn in sorted(image_fns):
    img_hash = imagehash.average_hash(Image.open(img_fn))
    if img_hash in img_hashes:
        print('{} duplicate of {}'.format(img_fn, img_hashes[img_hash]))
    else:
        img_hashes[img_hash] = img_fn

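If the duplicates are byte-for-byte identical files, a digest of the file contents is a stricter and cheaper check than a perceptual hash, which also matches images that merely look the same at hash resolution. The following is a minimal sketch of that alternative, not part of imagehash; `find_exact_duplicates` is a hypothetical helper name, and SHA-256 is just one reasonable choice of digest.

```python
import hashlib
from pathlib import Path


def find_exact_duplicates(paths):
    """Map each byte-identical duplicate file to the first file with the same content."""
    first_seen = {}   # digest -> first path seen with that content
    duplicates = {}   # duplicate path -> path it duplicates
    for path in sorted(paths):
        # Hash the raw file bytes; identical files give identical digests
        digest = hashlib.sha256(Path(path).read_bytes()).hexdigest()
        if digest in first_seen:
            duplicates[path] = first_seen[digest]
        else:
            first_seen[digest] = path
    return duplicates
```

Calling it once on the list of training files and then filtering out every path that appears as a key in the returned dict leaves one copy of each file for training.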

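Near-exact duplicates: in this case you have to set a threshold and compare the distance between hash values. The right threshold has to be found by trial and error on your image content.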

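from itertools import combinations

from PIL import Image
import imagehash

# image_fns: list of training image file paths (defined elsewhere)
# Maximum Hamming distance (out of 64 bits for the default 8x8 hash) below
# which two images count as near duplicates; tune it by trial and error.
epsilon = 5

# Hash every image once instead of re-opening files for each comparison
img_hashes = {fn: imagehash.average_hash(Image.open(fn)) for fn in image_fns}

# Compare every unordered pair of distinct files
for img_fn1, img_fn2 in combinations(sorted(img_hashes), 2):
    if img_hashes[img_fn1] - img_hashes[img_fn2] < epsilon:
        print('{} is near duplicate of {}'.format(img_fn1, img_fn2))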

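The check above only reports near duplicates. To actually drop them from the training list, one option is a greedy filter that keeps a file only if its hash is at least `epsilon` bits away from every file kept so far. The sketch below is an assumption about how that could be done, not something the library provides. It works on hashes given as plain integers so that it has no dependencies; since imagehash values print as hex strings, `int(str(h), 16)` should be one way to obtain such an integer. The names `hamming_distance` and `drop_near_duplicates` are hypothetical.

```python
def hamming_distance(hash1, hash2):
    """Number of differing bits between two hashes given as integers."""
    return bin(hash1 ^ hash2).count("1")


def drop_near_duplicates(hashes, epsilon):
    """Greedily keep files whose hash is at least epsilon bits from every kept file.

    hashes: dict mapping file name -> integer hash value.
    Returns the kept file names in sorted order.
    """
    kept = []
    for fn in sorted(hashes):
        # Keep fn only if it is not a near duplicate of anything kept so far
        if all(hamming_distance(hashes[fn], hashes[k]) >= epsilon for k in kept):
            kept.append(fn)
    return kept
```

The result depends on the order in which files are visited, and the pairwise comparison is quadratic in the number of files, so this is best suited to datasets of moderate size.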
