How can I split a folder of images into test/training/validation sets using stratified sampling?

Yue*_*rno 7 python python-3.x

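I have a very large folder of images, along with a CSV file containing the class label for each image. Because they are all in one giant folder, I would like to split them into training/test/validation sets, perhaps by creating three new folders and moving the images into each one with some kind of Python script. I want to do stratified sampling so that the class percentages stay the same across all three sets.

What would be the approach to writing a script that can do this?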

AVI*_*AIN 13

Use the Python library split-folders.

pip install split-folders

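Suppose all the images are stored in the Data folder (with one subfolder per class, which is the layout split-folders expects). Then apply it as follows: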

import splitfolders  # releases before the rename exposed this module as split_folders

splitfolders.ratio('Data', output="output", seed=1337, ratio=(.8, .1, .1))

Running the snippet above creates 3 folders in the output directory:

  • train
  • val
  • test

Changing the values in the ratio parameter changes the number of images in each folder (train:val:test).
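
To check that the class proportions really match across the splits, the files per class can be counted in each output folder. A minimal sketch, assuming the output tree is laid out as output/<split>/<class>/<files>; the helper name class_counts is hypothetical and not part of split-folders:

```python
from pathlib import Path


def class_counts(output_dir):
    """Return {split: {class_name: file_count}} for a tree laid out as
    output_dir/<split>/<class_name>/<files>."""
    counts = {}
    for split_dir in sorted(Path(output_dir).iterdir()):
        if not split_dir.is_dir():
            continue
        # Count only regular files inside each class subfolder.
        counts[split_dir.name] = {
            class_dir.name: sum(1 for f in class_dir.iterdir() if f.is_file())
            for class_dir in sorted(split_dir.iterdir())
            if class_dir.is_dir()
        }
    return counts
```

Calling class_counts("output") after the split and dividing each class count by its split total should give roughly the same shares in train, val, and test.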

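  • It's not "split_folders" but "splitfolders". The correct code is: `import splitfolders; splitfolders.ratio('Data', output="output", seed=1337, ratio=(.8, 0.1, 0.1))` (6 upvotes)
  • This little library is a real gem. I was pulling my hair out all day until I stumbled on this. Gotta love SO and GitHub. (2 upvotes)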

小智 8

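I ran into a similar problem myself. All my images were stored in two folders, "Project/Data2/DPN+" and "Project/Data2/DPN-". It is a binary classification problem, and the two classes are "DPN+" and "DPN-". Both class folders contain .png files. My goal was to distribute the dataset into training, validation, and test folders. Each of these new folders contains 2 more folders, "DPN+" and "DPN-", indicating the class. For the partition I used a 70:15:15 split. I am a beginner in Python, so please let me know if I have made any mistakes.

Here is my code: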

import os
import numpy as np
import shutil

# Creating Train / Val / Test folders (exist_ok lets the script be re-run safely)
root_dir = 'Data2'
posCls = '/DPN+'
negCls = '/DPN-'

os.makedirs(root_dir + '/train' + posCls, exist_ok=True)
os.makedirs(root_dir + '/train' + negCls, exist_ok=True)
os.makedirs(root_dir + '/val' + posCls, exist_ok=True)
os.makedirs(root_dir + '/val' + negCls, exist_ok=True)
os.makedirs(root_dir + '/test' + posCls, exist_ok=True)
os.makedirs(root_dir + '/test' + negCls, exist_ok=True)

# Partition each class separately after shuffling, so that both classes
# keep the same 70:15:15 proportions in every split
for currentCls in (posCls, negCls):
    src = root_dir + currentCls  # Folder to copy images from

    allFileNames = os.listdir(src)
    np.random.shuffle(allFileNames)
    train_FileNames, val_FileNames, test_FileNames = np.split(
        np.array(allFileNames),
        [int(len(allFileNames) * 0.7), int(len(allFileNames) * 0.85)])

    train_FileNames = [src + '/' + name for name in train_FileNames.tolist()]
    val_FileNames = [src + '/' + name for name in val_FileNames.tolist()]
    test_FileNames = [src + '/' + name for name in test_FileNames.tolist()]

    print('Class: ', currentCls)
    print('Total images: ', len(allFileNames))
    print('Training: ', len(train_FileNames))
    print('Validation: ', len(val_FileNames))
    print('Testing: ', len(test_FileNames))

    # Copy the images into the matching split folder
    for name in train_FileNames:
        shutil.copy(name, root_dir + '/train' + currentCls)

    for name in val_FileNames:
        shutil.copy(name, root_dir + '/val' + currentCls)

    for name in test_FileNames:
        shutil.copy(name, root_dir + '/test' + currentCls)
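
Because the shuffle above is unseeded, each run produces a different split, and since files are copied rather than moved, a second run can leave the same image in both train and test. A minimal sketch of a repeatable shuffle; the helper name shuffled_names and the default seed are assumptions for illustration:

```python
import random


def shuffled_names(names, seed=1337):
    """Return a shuffled copy of names that is identical on every run."""
    ordered = sorted(names)  # os.listdir returns entries in arbitrary order
    random.Random(seed).shuffle(ordered)  # private generator, fixed seed
    return ordered
```

Passing os.listdir(src) through such a helper before slicing keeps the train/val/test membership stable between runs.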


小智 7

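I took Steven White's answer above and modified it a bit, because there was a small problem with the split. Also, the files were being saved in the main folder instead of in the train/test/val folders.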

import os
import numpy as np
import shutil
import pandas as pd


def train_test_split():
    print("########### Train Test Val Script started ###########")
    #data_csv = pd.read_csv("DataSet_Final.csv") ##Use if you have classes saved in any .csv file

    root_dir = 'New_folder_to_be_created'
    classes_dir = ['class 1', 'class 2', 'class 3', 'class 4']

    #for name in data_csv['names'].unique()[:10]:
    #    classes_dir.append(name)

    processed_dir = 'Existing_folder_to_take_images_from'

    val_ratio = 0.20
    test_ratio = 0.20

    for cls in classes_dir:
        # Creating partitions of the data after shuffling
        print("$$$$$$$ Class Name " + cls + " $$$$$$$")
        src = processed_dir +"//" + cls  # Folder to copy images from

        allFileNames = os.listdir(src)
        np.random.shuffle(allFileNames)
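        # Cut points: train ends at 1 - (val + test), validation ends at 1 - test,
        # so the validation part really gets val_ratio of the files
        train_FileNames, val_FileNames, test_FileNames = np.split(np.array(allFileNames),
                                                                  [int(len(allFileNames) * (1 - (val_ratio + test_ratio))),
                                                                   int(len(allFileNames) * (1 - test_ratio)),
                                                                   ])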

        train_FileNames = [src + '//' + name for name in train_FileNames.tolist()]
        val_FileNames = [src + '//' + name for name in val_FileNames.tolist()]
        test_FileNames = [src + '//' + name for name in test_FileNames.tolist()]

        print('Total images: '+ str(len(allFileNames)))
        print('Training: '+ str(len(train_FileNames)))
        print('Validation: '+  str(len(val_FileNames)))
        print('Testing: '+ str(len(test_FileNames)))

        # Creating Train / Val / Test folders (exist_ok lets the script be re-run safely)
        os.makedirs(root_dir + '/train//' + cls, exist_ok=True)
        os.makedirs(root_dir + '/val//' + cls, exist_ok=True)
        os.makedirs(root_dir + '/test//' + cls, exist_ok=True)

        # Copy-pasting images
        for name in train_FileNames:
            shutil.copy(name, root_dir + '/train//' + cls)

        for name in val_FileNames:
            shutil.copy(name, root_dir + '/val//' + cls)

        for name in test_FileNames:
            shutil.copy(name, root_dir + '/test//' + cls)

    print("########### Train Test Val Script Ended ###########")

train_test_split()
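
The two cut points passed to np.split decide the sizes of the three parts, which matters once the validation and test ratios differ. A minimal sketch of that arithmetic; the helper name split_sizes is hypothetical:

```python
def split_sizes(n_files, val_ratio, test_ratio):
    """Return (train, val, test) sizes for n_files using two cut points."""
    first_cut = int(n_files * (1 - (val_ratio + test_ratio)))  # end of train
    second_cut = int(n_files * (1 - test_ratio))               # end of validation
    return first_cut, second_cut - first_cut, n_files - second_cut
```

For 100 files with val_ratio=0.25 and test_ratio=0.125 this gives (62, 25, 13): int() truncates both cut points, so the leftover file ends up in the test part.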


Bha*_*ari 6

Update (2022):

I have developed a Python package called python_splitter that automates the whole process in a single line. It automatically generates Train-Test-Val or Train-Test folders. Learn more: https://github.com/bharatadk/python_splitter

! pip install python_splitter 
import python_splitter
python_splitter.split_from_folder("SOURCE_FOLDER", train=0.5, test=0.3, val=0.2)

Old method (manual process)

**I have written improved code, which you only have to run once.**

## I made this for TB vs Normal image datasets by improving above code
## import libraries

import os
import numpy as np
import shutil
import random

# creating train / val /test
root_dir = 'TB_Chest_Radiography_Database/'
new_root = 'AllDatasets/'
classes = ['Normal', 'Tuberculosis']

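for cls in classes:
    # exist_ok lets the script be re-run without raising FileExistsError
    os.makedirs(root_dir + new_root + 'train/' + cls, exist_ok=True)
    os.makedirs(root_dir + new_root + 'val/' + cls, exist_ok=True)
    os.makedirs(root_dir + new_root + 'test/' + cls, exist_ok=True)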
    
## creating partition of the data after shuffling

for cls in classes:
    src = root_dir + cls # folder to copy images from
    print(src)

    allFileNames = os.listdir(src)
    np.random.shuffle(allFileNames)

    ## here 0.75 = training ratio, (0.95-0.75) = validation ratio,
    ## (1-0.95) = testing ratio
    train_FileNames,val_FileNames,test_FileNames = np.split(np.array(allFileNames),[int(len(allFileNames)*0.75),int(len(allFileNames)*0.95)])

    # #Converting file names from array to list

    train_FileNames = [src+'/'+ name for name in train_FileNames]
    val_FileNames = [src+'/' + name for name in val_FileNames]
    test_FileNames = [src+'/' + name for name in test_FileNames]

    print('Total images  : '+ cls + ' ' +str(len(allFileNames)))
    print('Training : '+ cls + ' '+str(len(train_FileNames)))
    print('Validation : '+ cls + ' ' +str(len(val_FileNames)))
    print('Testing : '+ cls + ' '+str(len(test_FileNames)))
    
    ## Copy pasting images to target directory

    for name in train_FileNames:
        shutil.copy(name, root_dir + new_root+'train/'+cls )


    for name in val_FileNames:
        shutil.copy(name, root_dir +new_root+'val/'+cls )


    for name in test_FileNames:
        shutil.copy(name,root_dir + new_root+'test/'+cls )


  
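
The scripts above all start from one subfolder per class, while the question describes a single flat folder plus a CSV file of labels. A minimal sketch of a stratified split for that case, using only the standard library. The column names filename and label, the function name stratified_split, and the default ratios and seed are all assumptions and need adjusting to the real data:

```python
import csv
import random
import shutil
from collections import defaultdict
from pathlib import Path


def stratified_split(image_dir, csv_path, output_dir,
                     ratios=(0.8, 0.1, 0.1), seed=1337,
                     filename_col="filename", label_col="label"):
    """Copy images into output_dir/<split>/<label>/, splitting each class
    separately so every split keeps roughly the same class proportions.

    Column names are assumptions. Returns {split: {label: count}}.
    """
    # Group file names by class label read from the CSV.
    by_label = defaultdict(list)
    with open(csv_path, newline="") as f:
        for row in csv.DictReader(f):
            by_label[row[label_col]].append(row[filename_col])

    rng = random.Random(seed)  # fixed seed keeps the split repeatable
    counts = {"train": {}, "val": {}, "test": {}}
    for label, names in sorted(by_label.items()):
        names = sorted(names)
        rng.shuffle(names)
        n = len(names)
        first_cut = int(n * ratios[0])                 # end of train
        second_cut = int(n * (ratios[0] + ratios[1]))  # end of validation
        parts = {
            "train": names[:first_cut],
            "val": names[first_cut:second_cut],
            "test": names[second_cut:],
        }
        for split, split_names in parts.items():
            target = Path(output_dir) / split / label
            target.mkdir(parents=True, exist_ok=True)
            for name in split_names:
                # Copy rather than move, so the source folder stays intact.
                shutil.copy(Path(image_dir) / name, target / name)
            counts[split][label] = len(split_names)
    return counts
```

Splitting inside the per-label loop is what makes the sampling stratified: each class is cut at the same ratios, so a rare class is not left out of the validation or test set by chance. Very small classes can still end up with empty parts because the cut points are truncated.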