Splitting an image dataset into training and test datasets

Ish*_*xit 8 training-data python-3.x train-test-split

I have a main folder that contains subfolders, and each subfolder contains the images of the dataset, as shown below.

-main_db

---CLASS_1

-----img_1

-----img_2

-----img_3

-----img_4

---CLASS_2

-----img_1

-----img_2

-----img_3

-----img_4

---CLASS_3

-----img_1

-----img_2

-----img_3

-----img_4

I need to split this dataset into two parts: training data (70%) and testing data (30%). Below is the hierarchy I want to end up with.

-main_db

---training_data

-----CLASS_1

-------img_1

-------img_2

-------img_3

-------img_4

-----CLASS_2

-------img_1

-------img_2

-------img_3

-------img_4

---testing_data

-----CLASS_1

-------img_5

-------img_6

-------img_7

-------img_8

-----CLASS_2

-------img_5

-------img_6

-------img_7

-------img_8

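Any help is appreciated. Thanks.

I tried the module below, but it did not work for me. The module could not be imported at all.

https://github.com/jfilter/split-folders

It does exactly what I want.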

小智 13

If you are not too keen on writing code, there is a Python package called split-folders that you can use. It is very easy to use. Here is how it is used:

# Install from the shell first: pip install split-folders
import splitfolders  # older releases expose the module as split_folders

input_folder = "/path/to/input/folder"
# Where the split datasets are saved. A folder is created if it does not exist or if none is set.
output = "/path/to/output/folder"

# The ratio is given in train/val/test order and can be changed to whatever you want.
# For train/val sets only, pass two values, for example (.75, .25).
splitfolders.ratio(input_folder, output=output, seed=42, ratio=(.8, .1, .1))

However, I strongly recommend coding up the approaches from the other answers yourself, because doing so will help you learn.


nom*_*mow 9

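This should do it. It counts how many images are in each folder and then splits them accordingly, saving the test data in a different folder with the same structure. Save the code in a file named main.py and run the command: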

python3 main.py --data_path=/path1 --test_data_path_to_save=/path2 --train_ratio=0.7

import shutil
import os
import numpy as np
import argparse

def get_files_from_folder(path):

    files = os.listdir(path)
    return np.asarray(files)

def main(path_to_data, path_to_test_data, train_ratio):
    # get dirs
    _, dirs, _ = next(os.walk(path_to_data))

    # calculates how many train data per class
    data_counter_per_class = np.zeros((len(dirs)))
    for i in range(len(dirs)):
        path = os.path.join(path_to_data, dirs[i])
        files = get_files_from_folder(path)
        data_counter_per_class[i] = len(files)
    test_counter = np.round(data_counter_per_class * (1 - train_ratio))

    # transfers files
    for i in range(len(dirs)):
        path_to_original = os.path.join(path_to_data, dirs[i])
        path_to_save = os.path.join(path_to_test_data, dirs[i])

        #creates dir
        if not os.path.exists(path_to_save):
            os.makedirs(path_to_save)
        files = get_files_from_folder(path_to_original)
        # moves data
        for j in range(int(test_counter[i])):
            dst = os.path.join(path_to_save, files[j])
            src = os.path.join(path_to_original, files[j])
            shutil.move(src, dst)


def parse_args():
    parser = argparse.ArgumentParser(description="Dataset divider")
    parser.add_argument("--data_path", required=True,
                        help="Path to data")
    parser.add_argument("--test_data_path_to_save", required=True,
                        help="Path to test data where to save")
    # argparse applies %-formatting to help strings, so a literal percent sign is written as %%
    parser.add_argument("--train_ratio", required=True, type=float,
                        help="Train ratio - 0.7 means splitting data in 70 %% train and 30 %% test")
    return parser.parse_args()

if __name__ == "__main__":
    args = parse_args()
    main(args.data_path, args.test_data_path_to_save, args.train_ratio)
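The script above moves the first files returned by os.listdir (whose order is arbitrary) out of the original folders and leaves the remaining files in place as the training set. As a rough sketch of a variant that produces the training_data / testing_data layout from the question, the function below shuffles each class with a fixed seed and copies rather than moves. The name split_dataset, its parameters, and the default seed are assumptions made for illustration; they are not taken from any answer here.

```python
import random
import shutil
from pathlib import Path


def split_dataset(src_root, dst_root, train_ratio=0.7, seed=42):
    """Copy every class folder under src_root into
    dst_root/training_data/<class> and dst_root/testing_data/<class>."""
    src_root, dst_root = Path(src_root), Path(dst_root)
    rng = random.Random(seed)  # fixed seed so the split is reproducible
    counts = {}
    # The class list is built before any output folder is created.
    for class_dir in sorted(p for p in src_root.iterdir() if p.is_dir()):
        files = sorted(f for f in class_dir.iterdir() if f.is_file())
        rng.shuffle(files)
        n_train = round(len(files) * train_ratio)
        parts = {
            "training_data": files[:n_train],
            "testing_data": files[n_train:],
        }
        for part_name, part_files in parts.items():
            out_dir = dst_root / part_name / class_dir.name
            out_dir.mkdir(parents=True, exist_ok=True)
            for f in part_files:
                shutil.copy2(f, out_dir / f.name)  # originals stay untouched
        counts[class_dir.name] = (n_train, len(files) - n_train)
    return counts
```

Calling split_dataset("main_db", "main_db_split") would leave main_db unchanged and return a dictionary mapping each class name to its (train, test) file counts. Using a destination outside the source folder avoids the output folders being mistaken for classes on a second run.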


Dip*_*ant 6

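See this link: https://www.kaggle.com/questions-and-answers/102677 (thanks to the comment by "saravanansaminathan" on Kaggle). I had the same problem with a dataset that has the following folder structure: /TTSplit /0 /001_01.jpg ....... /1 /001_04.jpg ....... I used the link above as a reference.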

import os
import shutil

import numpy as np

root_dir = '/home/dipak/Desktop/TTSplit/'
classes_dir = ['0', '1']

test_ratio = 0.20

for cls in classes_dir:
    # Create the train/<class> and test/<class> output folders.
    os.makedirs(root_dir + 'train/' + cls, exist_ok=True)
    os.makedirs(root_dir + 'test/' + cls, exist_ok=True)

    src = root_dir + cls

    # Shuffle the file names, then cut the list at the train/test boundary.
    allFileNames = os.listdir(src)
    np.random.shuffle(allFileNames)
    train_FileNames, test_FileNames = np.split(np.array(allFileNames),
                                               [int(len(allFileNames) * (1 - test_ratio))])

    train_FileNames = [src + '/' + name for name in train_FileNames.tolist()]
    test_FileNames = [src + '/' + name for name in test_FileNames.tolist()]

    print("*****************************")
    print('Class: ', cls)
    print('Total images: ', len(allFileNames))
    print('Training: ', len(train_FileNames))
    print('Testing: ', len(test_FileNames))
    print("*****************************")

    # Copy each file into the folder of its own class only.
    for name in train_FileNames:
        shutil.copy(name, root_dir + 'train/' + cls)

    for name in test_FileNames:
        shutil.copy(name, root_dir + 'test/' + cls)

print("Copying Done!")


小智 5

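import os

from sklearn.model_selection import train_test_split

# image_directory is the path to the folder that holds the images; set it before running.
data = os.listdir(image_directory)
train, valid = train_test_split(data, test_size=0.2, random_state=1)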

You can then use shutil to copy the images into the desired folders.
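A minimal sketch of that copy step, using only the standard library. The helper name copy_files and the destination folder names in the usage note are hypothetical; the lists of file names are assumed to be the train and valid lists produced by the split.

```python
import shutil
from pathlib import Path


def copy_files(names, src_dir, dst_dir):
    """Copy the named files from src_dir into dst_dir, creating dst_dir if needed."""
    src_dir, dst_dir = Path(src_dir), Path(dst_dir)
    dst_dir.mkdir(parents=True, exist_ok=True)
    for name in names:
        # copy2 also preserves file metadata such as modification times
        shutil.copy2(src_dir / name, dst_dir / name)
    return len(names)
```

For example, copy_files(train, image_directory, "train/CLASS_1") followed by copy_files(valid, image_directory, "valid/CLASS_1") would fill the two folders for a single class; repeat the split and the copy for each class folder.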