nir*_*air 29 python deep-learning pytorch
import pandas as pd
import numpy as np
import cv2
from torch.utils.data.dataset import Dataset

class CustomDatasetFromCSV(Dataset):
    def __init__(self, csv_path, transform=None):
        self.data = pd.read_csv(csv_path)
        self.labels = pd.get_dummies(self.data['emotion']).as_matrix()
        self.height = 48
        self.width = 48
        self.transform = transform

    def __getitem__(self, index):
        pixels = self.data['pixels'].tolist()
        faces = []
        for pixel_sequence in pixels:
            face = [int(pixel) for pixel in pixel_sequence.split(' ')]
            # print(np.asarray(face).shape)
            face = np.asarray(face).reshape(self.width, self.height)
            face = cv2.resize(face.astype('uint8'), (self.width, self.height))
            faces.append(face.astype('float32'))
        faces = np.asarray(faces)
        faces = np.expand_dims(faces, -1)
        return faces, self.labels

    def __len__(self):
        return len(self.data)
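This is what I managed to put together by referring to other repositories. However, I want to split this dataset into train and test sets.
How can I do that inside this class? Or do I need a separate class for it?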
Fáb*_*rez 68
As of PyTorch 0.4.1, you can use random_split:
train_size = int(0.8 * len(full_dataset))
test_size = len(full_dataset) - train_size
train_dataset, test_dataset = torch.utils.data.random_split(full_dataset, [train_size, test_size])
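If the split has to be repeatable between runs, one option is to pass a seeded generator to random_split. The sketch below assumes a PyTorch version whose random_split accepts a `generator` argument; the toy TensorDataset and the `split` helper are stand-ins made up for illustration, not part of the question.

```python
import torch
from torch.utils.data import TensorDataset, random_split

# Toy stand-in for full_dataset: 10 samples with 3 features each (hypothetical data).
full_dataset = TensorDataset(
    torch.arange(30, dtype=torch.float32).reshape(10, 3),
    torch.arange(10),
)

train_size = int(0.8 * len(full_dataset))
test_size = len(full_dataset) - train_size


def split(seed):
    # A seeded generator fixes the random permutation, so the same seed
    # always produces the same train/test membership.
    generator = torch.Generator().manual_seed(seed)
    return random_split(full_dataset, [train_size, test_size], generator=generator)


train_a, test_a = split(0)
train_b, test_b = split(0)
```

Calling `split` twice with the same seed should give the same indices both times, which makes results easier to compare across experiments.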
ben*_*che 46
Use PyTorch's SubsetRandomSampler:
import cv2
import numpy as np
import pandas as pd
import torch
from torch.utils.data import Dataset
from torch.utils.data.sampler import SubsetRandomSampler

class CustomDatasetFromCSV(Dataset):
    def __init__(self, csv_path, transform=None):
        self.data = pd.read_csv(csv_path)
        # as_matrix() was removed in pandas 1.0; to_numpy() is its replacement
        self.labels = pd.get_dummies(self.data['emotion']).to_numpy()
        self.height = 48
        self.width = 48
        self.transform = transform

    def __getitem__(self, index):
        # This method should return only 1 sample and label
        # (according to "index"), not the whole dataset
        # So probably something like this for you:
        pixel_sequence = self.data['pixels'][index]
        face = [int(pixel) for pixel in pixel_sequence.split(' ')]
        face = np.asarray(face).reshape(self.width, self.height)
        face = cv2.resize(face.astype('uint8'), (self.width, self.height))
        label = self.labels[index]
        return face, label

    def __len__(self):
        return len(self.labels)

dataset = CustomDatasetFromCSV(my_path)  # my_path: path to your CSV file
batch_size = 16
validation_split = .2
shuffle_dataset = True
random_seed = 42

# Creating data indices for training and validation splits:
dataset_size = len(dataset)
indices = list(range(dataset_size))
split = int(np.floor(validation_split * dataset_size))
if shuffle_dataset:
    np.random.seed(random_seed)
    np.random.shuffle(indices)
train_indices, val_indices = indices[split:], indices[:split]

# Creating PT data samplers and loaders:
train_sampler = SubsetRandomSampler(train_indices)
valid_sampler = SubsetRandomSampler(val_indices)

train_loader = torch.utils.data.DataLoader(dataset, batch_size=batch_size,
                                           sampler=train_sampler)
validation_loader = torch.utils.data.DataLoader(dataset, batch_size=batch_size,
                                                sampler=valid_sampler)

# Usage Example:
num_epochs = 10
for epoch in range(num_epochs):
    # Train:
    for batch_index, (faces, labels) in enumerate(train_loader):
        pass  # ...
Eri*_*ric 25
If you want to make sure your splits have balanced classes, you can use train_test_split from sklearn.
Assuming you have wrapped your data in a custom dataset object named data:
from torch.utils.data import DataLoader, Subset
from sklearn.model_selection import train_test_split

TEST_SIZE = 0.1
BATCH_SIZE = 64
SEED = 42

# generate indices: we pass in integer indices instead of the actual data
train_indices, test_indices, _, _ = train_test_split(
    range(len(data)),
    data.targets,
    stratify=data.targets,
    test_size=TEST_SIZE,
    random_state=SEED
)

# generate subset based on indices
train_split = Subset(data, train_indices)
test_split = Subset(data, test_indices)

# create batches
train_batches = DataLoader(train_split, batch_size=BATCH_SIZE, shuffle=True)
test_batches = DataLoader(test_split, batch_size=BATCH_SIZE)
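The snippet above relies on the dataset exposing a `targets` attribute, which the dataset class in the question does not have; it stores one-hot labels instead. A minimal sketch of one way to recover integer class ids for the `stratify` argument, assuming a one-hot matrix like the one pd.get_dummies produces (the `one_hot` array here is made-up example data):

```python
import numpy as np

# Hypothetical one-hot label matrix: one row per sample, one column per class.
one_hot = np.array([
    [1, 0, 0],
    [0, 0, 1],
    [0, 1, 0],
    [0, 0, 1],
])

# argmax over the class axis gives the column index of the 1 in each row,
# which can serve as the class id when stratifying.
targets = one_hot.argmax(axis=1)
```

Note that the ids are column positions in the one-hot matrix, so they match the original label values only if the columns are ordered the same way as the labels.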
Shi*_*hah 12
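The current answers do a random split, which has the drawback that the number of samples per class is not guaranteed to be balanced. This is especially problematic when you want only a small number of samples per class. For example, MNIST has 60,000 examples, i.e. roughly 6,000 per digit. Suppose you want only 30 examples per digit in your training set. In that case a random split can produce an imbalance between classes (one digit ends up with more training data than the others). So you want to make sure that each digit has exactly 30 labels. This is called stratified sampling.
One way to do this is to use the sampler interface in PyTorch; sample code is here.
Another way is to just hack your way through :). For example, below is a simple implementation for MNIST, where ds is the MNIST dataset and k is the number of samples needed for each class.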
import torch
from torch.utils.data import TensorDataset

def sampleFromClass(ds, k):
    # The first k samples of each class go to the train set, the rest to the test set
    class_counts = {}
    train_data = []
    train_label = []
    test_data = []
    test_label = []
    for data, label in ds:
        c = int(label)  # works for plain int labels and for 0-dim tensors
        class_counts[c] = class_counts.get(c, 0) + 1
        if class_counts[c] <= k:
            train_data.append(data)
            train_label.append(torch.tensor([c]))
        else:
            test_data.append(data)
            test_label.append(torch.tensor([c]))
    # Each MNIST image is a 1x28x28 tensor, so cat stacks them along dim 0
    train_data = torch.cat(train_data)
    train_label = torch.cat(train_label)
    test_data = torch.cat(test_data)
    test_label = torch.cat(test_label)
    return (TensorDataset(train_data, train_label),
            TensorDataset(test_data, test_label))
You can use this function like this:
from torchvision import datasets, transforms

def main():
    train_ds = datasets.MNIST('../data', train=True, download=True,
                              transform=transforms.Compose([
                                  transforms.ToTensor()
                              ]))
    train_ds, test_ds = sampleFromClass(train_ds, 3)
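An index-based variant of the same idea avoids copying the data into new tensors and picks the k samples per class at random rather than taking the first k. This is only a sketch: `k_per_class_indices` is a hypothetical helper, and it assumes the labels are available as a plain sequence of ints.

```python
import random
from collections import defaultdict


def k_per_class_indices(labels, k, seed=0):
    """Return (train_idx, rest_idx) with up to k random indices per class in train_idx."""
    rng = random.Random(seed)

    # Group sample indices by their class label.
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[int(label)].append(idx)

    # Draw k indices (or all of them, if a class has fewer than k) from each class.
    train_idx = []
    for label in sorted(by_class):
        members = by_class[label]
        train_idx.extend(rng.sample(members, min(k, len(members))))

    # Everything that was not drawn goes to the remaining set.
    chosen = set(train_idx)
    rest_idx = [i for i in range(len(labels)) if i not in chosen]
    return train_idx, rest_idx


labels = [0, 1, 2] * 5  # 15 toy labels, 5 per class
train_idx, rest_idx = k_per_class_indices(labels, k=2)
```

The two index lists can then be wrapped with torch.utils.data.Subset(ds, train_idx) and torch.utils.data.Subset(ds, rest_idx), or handed to SubsetRandomSampler.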
Subset is the PyTorch class that the random_split method returns. Note that this approach works on lists of indices in the same way as SubsetRandomSampler.
For MNIST, if we use random_split:
import torch
import torchvision
from torch.utils.data import DataLoader

loader = DataLoader(
    torchvision.datasets.MNIST('/data/mnist', train=True, download=True,
                               transform=torchvision.transforms.Compose([
                                   torchvision.transforms.ToTensor(),
                                   torchvision.transforms.Normalize(
                                       (0.5,), (0.5,))
                               ])),
    batch_size=16, shuffle=False)
print(loader.dataset.data.shape)
test_ds, valid_ds = torch.utils.data.random_split(loader.dataset, (50000, 10000))
print(test_ds, valid_ds)
print(test_ds.indices, valid_ds.indices)
# Note: on recent PyTorch releases `indices` is a plain list; use len(test_ds.indices) there
print(test_ds.indices.shape, valid_ds.indices.shape)
We get:
torch.Size([60000, 28, 28])
<torch.utils.data.dataset.Subset object at 0x0000020FD1880B00> <torch.utils.data.dataset.Subset object at 0x0000020FD1880C50>
tensor([ 1520, 4155, 45472, ..., 37969, 45782, 34080]) tensor([ 9133, 51600, 22067, ..., 3950, 37306, 31400])
torch.Size([50000]) torch.Size([10000])
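Our test_ds.indices and valid_ds.indices will be drawn at random from the range (0, 59999). But if I would like to get sequential indices from (0, 49999) and from (50000, 59999), unfortunately I cannot do that at the moment, except this way.
That is handy if you run the MNIST benchmark, where it is predefined which part is the test dataset and which is the validation dataset.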
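One possible way to get contiguous index ranges is to build Subset objects directly from range objects instead of going through random_split. The sketch below uses a toy 60-sample TensorDataset as a stand-in for MNIST (an assumption made so the example runs without a download); `first_part` and `second_part` are illustrative names.

```python
import torch
from torch.utils.data import Subset, TensorDataset

# Toy stand-in for MNIST: 60 samples instead of 60,000 (hypothetical data).
# Each sample's single feature equals its position in the dataset.
dataset = TensorDataset(torch.arange(60).unsqueeze(1), torch.arange(60) % 10)

# Subset accepts any sequence of indices, so a range gives a sequential split.
first_part = Subset(dataset, range(0, 50))
second_part = Subset(dataset, range(50, 60))
```

With the real MNIST training set the same pattern would use range(0, 50000) and range(50000, 60000).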