Sur*_*ian 7 machine-learning computer-vision deep-learning pytorch
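I need to implement a multi-label image classification model in PyTorch. However, my data is imbalanced, so I used WeightedRandomSampler
in PyTorch to create a custom data loader. But when I iterate over the custom data loader, I get the error: IndexError: list index out of range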
def make_weights_for_balanced_classes(images, nclasses):
    count = [0] * nclasses
    for item in images:
        count[item[1]] += 1
    weight_per_class = [0.] * nclasses
    N = float(sum(count))
    for i in range(nclasses):
        weight_per_class[i] = N/float(count[i])
    weight = [0] * len(images)
    for idx, val in enumerate(images):
        weight[idx] = weight_per_class[val[1]]
    return weight
weights = make_weights_for_balanced_classes(train_dataset.imgs, len(full_dataset.classes))
weights = torch.DoubleTensor(weights)
sampler = WeightedRandomSampler(weights, len(weights))
train_loader = DataLoader(train_dataset, batch_size=4, sampler=sampler, pin_memory=True)
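One way to narrow down an IndexError like this is to check the inputs before building the sampler. The helper below is a hypothetical sketch, not part of PyTorch. It assumes the labels have already been extracted as plain integers, and it flags three conditions: labels outside range(nclasses), which would make count[item[1]] fail; a label list whose length differs from the dataset the sampler is paired with, in which case the sampler may emit indices the dataset cannot serve; and empty classes, which would cause a division by zero. Whether any of these is the actual cause here cannot be determined from the post alone.

```python
def check_sampler_inputs(labels, nclasses, dataset_len):
    """Return a list of problems that could make weighted sampling fail."""
    problems = []

    # Labels must be valid indices into a list of length nclasses.
    bad = [lab for lab in labels if not 0 <= lab < nclasses]
    if bad:
        problems.append(f"{len(bad)} label(s) outside 0..{nclasses - 1}")

    # One weight is built per label, so the counts must match the dataset.
    if len(labels) != dataset_len:
        problems.append(
            f"{len(labels)} weights would be built for a dataset of length {dataset_len}"
        )

    # A class with no samples would make N / count[i] divide by zero.
    present = set(labels)
    empty = [c for c in range(nclasses) if c not in present]
    if empty:
        problems.append(f"classes with no samples: {empty}")

    return problems


# Made-up labels: class index 4 is out of range for 4 classes.
for problem in check_sampler_inputs([0, 1, 4, 2], nclasses=4, dataset_len=3):
    print(problem)
```

An empty list means none of the three conditions was detected; it does not rule out other causes.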
Update: based on the answer in /sf/answers/4256944681/, the following is my updated code. But even then, when I create a data loader with loader = DataLoader(full_dataset, batch_size=4, sampler=sampler), len(loader) returns 1.
class_counts = [1691, 743, 2278, 1271]
num_samples = np.sum(class_counts)
labels = [tag for _,tag in full_dataset.imgs]
class_weights = [num_samples/class_counts[i] for i in range(len(class_counts))]
weights = [class_weights[labels[i]] for i in range(num_samples)]
sampler = WeightedRandomSampler(torch.DoubleTensor(weights), num_samples)
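For reference, len(loader) is the number of batches. When a sampler is passed, that number is derived from the length of the sampler (its num_samples) rather than from the dataset. With batch_size=4 and the default drop_last=False, a length of 1 suggests the sampler is yielding at most 4 indices. The sketch below illustrates this on a toy TensorDataset, which is an assumption standing in for the real image data:

```python
import torch
from torch.utils.data import DataLoader, TensorDataset, WeightedRandomSampler

# Toy stand-in for an image dataset: 10 samples, labels 0 or 1.
features = torch.zeros(10, 3)
labels = torch.tensor([0] * 9 + [1])
dataset = TensorDataset(features, labels)

weights = torch.ones(10, dtype=torch.double)

# num_samples controls how many indices the sampler yields per epoch,
# and therefore how many batches the loader reports.
lengths = {}
for num_samples in (1, 4, 10):
    sampler = WeightedRandomSampler(weights, num_samples=num_samples)
    loader = DataLoader(dataset, batch_size=4, sampler=sampler)
    lengths[num_samples] = len(loader)

print(lengths)  # {1: 1, 4: 1, 10: 3}
```

Printing len(sampler) next to len(loader) in the real code is a quick way to see which value was actually passed as num_samples.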
Thanks a lot in advance!
I added a utility function based on the accepted answer below:
def sampler_(dataset):
    dataset_counts = imageCount(dataset)
    num_samples = sum(dataset_counts)
    labels = [tag for _,tag in dataset]
    class_weights = [num_samples/dataset_counts[i] for i in range(n_classes)]
    weights = [class_weights[labels[i]] for i in range(num_samples)]
    sampler = WeightedRandomSampler(torch.DoubleTensor(weights), int(num_samples))
    return sampler
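The imageCount function finds the number of images in each class in the dataset. Each row in the dataset contains the image and its class, so we look at the second element of the tuple.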
def imageCount(dataset):
    image_count = [0]*(n_classes)
    for img in dataset:
        image_count[img[1]] += 1
    return image_count
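Iterating over the dataset as imageCount does will typically load every image just to read its label. If the labels are available as a list, as with the (path, label) pairs in the imgs attribute used in the question, the counts can be computed from those pairs alone. A minimal sketch, where count_labels is a hypothetical helper and the file names are made up:

```python
from collections import Counter


def count_labels(samples, n_classes):
    """Count samples per class from (item, label) pairs without touching the items."""
    counter = Counter(label for _, label in samples)
    # Counter returns 0 for classes that never appear.
    return [counter[c] for c in range(n_classes)]


# Made-up (path, label) pairs standing in for dataset.imgs.
samples = [("a.png", 0), ("b.png", 2), ("c.png", 0), ("d.png", 1)]
print(count_labels(samples, n_classes=3))  # [2, 1, 1]
```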
That code looks a bit complex... You can try the following:
#Let there be 9 samples and 1 sample in class 0 and 1 respectively
class_counts = [9.0, 1.0]
num_samples = sum(class_counts)
labels = [0, 0,..., 0, 1] #corresponding labels of samples
class_weights = [num_samples/class_counts[i] for i in range(len(class_counts))]
weights = [class_weights[labels[i]] for i in range(int(num_samples))]
sampler = WeightedRandomSampler(torch.DoubleTensor(weights), int(num_samples))
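The snippet above can be made runnable by spelling out the labels list. The sketch below does that, then draws many more indices than the dataset holds, purely to measure how balanced the draws are. The 10,000 draws and the seeded generator are assumptions added for the check, not part of the answer.

```python
import torch
from torch.utils.data import WeightedRandomSampler

class_counts = [9, 1]
num_samples = sum(class_counts)
labels = [0] * 9 + [1]  # nine samples of class 0, one of class 1

class_weights = [num_samples / c for c in class_counts]
weights = [class_weights[label] for label in labels]

# Oversample only to estimate the class balance of the draws.
generator = torch.Generator().manual_seed(0)
sampler = WeightedRandomSampler(
    torch.DoubleTensor(weights),
    num_samples=10_000,
    replacement=True,
    generator=generator,
)
drawn = [labels[i] for i in sampler]
fraction_class_1 = sum(drawn) / len(drawn)
print(fraction_class_1)  # expected to be close to 0.5
```

Class 0 has nine samples with weight 10/9 each and class 1 has one sample with weight 10, so both classes carry the same total weight. This is why the two classes should be drawn about equally often.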
Here is an alternative solution:
import numpy as np
from torch.utils.data.sampler import WeightedRandomSampler
counts = np.bincount(y)
labels_weights = 1. / counts
weights = labels_weights[y]
WeightedRandomSampler(weights, len(weights))
where y is a list of labels, one per sample, with shape (n_samples,) and encoded as integers in [0, ..., n_classes - 1]. The weights won't add up to 1, which is OK according to the official docs.
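To see why unnormalized weights are fine: the sampler treats them as relative, so only their ratios matter. A small NumPy check, using made-up labels, normalizes the weights by hand and confirms that each class ends up with the same total probability:

```python
import numpy as np

y = np.array([0, 0, 0, 1])       # made-up labels: three of class 0, one of class 1
counts = np.bincount(y)          # samples per class
weights = (1.0 / counts)[y]      # one weight per sample

# Normalizing by hand shows what the relative weights amount to.
probs = weights / weights.sum()
per_class = np.bincount(y, weights=probs)
print(per_class)  # each class gets the same total probability
```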