I am working on a multi-class classification task using my own images.
import tensorflow as tf
filenames = [] # a list of filenames
labels = [] # a list of labels corresponding to the filenames
full_ds = tf.data.Dataset.from_tensor_slices((filenames, labels))
The full dataset is then shuffled and split into training, validation, and test datasets:
full_ds_size = len(filenames)
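# seed is used for reproducibility; reshuffle_each_iteration=False keeps the order fixed,
# so the take/skip splits below do not exchange elements between epochs
full_ds = full_ds.shuffle(buffer_size=full_ds_size*2, seed=128, reshuffle_each_iteration=False)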
train_ds_size = int(0.64 * full_ds_size)
valid_ds_size = int(0.16 * full_ds_size)
train_ds = full_ds.take(train_ds_size)
remaining = full_ds.skip(train_ds_size)
valid_ds = remaining.take(valid_ds_size)
test_ds = remaining.skip(valid_ds_size)
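The split sizes implied by the take/skip calls above can be checked with plain arithmetic. This is only a sketch: `split_sizes` is a hypothetical helper name, and it assumes the 0.64 / 0.16 fractions from my code, with the test split receiving whatever is left over (roughly 20%).

```python
def split_sizes(full_size, train_frac=0.64, valid_frac=0.16):
    """Return (train, valid, test) element counts for a take/skip split.

    Mirrors the take/skip logic: the train and valid sizes are truncated
    with int(), and the test split gets every remaining element.
    """
    train_size = int(train_frac * full_size)
    valid_size = int(valid_frac * full_size)
    test_size = full_size - train_size - valid_size
    return train_size, valid_size, test_size


if __name__ == "__main__":
    # Example with a made-up dataset size.
    print(split_sizes(1000))
```

Because of the `int()` truncation, the test split can end up slightly larger than 20% when the dataset size is not a multiple of 25.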
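Now I am trying to understand how each class is distributed across train_ds, valid_ds, and test_ds. One ugly solution is to iterate over every element in the dataset and count the occurrences of each class. Is there a better way to do this?
My ugly solution: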
def get_class_distribution(dataset):
    class_distribution = {}
    for element in dataset.as_numpy_iterator():
        label = element[1]
        if label in class_distribution.keys():
            class_distribution[label] …
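For comparison, the same counting idea can be written more compactly with `collections.Counter`. This is a minimal sketch, not a fundamentally different approach: it still makes one pass over the data. `count_labels` is a hypothetical helper name, and it assumes each element is a `(filename, label)` pair with a hashable label, which is what I would expect `dataset.as_numpy_iterator()` to yield for the datasets above.

```python
from collections import Counter


def count_labels(pairs):
    """Count how often each label occurs in an iterable of (filename, label) pairs.

    Assumption: labels are hashable (Python ints, strings, bytes, or NumPy scalars).
    """
    return dict(Counter(label for _, label in pairs))


if __name__ == "__main__":
    # Made-up sample data standing in for the elements of one split.
    sample = [("a.jpg", 0), ("b.jpg", 1), ("c.jpg", 0)]
    print(count_labels(sample))
```

With the datasets above, this would presumably be called as `count_labels(train_ds.as_numpy_iterator())`, and likewise for `valid_ds` and `test_ds`.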