Splitting a dataset into train, test, and validation sets with HuggingFace datasets functions

Rap*_*tor 5 python huggingface-datasets

I can split my dataset into train and test sets at an 80%:20% ratio with the following:

from datasets import load_dataset
ds = load_dataset("myusername/mycorpus")
ds = ds["train"].train_test_split(test_size=0.2)  # my data on the HF Hub has only a single train split
print(ds)

This outputs:

DatasetDict({
    train: Dataset({
        features: ['translation'],
        num_rows: 62044
    })
    test: Dataset({
        features: ['translation'],
        num_rows: 15512
    })
})

How can I also generate a validation split, so that the ratio is 80%:10%:10%?

小智 5

from datasets import DatasetDict, load_dataset
ds = load_dataset("myusername/mycorpus")

train_testvalid = ds['train'].train_test_split(test_size=0.2)
# Split the held-out 20% in half: 10% test, 10% valid
test_valid = train_testvalid['test'].train_test_split(test_size=0.5)
# Gather the three splits into a single DatasetDict
ds = DatasetDict({
    'train': train_testvalid['train'],
    'test': test_valid['test'],
    'valid': test_valid['train']})

This produces a dataset with the following structure:

DatasetDict({
    train: Dataset({
        features: ['translation'],
        num_rows: 62044
    })
    test: Dataset({
        features: ['translation'],
        num_rows: 7756
    })
    valid: Dataset({
        features: ['translation'],
        num_rows: 7756
    })
})
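
To see why `test_size=0.2` followed by `test_size=0.5` gives 80%:10%:10%, here is a small sketch of the arithmetic in plain Python. `split_sizes` is a hypothetical helper, not part of `datasets`, and it assumes the test side of each split is rounded up while the train side takes the remainder. That assumption is consistent with the row counts shown above; the total of 77556 rows is inferred from 62044 + 15512 in the question.

```python
import math


def split_sizes(n_rows, test_size):
    """Return (n_train, n_test) for a fractional test_size.

    Assumption: the test side is rounded up and the train side
    receives whatever rows are left.
    """
    n_test = math.ceil(n_rows * test_size)
    return n_rows - n_test, n_test


total = 62044 + 15512  # total rows, inferred from the question's output

# First split: 80% train, 20% held out
n_train, n_holdout = split_sizes(total, 0.2)

# Second split: the held-out 20% is halved into valid and test
n_valid, n_test = split_sizes(n_holdout, 0.5)

print(n_train, n_valid, n_test)
```

Half of the 20% hold-out is 10% of the original data, which is where the two 7756-row splits come from.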

Hope this helps.