Rap*_*tor 5 python huggingface-datasets
I can split my dataset into train and test sets at an 80%:20% ratio with the following:
from datasets import load_dataset
ds = load_dataset("myusername/mycorpus")
ds = ds["train"].train_test_split(test_size=0.2)  # my dataset on the Hub has only a train split
print(ds)
This outputs:
DatasetDict({
    train: Dataset({
        features: ['translation'],
        num_rows: 62044
    })
    test: Dataset({
        features: ['translation'],
        num_rows: 15512
    })
})
How can I also generate a validation split, so that the ratio is 80%:10%:10%?
小智 5
from datasets import DatasetDict, load_dataset

ds = load_dataset("myusername/mycorpus")
# Hold out 20% of the rows for test + validation
train_testvalid = ds['train'].train_test_split(test_size=0.2)
# Split the held-out 20% in half: 10% test, 10% validation
test_valid = train_testvalid['test'].train_test_split(test_size=0.5)
# Gather the splits into a single DatasetDict
ds = DatasetDict({
    'train': train_testvalid['train'],
    'test': test_valid['test'],
    'valid': test_valid['train']})
print(ds)
This outputs a dataset with the following structure:
DatasetDict({
    train: Dataset({
        features: ['translation'],
        num_rows: 62044
    })
    test: Dataset({
        features: ['translation'],
        num_rows: 7756
    })
    valid: Dataset({
        features: ['translation'],
        num_rows: 7756
    })
})
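The row counts can be checked by hand. The sketch below is plain arithmetic, with no library calls. It assumes the held-out size is rounded up, an assumption that is consistent with the counts shown in the question but is not stated anywhere in this thread.

```python
import math

# Rows in the original train split, taken from the question's output
total_rows = 62044 + 15512

# First split: test_size=0.2 (assumption: the held-out size is rounded up)
held_out = math.ceil(0.2 * total_rows)
train_rows = total_rows - held_out

# Second split: test_size=0.5 applied to the held-out rows only
test_rows = math.ceil(0.5 * held_out)
valid_rows = held_out - test_rows

print(train_rows, valid_rows, test_rows)
```

The second `test_size` is 0.5, not 0.1, because it is a fraction of the 20% that was held out, not of the full dataset.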
Hope this helps.