I am training a sequence-to-sequence model with HuggingFace Transformers' Seq2SeqTrainer. When I run training, it reports the following warning:
/path/to/python3.9/site-packages/transformers/generation/utils.py:1219: UserWarning: You have modified the pretrained model configuration to control generation. This is a deprecated strategy to control generation and will be removed soon, in a future version. Please use a generation configuration file (see https://huggingface.co/docs/transformers/main_classes/text_generation)
Note that the HuggingFace documentation link is dead.
I am using the following code:
model = BartForConditionalGeneration.from_pretrained(checkpoint)
model.config.output_attentions = True
model.config.output_hidden_states = True
training_args = Seq2SeqTrainingArguments(
    output_dir = "output_dir_here",
    evaluation_strategy = IntervalStrategy.STEPS, # "epoch"
    optim = "adamw_torch", # Use new PyTorch optimizer
    eval_steps = 1000, # New
    logging_steps = 1000,
    save_steps = 1000,
    learning_rate = 2e-5,
    per_device_train_batch_size = batch_size,
    per_device_eval_batch_size = batch_size,
    weight_decay = 0.01,
    save_total_limit = 3,
    num_train_epochs = 30,
    predict_with_generate = True,
    remove_unused_columns = True,
    fp16 = True,
    push_to_hub = True,
    metric_for_best_model = 'bleu', # New, or "f1"
    load_best_model_at_end = True # New
)
trainer = Seq2SeqTrainer(
    model = model,
    args = training_args,
    train_dataset = train_ds,
    eval_dataset = eval_ds,
    tokenizer = tokenizer,
    data_collator = data_collator,
    compute_metrics = compute_metrics,
    callbacks = [EarlyStoppingCallback(early_stopping_patience=3)]
)
trainer.train()
Training completes without any problem, but I am concerned about the deprecation warning. How should I modify my code to resolve it?
Versions:
Root cause
This is a warning about using the API in an outdated way (i.e., one that will soon be unsupported). For now, however, the code fixes the problem on its own, which is why it is only a warning rather than a breaking error.
See these lines in the source code.
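A rough, hedged sketch of the self-fix the warning refers to (this paraphrases what the library does internally; it is not the exact source):
from transformers import BartForConditionalGeneration, GenerationConfig

model = BartForConditionalGeneration.from_pretrained("facebook/bart-large-cnn")
model.config.min_length = 1  # the "legacy" way of steering generation

# roughly what generate() does on your behalf when it detects such a change,
# which is why this is only a warning: it rebuilds the generation config
# from the (modified) model config and keeps using it
model.generation_config = GenerationConfig.from_model_config(model.config)
print(model.generation_config.min_length)  # 1 -- the change is still honoured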
Remedy
The transformers library encourages the use of configuration files. In this case, we need to pass a GenerationConfig object up front instead of setting attributes on the model config.
Let me first share a clean, simple example:
from transformers import AutoTokenizer, BartForConditionalGeneration

model = BartForConditionalGeneration.from_pretrained("facebook/bart-large-cnn")
tokenizer = AutoTokenizer.from_pretrained("facebook/bart-large-cnn")

ARTICLE_TO_SUMMARIZE = (
    "PG&E stated it scheduled the blackouts in response to forecasts for high winds "
    "amid dry conditions. The aim is to reduce the risk of wildfires. Nearly 800 thousand customers were "
    "scheduled to be affected by the shutoffs which were expected to last through at least midday tomorrow."
)
inputs = tokenizer([ARTICLE_TO_SUMMARIZE], max_length=1024, return_tensors="pt")

# change config and generate summary
from transformers.generation import GenerationConfig

model.config.max_new_tokens = 10
model.config.min_length = 1

gen_cfg = GenerationConfig.from_model_config(model.config)
gen_cfg.max_new_tokens = 10
gen_cfg.min_length = 1

summary_ids = model.generate(inputs["input_ids"], generation_config=gen_cfg)
tokenizer.batch_decode(summary_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False)[0]
If you manipulate the config attributes directly and do not pass any configuration, you get the warning. If you pass a GenerationConfig, everything is fine. This example is reproducible here as a Colab notebook.
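The "generation configuration file" mentioned in the warning is simply generation_config.json. As a small, hedged extension of the example above (the directory name below is a placeholder), a GenerationConfig can be written to and reloaded from disk:
from transformers import GenerationConfig

gen_cfg = GenerationConfig(max_new_tokens=10, min_length=1)
gen_cfg.save_pretrained("my-bart-summarizer")           # writes generation_config.json
reloaded = GenerationConfig.from_pretrained("my-bart-summarizer")
assert reloaded.max_new_tokens == 10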
Now, back to the original question. Note that, in general, changing the architectural configuration of a pretrained model is not recommended, for compatibility reasons. Sometimes it is possible with extra effort. However, some configuration values can be changed at initialization time:
model = BartForConditionalGeneration.from_pretrained(
    "facebook/bart-large-cnn",
    attention_dropout=0.123
)
Here is the complete working code, corrected for reproducibility; see also this notebook:
from transformers import AutoTokenizer, BartForConditionalGeneration
from transformers.generation import GenerationConfig
from transformers import Trainer, TrainingArguments
from transformers.models.bart.modeling_bart import shift_tokens_right
from transformers import DataCollatorForSeq2Seq

model = BartForConditionalGeneration.from_pretrained("facebook/bart-large-cnn", attention_dropout=0.123)
tokenizer = AutoTokenizer.from_pretrained("facebook/bart-large-cnn")
seq2seq_data_collator = DataCollatorForSeq2Seq(tokenizer, model=model)

def get_features(batch):
    input_encodings = tokenizer(batch["text"], max_length=1024, truncation=True)
    with tokenizer.as_target_tokenizer():
        target_encodings = tokenizer(batch["summary"], max_length=256, truncation=True)
    return {"input_ids": input_encodings["input_ids"],
            "attention_mask": input_encodings["attention_mask"],
            "labels": target_encodings["input_ids"]}

# `dataset` is assumed to be a datasets.DatasetDict with "text" and "summary"
# columns and "train"/"test" splits (as in the accompanying notebook)
dataset_ftrs = dataset.map(get_features, batched=True)
columns = ['input_ids', 'attention_mask', 'labels']
dataset_ftrs.set_format(type='torch', columns=columns)

training_args = TrainingArguments(
    output_dir='./models/bart-summarizer',
    num_train_epochs=1,
    per_device_train_batch_size=1,
    per_device_eval_batch_size=1,
    warmup_steps=500,
    weight_decay=0.01,
    logging_dir='./logs',
)

model.config.output_attentions = True
model.config.output_hidden_states = True

training_args = TrainingArguments(
    output_dir='./models/bart-summarizer',
    num_train_epochs=1,
    warmup_steps=500,
    per_device_train_batch_size=1,
    per_device_eval_batch_size=1,
    weight_decay=0.01,
    logging_steps=10,
    push_to_hub=False,
    evaluation_strategy='steps',
    eval_steps=500,
    save_steps=1e6,
    gradient_accumulation_steps=16,
)

trainer = Trainer(
    model=model,
    args=training_args,
    tokenizer=tokenizer,
    data_collator=seq2seq_data_collator,
    train_dataset=dataset_ftrs["train"],
    eval_dataset=dataset_ftrs["test"],
)
assert model.config.attention_dropout==0.123
#trainer.train()
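Finally, coming back to the Seq2SeqTrainer snippet from the question: one hedged option (assuming a transformers version in which models expose a .generation_config attribute, roughly >= 4.26, and that the two flags are only needed at generation time) is to set them on the model's generation config rather than on model.config; "facebook/bart-large-cnn" below merely stands in for the asker's checkpoint:
from transformers import BartForConditionalGeneration

model = BartForConditionalGeneration.from_pretrained("facebook/bart-large-cnn")

# set the flags on the generation config instead of model.config
model.generation_config.output_attentions = True
model.generation_config.output_hidden_states = True

# ...then build Seq2SeqTrainingArguments / Seq2SeqTrainer exactly as in the
# question; with predict_with_generate=True, generate() falls back to
# model.generation_config during evaluation, and the warning no longer fires.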