HuggingFace Transformers model configuration reports "This is a deprecated strategy to control generation and will be removed soon"

Rap*_*tor 7 python huggingface-transformers

I am training a sequence-to-sequence model with HuggingFace Transformers' Seq2SeqTrainer. When I run the training, it reports the following warning:

/path/to/python3.9/site-packages/transformers/generation/utils.py:1219: UserWarning: You have modified the pretrained model configuration to control generation. This is a deprecated strategy to control generation and will be removed soon, in a future version. Please use a generation configuration file (see https://huggingface.co/docs/transformers/main_classes/text_generation)

Note that the HuggingFace documentation link is broken.

I am using the following code:

from transformers import (
    BartForConditionalGeneration,
    EarlyStoppingCallback,
    IntervalStrategy,
    Seq2SeqTrainer,
    Seq2SeqTrainingArguments,
)

model = BartForConditionalGeneration.from_pretrained(checkpoint)
model.config.output_attentions = True
model.config.output_hidden_states = True

training_args = Seq2SeqTrainingArguments(
    output_dir = "output_dir_here",
    evaluation_strategy = IntervalStrategy.STEPS, #"epoch",
    optim = "adamw_torch", # Use new PyTorch optimizer
    eval_steps = 1000, # New
    logging_steps = 1000,
    save_steps = 1000,
    learning_rate = 2e-5,
    per_device_train_batch_size = batch_size,
    per_device_eval_batch_size = batch_size,
    weight_decay = 0.01,
    save_total_limit = 3,
    num_train_epochs = 30,
    predict_with_generate=True,
    remove_unused_columns=True,
    fp16 = True,
    push_to_hub = True,
    metric_for_best_model = 'bleu', # New or "f1"
    load_best_model_at_end = True # New
)

trainer = Seq2SeqTrainer(
    model = model,
    args = training_args,
    train_dataset = train_ds,
    eval_dataset = eval_ds,
    tokenizer = tokenizer,
    data_collator = data_collator,
    compute_metrics = compute_metrics,
    callbacks = [EarlyStoppingCallback(early_stopping_patience=3)]
)

trainer.train()

The training completes without any problems, but the deprecation warning concerns me. How should I modify the code to resolve it?

Versions:

  • transformers 4.28.1
  • Python 3.9.7

Mac*_*ski 3

Root cause

This is a warning about using the API in a way that will soon be unsupported. For now, the library fixes this up on its own, which is why it is only a warning and not a breaking error.

See these lines in the source code.

Remedy

transformers encourages using a generation configuration. In this case, we need to pass a GenerationConfig object up front, rather than setting attributes on the model config.


Let me start with a clean, simple example:

from transformers import AutoTokenizer, BartForConditionalGeneration

model = BartForConditionalGeneration.from_pretrained("facebook/bart-large-cnn")
tokenizer = AutoTokenizer.from_pretrained("facebook/bart-large-cnn")

ARTICLE_TO_SUMMARIZE = (
    "PG&E stated it scheduled the blackouts in response to forecasts for high winds "
    "amid dry conditions. The aim is to reduce the risk of wildfires. Nearly 800 thousand customers were "
    "scheduled to be affected by the shutoffs which were expected to last through at least midday tomorrow."
)
inputs = tokenizer([ARTICLE_TO_SUMMARIZE], max_length=1024, return_tensors="pt")

# change the generation config and generate a summary

from transformers.generation import GenerationConfig

# deprecated way (triggers the warning when generate() is called without
# an explicit generation_config):
#   model.config.max_new_tokens = 10
#   model.config.min_length = 1

# recommended way: build a GenerationConfig and pass it to generate()
gen_cfg = GenerationConfig.from_model_config(model.config)
gen_cfg.max_new_tokens = 10
gen_cfg.min_length = 1

summary_ids = model.generate(inputs["input_ids"], generation_config=gen_cfg)
tokenizer.batch_decode(summary_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False)[0]

If you manipulate the config attributes directly and do not pass any config, you get the warning. If you pass a GenerationConfig, all is well. This example is reproducible as a Colab notebook here.


Now, back to the original question. Note that, in general, changing the architectural configuration of a pretrained model is not recommended because of incompatibilities, although it is sometimes possible with extra effort. However, some configuration options can be changed at initialization time:

model = BartForConditionalGeneration.from_pretrained(
    "facebook/bart-large-cnn",
    attention_dropout=0.123
)

Here is the fully working code, corrected for reproducibility; see also this notebook:

from transformers import AutoTokenizer, BartForConditionalGeneration
from transformers.generation import GenerationConfig
from transformers import Trainer, TrainingArguments
from transformers.models.bart.modeling_bart import shift_tokens_right
from transformers import DataCollatorForSeq2Seq

model = BartForConditionalGeneration.from_pretrained("facebook/bart-large-cnn", attention_dropout=0.123)
tokenizer = AutoTokenizer.from_pretrained("facebook/bart-large-cnn")
seq2seq_data_collator = DataCollatorForSeq2Seq(tokenizer, model=model)
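
# ``dataset`` is loaded elsewhere in the original notebook; as a self-contained
# stand-in (an assumption on my part, not part of the original answer), a tiny
# in-memory DatasetDict with the expected "text"/"summary" columns and
# "train"/"test" splits can be built like this:
from datasets import Dataset, DatasetDict

_toy = Dataset.from_dict({
    "text": ["PG&E stated it scheduled the blackouts in response to forecasts for high winds."],
    "summary": ["PG&E scheduled the blackouts."],
})
dataset = DatasetDict({"train": _toy, "test": _toy})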

def get_features(batch):
    input_encodings = tokenizer(batch["text"], max_length=1024, truncation=True)
    
    with tokenizer.as_target_tokenizer():
        target_encodings = tokenizer(batch["summary"], max_length=256, truncation=True)
        
    return {"input_ids": input_encodings["input_ids"], 
           "attention_mask": input_encodings["attention_mask"], 
           "labels": target_encodings["input_ids"]}

dataset_ftrs = dataset.map(get_features, batched=True)
columns = ['input_ids', 'attention_mask', 'labels']
dataset_ftrs.set_format(type='torch', columns=columns)

model.config.output_attentions = True
model.config.output_hidden_states = True

training_args = TrainingArguments(
    output_dir='./models/bart-summarizer',
    num_train_epochs=1,
    warmup_steps=500,
    per_device_train_batch_size=1,
    per_device_eval_batch_size=1,
    weight_decay=0.01,
    logging_dir='./logs',
    logging_steps=10,
    push_to_hub=False,
    evaluation_strategy='steps',
    eval_steps=500,
    save_steps=1_000_000,
    gradient_accumulation_steps=16,
)

trainer = Trainer(
    model=model, 
    args=training_args, 
    tokenizer=tokenizer,                  
    data_collator=seq2seq_data_collator,                  
    train_dataset=dataset_ftrs["train"],                  
    eval_dataset=dataset_ftrs["test"],
)

assert model.config.attention_dropout==0.123

#trainer.train()
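
Finally, to apply the same idea to the Seq2SeqTrainer setup from the question, here is a minimal sketch, assuming the two flags are only needed at generation time (which is what triggers the warning when predict_with_generate=True calls generate()): attach a GenerationConfig to the model instead of mutating model.config.

from transformers import BartForConditionalGeneration
from transformers.generation import GenerationConfig

model = BartForConditionalGeneration.from_pretrained(checkpoint)

# keep model.config untouched; put generation-time flags on a GenerationConfig
gen_cfg = GenerationConfig.from_model_config(model.config)
gen_cfg.output_attentions = True
gen_cfg.output_hidden_states = True
model.generation_config = gen_cfg  # generate() uses this when no config is passed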