Understanding GPU usage for Huggingface classification - Total optimization steps

use*_*622 3 python nlp gpu huggingface-transformers

I am training a Huggingface Longformer for a classification problem and get the output below.

  1. I am confused about `Total optimization steps`. Since I have 7000 training data points, 5 epochs, and `Total train batch size (w. parallel, distributed & accumulation) = 64`, shouldn't I get 7000 * 5 / 64 = 546.875 steps? Why does it show `Total optimization steps = 545`?

  2. Why, in the output below, are there 16 `Input ids are automatically padded from 1500 to 1536 to be a multiple of config.attention_window: 512` messages at step `[ 23/545 14:24 < 5:58:16, 0.02 it/s, Epoch 0.20/5]`? What do these correspond to?

============================================================

***** Running training *****
  Num examples = 7000
  Num Epochs = 5
  Instantaneous batch size per device = 4
  Total train batch size (w. parallel, distributed & accumulation) = 64
  Gradient Accumulation steps = 16
  Total optimization steps = 545
Initializing global attention on CLS token...
Input ids are automatically padded from 1500 to 1536 to be a multiple of `config.attention_window`: 512
Initializing global attention on CLS token...
Input ids are automatically padded from 1500 to 1536 to be a multiple of `config.attention_window`: 512
Initializing global attention on CLS token...
Input ids are automatically padded from 1500 to 1536 to be a multiple of `config.attention_window`: 512
Initializing global attention on CLS token...
Input ids are automatically padded from 1500 to 1536 to be a multiple of `config.attention_window`: 512
Initializing global attention on CLS token...
Input ids are automatically padded from 1500 to 1536 to be a multiple of `config.attention_window`: 512
Initializing global attention on CLS token...
Input ids are automatically padded from 1500 to 1536 to be a multiple of `config.attention_window`: 512
Initializing global attention on CLS token...
Input ids are automatically padded from 1500 to 1536 to be a multiple of `config.attention_window`: 512
Initializing global attention on CLS token...
Input ids are automatically padded from 1500 to 1536 to be a multiple of `config.attention_window`: 512
Initializing global attention on CLS token...
Input ids are automatically padded from 1500 to 1536 to be a multiple of `config.attention_window`: 512
Initializing global attention on CLS token...
Input ids are automatically padded from 1500 to 1536 to be a multiple of `config.attention_window`: 512
Initializing global attention on CLS token...
Input ids are automatically padded from 1500 to 1536 to be a multiple of `config.attention_window`: 512
Initializing global attention on CLS token...
Input ids are automatically padded from 1500 to 1536 to be a multiple of `config.attention_window`: 512
Initializing global attention on CLS token...
Input ids are automatically padded from 1500 to 1536 to be a multiple of `config.attention_window`: 512
Initializing global attention on CLS token...
Input ids are automatically padded from 1500 to 1536 to be a multiple of `config.attention_window`: 512
Initializing global attention on CLS token...
Input ids are automatically padded from 1500 to 1536 to be a multiple of `config.attention_window`: 512
Initializing global attention on CLS token...
Input ids are automatically padded from 1500 to 1536 to be a multiple of `config.attention_window`: 512
 [ 23/545 14:24 < 5:58:16, 0.02 it/s, Epoch 0.20/5]
Epoch   Training Loss   Validation Loss

# Update

Adding my `Trainer` and `TrainingArguments`:

# class weights
import torch
from torch import nn
from transformers import Trainer, TrainingArguments

class CustomTrainer(Trainer):
    def compute_loss(self, model, inputs, return_outputs=False):
        labels = inputs.get("labels")
        # forward pass
        outputs = model(**inputs)
        logits = outputs.get("logits")
        # compute a weighted cross-entropy loss (two labels with different weights)
        # `device` is assumed to be defined elsewhere, e.g. torch.device("cuda")
        loss_fct = nn.CrossEntropyLoss(weight=torch.tensor([1.0, 0.5243])).to(device)
        loss = loss_fct(logits.view(-1, self.model.config.num_labels), labels.view(-1))
        return (loss, outputs) if return_outputs else loss

trainer = CustomTrainer(
    model=model,
    args=training_args,
    compute_metrics=compute_metrics,
    train_dataset=train_df_tuning_dataset_tokenized,
    eval_dataset=val_dataset_tokenized,
)



# define the training arguments
training_args = TrainingArguments(
    num_train_epochs=5,
    per_device_train_batch_size=4,
    gradient_accumulation_steps=16,
    per_device_eval_batch_size=16,
    evaluation_strategy="epoch",
    save_strategy="epoch",
    learning_rate=2e-5,
    load_best_model_at_end=True,
    greater_is_better=False,
    disable_tqdm=False,
    weight_decay=0.01,
    optim="adamw_torch",
    run_name="longformer-classification-16March2022",
)

Ray*_*out 5

1. Why are there 545 optimization steps?

Looking into the implementation of the `transformers` package, we see that the `Trainer` uses a variable called `max_steps` when printing the `Total optimization steps` message in its `train` method:

logger.info("***** Running training *****")
logger.info(f"  Num examples = {num_examples}")
logger.info(f"  Num Epochs = {num_train_epochs}")
logger.info(f"  Instantaneous batch size per device = {args.per_device_train_batch_size}")
logger.info(f"  Total train batch size (w. parallel, distributed & accumulation) = {total_train_batch_size}")
logger.info(f"  Gradient Accumulation steps = {args.gradient_accumulation_steps}")
logger.info(f"  Total optimization steps = {max_steps}")

Permalink to the above code snippet in the transformers repository

Earlier in the `train` method of the `Trainer`, we have the following code:

class Trainer:
    [...]
    def train(self) -> None:
        [Some irrelevant code omitted here...]

        total_train_batch_size = args.train_batch_size * args.gradient_accumulation_steps * args.world_size
        if train_dataset_is_sized:
            num_update_steps_per_epoch = len(train_dataloader) // args.gradient_accumulation_steps
            num_update_steps_per_epoch = max(num_update_steps_per_epoch, 1)
            if args.max_steps > 0:
                max_steps = args.max_steps
                num_train_epochs = args.max_steps // num_update_steps_per_epoch + int(
                    args.max_steps % num_update_steps_per_epoch > 0
                )
                # May be slightly incorrect if the last batch in the training datalaoder has a smaller size but it's
                # the best we can do.
                num_train_samples = args.max_steps * total_train_batch_size
            else:
                max_steps = math.ceil(args.num_train_epochs * num_update_steps_per_epoch)
                num_train_epochs = math.ceil(args.num_train_epochs)
                num_train_samples = len(self.train_dataset) * args.num_train_epochs

Permalink to the above code snippet in the transformers repository

In your example, `total_train_batch_size = args.train_batch_size * args.gradient_accumulation_steps * args.world_size` equals `total_train_batch_size = 4 * 16 * 1 = 64`, as expected.

Then, from `num_update_steps_per_epoch = len(train_dataloader) // args.gradient_accumulation_steps`, we get `num_update_steps_per_epoch = len(train_dataloader) // 16`.

Now, the length of a `DataLoader` is equal to the number of batches in that `DataLoader`. Since you have 7000 samples and a `per_device_train_batch_size` of 4, this gives us 7000 / 4 = 1750 batches. Going back to `num_update_steps_per_epoch`, we now have `num_update_steps_per_epoch = 1750 // 16 = 109` (Python integer division).

You did not specify a maximum number of steps, so we end up at `max_steps = math.ceil(args.num_train_epochs * num_update_steps_per_epoch)`, which gives `max_steps = math.ceil(5 * 109) = 545`.
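Putting the numbers together, here is a minimal sketch in plain Python that reproduces the reported 545 using the values from your `TrainingArguments` (the variable names are mine, not the ones used inside the `Trainer` source):

import math

num_samples = 7000                 # training examples
per_device_train_batch_size = 4
gradient_accumulation_steps = 16
world_size = 1                     # single GPU
num_train_epochs = 5

# effective batch size, as logged by the Trainer
total_train_batch_size = per_device_train_batch_size * gradient_accumulation_steps * world_size  # 64

# number of batches the DataLoader yields per epoch
len_train_dataloader = num_samples // per_device_train_batch_size                      # 1750

# optimizer updates per epoch (integer division, as in Trainer.train)
num_update_steps_per_epoch = max(len_train_dataloader // gradient_accumulation_steps, 1)  # 109

# total optimization steps reported in the log
max_steps = math.ceil(num_train_epochs * num_update_steps_per_epoch)                   # 545
print(total_train_batch_size, max_steps)  # 64 545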

2. Why is the padding operation logged 16 times?

In the Transformer architecture you are not technically required to pad all samples to the same length. What really matters is that the samples within a batch have the same length; different batches may have different lengths.
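As an illustration of this per-batch padding (not from the original post), here is a small sketch using `DataCollatorWithPadding` from `transformers`; the `allenai/longformer-base-4096` checkpoint is an assumption:

from transformers import AutoTokenizer, DataCollatorWithPadding

tokenizer = AutoTokenizer.from_pretrained("allenai/longformer-base-4096")
# pad each batch to the length of its longest sample, rounded up to a multiple of 512
collator = DataCollatorWithPadding(tokenizer, pad_to_multiple_of=512)

features = [
    tokenizer("a short document"),
    tokenizer("a slightly longer document with a few more tokens"),
]
batch = collator(features)
print(batch["input_ids"].shape)  # torch.Size([2, 512]): padded per batch, not to a global maximum

In your run the padding to a multiple of `config.attention_window` is applied inside the Longformer model itself, which is where the log message comes from; the collator above just mirrors that per-batch behaviour.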

This means that the message is shown for every batch that goes through a forward pass. As for why the message appears only 16 times even though 23 batches have apparently already gone through the forward pass, I can think of two possible reasons:

  1. The logging of the padding operation and the logging of the progress bar happen on two different threads, and the former lags slightly behind the latter.
  2. (Very unlikely) Some of your batches do not need padding, because all of their samples have the same length and that length is already a multiple of 512.
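If you want to check the second possibility, a hypothetical sketch along these lines would do; it assumes each entry of your tokenized dataset exposes its `input_ids` (the dataset name is taken from your snippet), and since the `Trainer` shuffles the training data it only gives a rough idea of which batches could skip padding:

attention_window = 512
batch_size = 4  # per_device_train_batch_size

# length of each tokenized example
lengths = [len(example["input_ids"]) for example in train_df_tuning_dataset_tokenized]

# a batch skips padding only if its longest sample is already a multiple of 512
for i in range(0, len(lengths), batch_size):
    longest = max(lengths[i:i + batch_size])
    if longest % attention_window == 0:
        print(f"batch {i // batch_size} would need no padding (max length {longest})")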