Tags: machine-learning, pytorch, huggingface-transformers, huggingface-tokenizers, huggingface
Colab: https://colab.research.google.com/drive/1poFdFYmkR_rDM5U5Z2WWjTepMQ8hvzNc?usp=sharing
The HF Falcon tutorial has the following line:
tokenizer.pad_token = tokenizer.eos_token
I find this strange: pad and eos end up being the same, so why distinguish between them in the first place?
Note that doing this makes pad = eos. It means that during fine-tuning the model will (most likely) never be trained to output eos, because eos is treated as a padding token and is not backpropagated:
I just observed that when I set tokenizer.pad_token = tokenizer.eos_token during training, the model won't stop generating during inference, since it was trained to not output the eos token (per discussions above).
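A minimal sketch of why this happens (assuming GPT-2 and the standard DataCollatorForLanguageModeling rather than the Falcon tutorial code): when pad_token == eos_token, the collator masks every eos position out of the labels, including the genuine end-of-sequence token, so no gradient ever pushes the model toward predicting eos:

from transformers import GPT2Tokenizer, DataCollatorForLanguageModeling

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
tokenizer.pad_token = tokenizer.eos_token  # pad == eos

collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=False)
text = "a" + tokenizer.eos_token  # a real end-of-sequence marker in the data
batch = collator([tokenizer(text, padding="max_length", max_length=5, truncation=True)])
print(batch["labels"])  # tensor([[64, -100, -100, -100, -100]]) -> even the real eos is labeled -100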
I saw this (here: https://github.com/huggingface/transformers/issues/22794):
tokenizer.add_special_tokens({'pad_token': '[PAD]'})
But this assumes the model has a pad_token. I think an extra check is needed to make sure it actually has an embedding for the pad_token, so that there is no runtime error (roughly, an indexing error when pulling the row out of the embedding "table"/matrix).
But if you do this, you may need to be careful to initialize the new token so that it does not dominate generation: https://nlp.stanford.edu/~johnhew/vocab-expansion.html
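For example, a rough sketch (assuming model and tokenizer are already loaded) of adding a pad token, resizing the embedding matrix, and initializing the new row to the mean of the existing embeddings so it does not dominate generation:

import torch

num_new = tokenizer.add_special_tokens({'pad_token': '[PAD]'})
model.resize_token_embeddings(len(tokenizer))  # adds rows to the embedding matrix for the new token(s)
with torch.no_grad():
    emb = model.get_input_embeddings().weight
    emb[-num_new:] = emb[:-num_new].mean(dim=0, keepdim=True)  # mean init, per the vocab-expansion note
    out = model.get_output_embeddings()
    if out is not None and out.weight is not emb:  # skip if input/output embeddings are tied
        out.weight[-num_new:] = out.weight[:-num_new].mean(dim=0, keepdim=True)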
Code:
tokenizer.pad_token = tokenizer.eos_token
Damn, this still doesn't work:
UserWarning: You have modified the pretrained model configuration to control generation. This is a deprecated strategy to control generation and will be removed soon, in a future version. Please use a generation configuration file (see https://huggingface.co/docs/transformers/main_classes/text_generation)
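That warning seems to come from writing generation settings directly onto model.config (e.g., the model.config.max_new_tokens line below). A possible fix, as the warning suggests, is to put those settings into a GenerationConfig and pass it to generate() instead. A sketch only; the values are placeholders and model, tokenizer, input_ids, attention_mask are assumed to be in scope:

from transformers import GenerationConfig

generation_config = GenerationConfig(max_new_tokens=256, do_sample=True,
                                     eos_token_id=tokenizer.eos_token_id,
                                     pad_token_id=tokenizer.pad_token_id)
out = model.generate(input_ids=input_ids, attention_mask=attention_mask,
                     generation_config=generation_config)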
Code:
"""
sfttrainer (likely using peft) best practices:
https://huggingface.co/docs/trl/main/en/sft_trainer#best-practices
Best practices
Pay attention to the following best practices when training a model with that trainer:
- SFTTrainer always pads by default the sequences to the max_seq_length argument of the SFTTrainer. If none is passed, the trainer will retrieve that value from the tokenizer. Some tokenizers do not provide default value, so there is a check to retrieve the minimum between 2048 and that value. Make sure to check it before training.
- For training adapters in 8bit, you might need to tweak the arguments of the prepare_model_for_int8_training method from PEFT, hence we advise users to use prepare_in_int8_kwargs field, or create the PeftModel outside the SFTTrainer and pass it.
- For a more memory-efficient training using adapters, you can load the base model in 8bit, for that simply add load_in_8bit argument when creating the SFTTrainer, or create a base model in 8bit outside the trainer and pass it.
- If you create a model outside the trainer, make sure to not pass to the trainer any additional keyword arguments that are relative to from_pretrained() method.
todo: why trust_remote_code? I want more details.
"""
import sys
import torch
from peft import LoraConfig
from transformers.modeling_utils import PreTrainedModel
from pdb import set_trace as st
def test_bfloat16_int4(compute_dtype: torch.dtype,
                       use_4bit,
                       ):
    """
    python -c "import torch; print(torch.cuda.get_device_capability());"
    todo: check other code test_bfloat16() do we need use_4bit?
    """
    if compute_dtype == torch.float16 and use_4bit:
        major, _ = torch.cuda.get_device_capability()
        if major >= 8:
            print("=" * 80)
            print("Your GPU supports bfloat16, you can accelerate training with the argument --bfloat16")
            print("=" * 80)
def get_model_tokenizer_qlora_falcon7b(
        # -- mode args
        # model_id = "tiiuae/falcon-7b"
        pretrained_model_name_or_path: str = "ybelkada/falcon-7b-sharded-bf16",
        use_cache: bool = True,
        # -- lora args
        lora_alpha=16,  # todo
        lora_dropout=0.1,  # todo, evidence drop out really help? google, crfm, gpt4
        lora_r=64,  # todo
        bnb_4bit_compute_dtype=torch.float16,  # changed it from Guanaco hf
        # -- training args
        output_dir="./results",
        per_device_train_batch_size=4,
        gradient_accumulation_steps=4,
        # paging so that the sudden mem gpu spikes don't cause the run to shut down
        # (I think usually caused by too long seqs)
        # todo: why 32 bit opt?
        # todo: paged nadamw opt?
        optim="paged_adamw_32bit",
        save_steps=10,
        logging_steps=10,
        learning_rate=2e-4,
        max_grad_norm=0.3,
        max_steps=500,
        warmup_ratio=0.03,
        lr_scheduler_type="constant",
        # -- quant. args (not recommended to be changed unless you know what you're doing?)
        load_in_4bit=True,  # load (usually huge) base model in 4 bits
        bnb_4bit_quant_type="nf4",  # normal float 4 for the (large) base models qlora
) -> tuple:
"""
Load the Falcon 7B model, quantize it in 4bit and attach LoRA adapters on it.
bf16 = 1S, 7Exp, 8Mantissa
hypothesis: 7b trained due to 6.7 emergence rumour, I still don't think emergence is real.
Notes:
- ft a model is very specific to the model, tokenizer and training scheme. Thus we return
- model, tokenizer, ft config (peft config), training args
ref:
- https://colab.research.google.com/drive/1DOi8MFv4SWN9NImVornZ7t6BgmLoPQO-#scrollTo=AjB0WAqFSzlD
"""
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig, AutoTokenizer
# - Get bnb config for bit-4 base model (bnb lib for using 4bit qlora quantization techniques by tim dettmers)
bnb_config = BitsAndBytesConfig(
load_in_4bit=load_in_4bit, # load (usually huge) base model in 4 bits
bnb_4bit_quant_type=bnb_4bit_quant_type, # normal float 4 for the (usually huge) base model
bnb_4bit_compute_dtype=bnb_4bit_compute_dtype, # if you can, during computation use bf16
)
# - Get falcon 4bit model
# todo, where is this being saved & how to download quicker
model = AutoModelForCausalLM.from_pretrained(
pretrained_model_name_or_path=pretrained_model_name_or_path,
quantization_config=bnb_config,
trust_remote_code=True # allows to execute custom code you download from the uploaded model code you are using
)
    print(f'{type(model)=}')
    print(f'{model=}')
    # this is here to save gpu vram. Likely only needed when using 40b or when oom issues happen ref: /sf/ask/5364333481/
    model.config.use_cache = use_cache
    print(f'{type(model)=}')

    # - Get falcon tokenizer
    tokenizer = AutoTokenizer.from_pretrained(pretrained_model_name_or_path,
                                              trust_remote_code=True)  # execs code downloaded from hf hub
    # tokenizer.pad_token = tokenizer.eos_token  # ref: /sf/ask/5364335791/
    # tokenizer.add_special_tokens({'pad_token': '[PAD]'})  # I think this is fine if during the training pad is ignored
    tokenizer.add_special_tokens({'pad_token': '<|pad|>'})  # I think this is fine if during the training pad is ignored

    # - Modify model
    # add pad token embed
    model.resize_token_embeddings(len(tokenizer))  # todo: I think this is fine if during the training pad is ignored
    model.transformer.word_embeddings.padding_idx = len(tokenizer) - 1
    model.config.max_new_tokens = len(tokenizer)
    # model.config.min_length = 1
    print(f'{model=}')
    print(f'{type(tokenizer)=}')
    print(f'{tokenizer.pad_token=}')
    # data_collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=False)  todo

    # - Get falcon lora config
    peft_config = LoraConfig(
        lora_alpha=lora_alpha,
        lora_dropout=lora_dropout,
        r=lora_r,
        bias="none",
        task_type="CAUSAL_LM",
        # model card for falcon tiiuae/falcon-7b: https://huggingface.co/tiiuae/falcon-7b/blob/main/modelling_RW.py
        # does seem to include all trainable params as done by qlora on their own paper
        target_modules=[
            # word_embeddings,
            "query_key_value",
            "dense",
            "dense_h_to_4h",
            "dense_4h_to_h",
            # "lm_head"
        ]
    )
    print(f'{type(peft_config)=}')
    # todo: print the num params of the lora = D1*r + D2*r and num of bytes by prec. (bytes) * num params
    return model, tokenizer, peft_config
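
# -- Hypothetical usage sketch (not part of the original script): one way the returned tuple
# could be fed into trl's SFTTrainer, following the best-practices notes in the module docstring.
# `trainset` and max_seq_length are placeholders; the hyperparameters mirror the defaults above.
#
#   from transformers import TrainingArguments
#   from trl import SFTTrainer
#   model, tokenizer, peft_config = get_model_tokenizer_qlora_falcon7b()
#   training_args = TrainingArguments(
#       output_dir="./results", per_device_train_batch_size=4, gradient_accumulation_steps=4,
#       optim="paged_adamw_32bit", save_steps=10, logging_steps=10, learning_rate=2e-4,
#       max_grad_norm=0.3, max_steps=500, warmup_ratio=0.03, lr_scheduler_type="constant",
#   )
#   trainer = SFTTrainer(model=model, train_dataset=trainset, peft_config=peft_config,
#                        dataset_text_field="text", max_seq_length=512, tokenizer=tokenizer,
#                        args=training_args)
#   trainer.train()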
# -- tests
def example_test_model_already_has_pad_token():
    """
    if it already has pad token, it likely has a small prob, so we are done.
    compare its norm with other tokens to verify this is true.

    python ~/ultimate-utils/ultimate-utils-proj-src/uutils/hf_uu/model_tokenizer/falcon_uu_mdl_tok.py
    """
    # - the get datasets todo: preprocessing, padding, streaming
    from uutils.hf_uu.data_hf.common import get_guanaco_datsets_add_splits_train_test_only
    trainset, _, testset = get_guanaco_datsets_add_splits_train_test_only()

    # qlora falcon7b
    from uutils.hf_uu.model_tokenizer.falcon_uu_mdl_tok import get_model_tokenizer_qlora_falcon7b
    model, tokenizer, peft_config = get_model_tokenizer_qlora_falcon7b()
    model: PreTrainedModel = model
    print(f'{model=}')

    sent = 'Dogs are great because they are '
    print()

    # print to see if pad tokens are present and if it ignores the tokens at the end
    encoded_input = tokenizer(sent, padding='max_length', max_length=10, return_tensors='pt')
    print(f'{encoded_input=}')

    # Print all special tokens
    print('\n---- start Print all special tokens')
    for token_name, token in tokenizer.special_tokens_map.items():
        print(f"{token_name}: {token}")
    print('\n---- end Print all special tokens')

    # Get the ID for the '[PAD]' token
    try:
        pad_token_id = tokenizer.convert_tokens_to_ids('[PAD]')
    except KeyError:
        raise ValueError("Token [PAD] is not present in the tokenizer vocabulary.")

    # Index into the model's embedding table
    try:
        print(f'{model.get_input_embeddings().weight.size()=}')
        pad_embedding = model.get_input_embeddings().weight[pad_token_id]
    except IndexError:
        raise ValueError(f"Token ID {pad_token_id} is not present in the model's embedding matrix.")
    print(f'{pad_embedding=}')
    print('Success!\n')

    # check it generates something sensible
    # tokenizer.decode(model.generate(**tokenizer(sent, return_tensors='pt'), do_sample=True)[0])
    input_ids, attention_mask = encoded_input['input_ids'], encoded_input['attention_mask']
    predicted_tokens_ids_options = model.generate(input_ids=input_ids, attention_mask=attention_mask, do_sample=True)
    predicted_tokens_ids = predicted_tokens_ids_options[0]
    predicted_sent = tokenizer.decode(predicted_tokens_ids)
    print(f'original sentence: {sent=}')
    print(f'predicted sentence: {predicted_sent=}')
    print('Success2!')


if __name__ == '__main__':
    import time

    start_time = time.time()
    example_test_model_already_has_pad_token()
    print(f"The main function executed in {time.time() - start_time} seconds.\a")
It doesn't like the modifications to the model:
model.transformer.word_embeddings.padding_idx = len(tokenizer) - 1
model.config.max_new_tokens = len(tokenizer)
How do I fix this?

Error:
/lfs/hyperturing1/0/brando9/miniconda/envs/data_quality/lib/python3.10/site-packages/transformers/generation/utils.py:1259: UserWarning: You have modified the pretrained model configuration to control generation. This is a deprecated strategy to control generation and will be removed soon, in a future version. Please use a generation configuration file (see https://huggingface.co/docs/transformers/main_classes/text_generation)
warnings.warn(
Setting `pad_token_id` to `eos_token_id`:11 for open-end generation.
/lfs/hyperturing1/0/brando9/miniconda/envs/data_quality/lib/python3.10/site-packages/transformers/generation/utils.py:1452: UserWarning: You are calling .generate() with the `input_ids` being on a device type different than your model's device. `input_ids` is on cpu, whereas the model is on cuda. You may experience unexpected behaviors or slower generation. Please make sure that you have put `input_ids` to the correct device by calling for example input_ids = input_ids.to('cuda') before running `.generate()`.
warnings.warn(
Traceback (most recent call last):
File "/lfs/hyperturing1/0/brando9/ultimate-utils/ultimate-utils-proj-src/uutils/hf_uu/model_tokenizer/falcon_uu_mdl_tok.py", line 211, in <module>
example_test_model_already_has_pad_token()
File "/lfs/hyperturing1/0/brando9/ultimate-utils/ultimate-utils-proj-src/uutils/hf_uu/model_tokenizer/falcon_uu_mdl_tok.py", line 199, in example_test_model_already_has_pad_token
predicted_tokens_ids_options = model.generate(input_ids=input_ids, attention_mask=attention_mask, do_sample=True)
File "/lfs/hyperturing1/0/brando9/miniconda/envs/data_quality/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
return func(*args, **kwargs)
File "/lfs/hyperturing1/0/brando9/miniconda/envs/data_quality/lib/python3.10/site-packages/transformers/generation/utils.py", line 1572, in generate
return self.sample(
File "/lfs/hyperturing1/0/brando9/miniconda/envs/data_quality/lib/python3.10/site-packages/transformers/generation/utils.py", line 2633, in sample
next_token_scores = logits_warper(input_ids, next_token_scores)
File "/lfs/hyperturing1/0/brando9/miniconda/envs/data_quality/lib/python3.10/site-packages/transformers/generation/logits_process.py", line 92, in __call__
scores = processor(input_ids, scores)
File "/lfs/hyperturing1/0/brando9/miniconda/envs/data_quality/lib/python3.10/site-packages/transformers/generation/logits_process.py", line 302, in __call__
indices_to_remove = scores < torch.topk(scores, top_k)[0][..., -1, None]
RuntimeError: "topk_cpu" not implemented for 'Half'
Yes, I agree that with pad assigned to eos, EOS is still EOS. But during fine-tuning the weights for eos now never change. That can be a problem, because the probability of eos has not been shifted toward the fine-tuning regime. One possibility is that eos ends up with too small a chance of being produced. Yes, we can still stop generation when we see eos, but we have not shifted the probability of outputting eos according to our fine-tuning distribution, while every other token's distribution has changed. I think this can be a problem: it is not as if the old eos probability is preserved, since all token probabilities except eos have changed, and even if the old eos probability were preserved, it would be the wrong distribution (not the fine-tuning distribution).
For example,
if tokenizer.pad_token_id is None:
    tokenizer.pad_token = tokenizer.eos_token
...
raw_text_batch='a'
tokenize_batch={'input_ids': tensor([[ 64, 50256, 50256, 50256, 50256]]), 'attention_mask': tensor([[1, 0, 0, 0, 0]])}
But it would be better to have (one way to get this is sketched after the test code below):
tokenize_batch={'input_ids': tensor([[ 64, 50256, 50256, 50256, 50256]]), 'attention_mask': tensor([[1, 1, 0, 0, 0]])}
Code:
def test_eos_pad():
    from datasets import load_dataset
    import torch
    from transformers import GPT2Tokenizer, GPT2LMHeadModel

    raw_text_batch = 'a'

    tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
    # print(f'{tokenizer.eos_token=}')
    # print(f'{tokenizer.eos_token_id=}')
    # print(f'{tokenizer.pad_token=}')
    # print(f'{tokenizer.pad_token_id=}')
    # print(f'{raw_text_batch=}')
    # tokenize_batch = tokenizer(raw_text_batch, padding="max_length", max_length=5, truncation=True, return_tensors="pt")
    # print(f'{tokenize_batch=}')
    if tokenizer.pad_token_id is None:
        tokenizer.pad_token = tokenizer.eos_token

    probe_network = GPT2LMHeadModel.from_pretrained("gpt2")
    device = torch.device(f"cuda:{0}" if torch.cuda.is_available() else "cpu")
    probe_network = probe_network.to(device)

    print(f'{tokenizer.eos_token=}')
    print(f'{tokenizer.eos_token_id=}')
    print(f'{tokenizer.pad_token=}')
    print(f'{tokenizer.pad_token_id=}')
    print(f'{raw_text_batch=}')
    tokenize_batch = tokenizer(raw_text_batch, padding="max_length", max_length=5, truncation=True, return_tensors="pt")
    print(f'{tokenize_batch=}')
    print('Done')
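One way to get the [1, 1, 0, 0, 0] mask above (a sketch, assuming GPT-2 as in the test): append the eos token to the raw text before tokenizing, so the first eos is a real token that gets attended to and can be trained on, while the remaining eos tokens are pure padding:

text_with_eos = raw_text_batch + tokenizer.eos_token  # 'a<|endoftext|>'
tokenize_batch = tokenizer(text_with_eos, padding="max_length", max_length=5, truncation=True, return_tensors="pt")
print(f'{tokenize_batch=}')
# expected: input_ids=[[64, 50256, 50256, 50256, 50256]], attention_mask=[[1, 1, 0, 0, 0]]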
Fork:
Ok, I think this is code that trains on the first occurrence of eos and makes sure the remaining ones are not trained on (feedback welcome):
import torch
from transformers import PreTrainedTokenizer


def collate_fn_train_only_first_eos_token_mask_everything_after_it(data: list[dict[str, str]],
                                                                   tokenizer: PreTrainedTokenizer,
                                                                   max_length: int = 1024,  # GPT2 default, likely worth you change it! This default might cause bugs.
                                                                   ) -> dict[str, torch.Tensor]:
    """ Train only on the first occurrence of eos. The remaining eos are masked out.

    Sometimes the model might not have a padding token. Sometimes people set the padding token to be the eos token.
    But sometimes this seems to lead the model to predict the eos token too much.
    So instead of actually using the pad token that was set to the eos token, we instead mask out all excessive eos tokens that act as pads
    and leave the first eos token at the end to be predicted -- since that is the only one that semantically means end of sequence --
    and thereby, by masking the random eos at the end and not training on them, we don't unnecessarily shift/amplify the distribution of eos.

    ref: https://discuss.huggingface.co/t/why-does-the-falcon-qlora-tutorial-code-use-eos-token-as-pad-token/45954/13?u=brando
    ref: https://chat.openai.com/share/02d16770-a1f3-4bf4-8fc2-464286daa8a1
    ref: https://claude.ai/chat/80565d1f-ece3-4fad-87df-364ce57aec15 on when to call .clone()
    """
    # we are training full context length for llama so remove code below, if it tries to pad hopefully it throws an error
    # -- Ensure tokenizer has a padding token
    if tokenizer.pad_token is None:
        tokenizer.pad_token = tokenizer.eos_token
    # -- Extract sequences
    # sequences: list[str] = [example.get("text", "") or "" for example in data]
    sequences: list[str] = []
    for idx, example in enumerate(data):
        # Retrieve the value for "text" from the dictionary or default to an empty string if not present or falsy. ref: https://chat.openai.com/share/bead51fe-2acf-4f05-b8f7-b849134bbfd4
        text: str = example.get("text", "") or ""
        sequences.append(text)
    # -- Tokenize the sequences
    tokenized_data = tokenizer(sequences, padding="max_length", max_length=max_length, truncation=True, return_tensors="pt")
    tokenized_data["labels"] = tokenized_data["input_ids"].clone()  # labels is hardcoded in HF so put it!
    # -- Set the mask value for the first eos_token in each sequence to 1 and the labels of the remaining eos to -100
    eos_token_id = tokenizer.eos_token_id
    for idx, input_ids in enumerate(tokenized_data["input_ids"]):
        # Find all occurrences of eos_token
        eos_positions = (input_ids == eos_token_id).nonzero(as_tuple=True)[0]
        if eos_positions.nelement() > 0:  # Check if eos_token is present
            first_eos_position = eos_positions[0]
            tokenized_data["attention_mask"][idx, first_eos_position] = 1  # Set the mask value to 1
            # Assert that the label for the first occurrence of eos_token is eos_token_id
            assert tokenized_data["labels"][idx, first_eos_position] == eos_token_id, "The label for the first eos_token is incorrect!"
            # For all subsequent occurrences of eos_token, set their labels to -100
            for subsequent_eos_position in eos_positions[1:]:
                tokenized_data["labels"][idx, subsequent_eos_position] = -100
                assert tokenized_data["labels"][idx, subsequent_eos_position] == -100, "The label for the subsequent_eos_position is incorrect! Should be -100."
    return tokenized_data
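A possible way to hook this collate function into training (a sketch, assuming a dataset whose examples have a "text" field; tokenizer, trainset, model and training_args here are placeholders from earlier):

from functools import partial
from torch.utils.data import DataLoader

collate_fn = partial(collate_fn_train_only_first_eos_token_mask_everything_after_it,
                     tokenizer=tokenizer, max_length=512)
train_loader = DataLoader(trainset, batch_size=4, shuffle=True, collate_fn=collate_fn)
# or with a HF Trainer:
# Trainer(model=model, args=training_args, train_dataset=trainset, data_collator=collate_fn)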
Reference: Why does the Falcon QLoRA tutorial code use eos_token as pad_token?