Understanding Huggingface GPT2 loss

Question

Ale*_*nen 2 pytorch huggingface-transformers gpt-2

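(Also posted here: https://discuss.huggingface.co/t/newbie-understanding-gpt2-loss/33590)

I am stuck on understanding the GPT2 loss. I want to give the model labels that contain the target it will generate, so that I can see the loss come out as zero.

I have an input text, input_text = "Welcome to New York". The current model predicts the next word as City. If I give input_text as the labels, the loss is never zero. How do I simulate giving the label "Welcome to New York City" so that the internal neural net (whatever the model) gives a zero or near-zero loss?

To explain what I mean in more detail, here is the snippet.

Note - I have read in the forum and the docs that the labels can be the same as the input text, that the model shifts the labels to the left, and that no loss is calculated for the last token. Even so, the loss should be zero, and it is not.

Labels for language modeling. Note that the labels are shifted inside the model, i.e. you can set labels = input_ids...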
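
The shift mentioned in the note above can be sketched without loading the model. This is a minimal illustration, assuming the usual causal-LM convention in which the prediction made at position i is scored against labels[i + 1]. The helper shifted_pairs and the token strings are hypothetical stand-ins for the real token ids.

```python
def shifted_pairs(input_tokens, label_tokens):
    """Pair each input position with the label it is scored against.

    Assumed convention: the prediction at position i is compared with
    labels[i + 1], so the first label is never used and the last input
    position has nothing to be compared with.
    """
    return list(zip(input_tokens[:-1], label_tokens[1:]))


# Labels identical to the input: each position is asked for its true next token.
same = shifted_pairs(
    ["Welcome", " to", " New", " York"],
    ["Welcome", " to", " New", " York"],
)

# Labels already moved one step by hand, as in the first snippet of the question.
manual = shifted_pairs(
    ["<|endoftext|>", " Welcome", " to", " New", " York"],
    ["Welcome", " to", " New", " York", " City"],
)

for name, pairs in (("same", same), ("manual", manual)):
    print(name)
    for context_token, scored_label in pairs:
        print(f"  after {context_token!r} the model is scored on {scored_label!r}")
```

Under this assumption, labels equal to the input score three predictions and never ask for City. Labels that were already moved by one step get moved once more, so the prediction after Welcome is compared with New instead of to. That would be consistent with the higher loss in the first snippet, although it is an inference from the convention rather than something measured here.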

from transformers import GPT2LMHeadModel, GPT2Tokenizer

model_name = 'gpt2'
tokenizer = GPT2Tokenizer.from_pretrained(model_name,model_max_length=1024,padding_side='left')
tokenizer.pad_token = tokenizer.eos_token # == <|endoftext|> = 50256
model = GPT2LMHeadModel.from_pretrained(model_name)

batch_size=5
input_text  = "<|endoftext|> Welcome to New York"
target_text = "Welcome to New York City"

# encode the inputs
encoding = tokenizer(input_text,padding=True,max_length=batch_size,truncation=True,return_tensors="pt",)
input_ids, attention_mask = encoding.input_ids, encoding.attention_mask
# encode the targets
target_encoding = tokenizer(target_text,padding=True, max_length=batch_size, truncation=True,return_tensors="pt",)
labels = target_encoding.input_ids
# replace padding token id's of the labels by -100 so it's ignored by the loss
labels[labels == tokenizer.pad_token_id] = -100  # in our case there is no padding
print(f"input_ids={input_ids}")
print(f"attention_mask={attention_mask}") # all ones
print(f"labels ={labels}")
# forward pass
outputs = model(input_ids=input_ids,labels=labels) 
print(f"Model Loss {outputs.loss}")
# Test the model to check what it predicts next
outputs = model.generate(input_ids=input_ids, attention_mask=attention_mask,max_new_tokens=1)
answer = tokenizer.decode(outputs[0], skip_special_tokens=False)
print(f"Result '{answer}'")

Output

input_ids=tensor([[50256, 19134,   284,   968,  1971]]) # not sure what eostoken (50256) in input does to model
attention_mask=tensor([[1, 1, 1, 1, 1]])
labels =tensor([[14618,   284,   968,  1971,  2254]]) # 2254 = City;  which is that the model should predict
Model Loss 8.248174667358398
Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.
Result '<|endoftext|> Welcome to New York City'

Whereas when I try the "correct" thing, as is done everywhere,

input_text  = "Welcome to New York"
target_text = input_text

I get a loss of about 3.26:

input_ids=tensor([[14618,   284,   968,  1971]]) # 1971 = York
attention_mask=tensor([[1, 1, 1, 1]])
labels =tensor([[14618,   284,   968,  1971]])
Model Loss 3.2614505290985107
Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.
Result 'Welcome to New York City'

Is it that the model is generating more than 1 token?

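Update -

Based on the answer by Jindfitch. I am putting this here because SO moderators deleted it when I tried to add it as an answer.

You can try to fine-tune the model to be absolutely sure that City would follow with 100% probability

I trained GPT2 on this specific text (training only the last 2 layers and freezing the others), took the checkpoint with the lowest loss, and tested with it again. Sure enough, the loss is much lower - Model Loss 0.01076329406350851

For anyone else who would like to follow along, the training code is below.

Note that with training on such a small text, and the way I have done it, I am not entirely sure it is correct, because the training loss seems to jump around a bit (it increases after some epochs, in this case at epoch 8).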

2023-03-12 16:03:20,579 [INFO] Epoch 7 complete. Loss: 0.18975284695625305 saving ./test/gpt2-epoch-8-2023-03-12 16:02:19.289492
2023-03-12 16:03:20,985 [INFO] Epoch 9 of 10
2023-03-12 16:03:27,655 [INFO] Epoch 8 complete. Loss: 0.3775772750377655 saving ./test/gpt2-epoch-9-2023-03-12 16:02:19.289492
2023-03-12 16:03:27,655 [INFO] Epoch 10 of 10
2023-03-12 16:03:34,140 [INFO] Epoch 9 complete. Loss: 6.827305332990363e-05 saving ./test/gpt2-epoch-10-2023-03-12 16:02:19.289492

Training script - https://github.com/alexcpn/tranformer_learn/blob/gpt-loss-learn/gpt2_train_model.py

Training output log - https://github.com/alexcpn/tranformer_learn/blob/gpt-loss-learn/training/training_2023-03-12%2016%3A02%3A19.289492.log

Training data: Welcome to New York City (with a trailing space) - https://github.com/alexcpn/tranformer_learn/blob/gpt-loss-learn/data/small.txt

Evaluation script - https://github.com/alexcpn/tranformer_learn/blob/gpt-loss-learn/older/gpt2_loss_learn.py

When giving the model the input for generation, I removed the token corresponding to "City" from the input_ids.

# remove the last token off for input-id's as well as attention Mask
input_ids = input_ids[:,:-1] # input_text  = "Welcome to New York"
attention_mask = attention_mask[:,:-1]
print(f"input_ids={input_ids}")
outputs = model.generate(input_ids=input_ids, attention_mask=attention_mask,max_new_tokens=1)

Evaluation script output

python3 ./older/gpt2_loss_learn.py 
input_ids=tensor([[14618,   284,   968,  1971,  2254]])
attention_mask=tensor([[1, 1, 1, 1, 1]])
labels =tensor([[14618,   284,   968,  1971,  2254]])
Model Loss 0.01076329406350851
input_ids=tensor([[14618,   284,   968,  1971]])
Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.
Result 'Welcome to New York City'

A more illustrative example: https://github.com/alexcpn/tranformer_learn/blob/gpt-loss-learn/LLM_Loss_Understand.ipynb

Answer 1

Jin*_*ich 5

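The default loss function is negative log-likelihood. The actual model output is not the token City but a categorical distribution over the entire 50k vocabulary. Depending on the generation strategy, you either sample from these distributions or take the most probable token.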
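
To make the difference between "the most probable token" and "a token with probability near 1" concrete, here is a small sketch. The four-token vocabulary and the logits are made up for illustration and are not GPT-2 outputs. softmax is written out by hand so that the example needs only the standard library.

```python
import math
import random


def softmax(logits):
    """Turn raw scores into a categorical distribution that sums to 1."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical vocabulary and logits for the position after " York".
vocab = [" City", " Times", ",", " and"]
logits = [3.0, 2.0, 1.5, 0.5]
probs = softmax(logits)

# Greedy decoding: take the most probable token.
greedy = vocab[max(range(len(probs)), key=probs.__getitem__)]

# Sampling: draw a token in proportion to its probability.
rng = random.Random(0)
sampled = rng.choices(vocab, weights=probs, k=1)[0]

print("distribution:", {t: round(p, 3) for t, p in zip(vocab, probs)})
print("greedy pick:", repr(greedy))
print("sampled pick:", repr(sampled))
print("loss if the label is ' City':", round(-math.log(probs[0]), 3))
```

In this toy distribution the greedy pick is City with a probability of roughly 0.6, so generation returns City every time while the loss for that label is still clearly above zero.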

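The token City is apparently the most probable one, and it gets some probability; the loss is then minus the logarithm of that probability. A loss close to zero would mean the token's probability is close to 1. However, the token distribution also covers many plausible but less likely follow-ups. A loss of 3.26 corresponds to a probability of exp(-3.26), which is about 3.8%. That looks small, but in a 50k vocabulary it is about 2000 times more probable than a random guess.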
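
The arithmetic in the paragraph above can be checked directly. This sketch assumes the GPT-2 vocabulary size of 50257 for the "50k" figure; loss_to_prob and prob_to_loss are hypothetical helper names. The loss reported for a sequence is normally a mean over the scored tokens, so exp(-loss) is best read as a geometric mean of per-token probabilities rather than the probability of one particular token.

```python
import math

VOCAB_SIZE = 50257  # assumed GPT-2 vocabulary size (the "50k" above)


def loss_to_prob(loss):
    """Probability implied by a negative log-likelihood value."""
    return math.exp(-loss)


def prob_to_loss(prob):
    """Negative log-likelihood of an event with the given probability."""
    return -math.log(prob)


p = loss_to_prob(3.2614505290985107)  # loss reported in the question
uniform = 1.0 / VOCAB_SIZE            # probability of a uniform random guess

print(f"implied probability: {p:.4f}")
print(f"ratio to a uniform guess: {p / uniform:.0f}")
print(f"loss of a uniform guess: {prob_to_loss(uniform):.2f}")
print(f"loss at probability 0.99: {prob_to_loss(0.99):.4f}")
```

The last line shows why a loss of exactly zero is not reachable in practice: even a token predicted with 99% probability still contributes a small positive loss.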

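You can try to fine-tune the model to be absolutely sure that City follows with 100% probability, but that would probably break its other language modeling capabilities.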
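
What such fine-tuning does to the loss can be sketched on a toy problem. This is not GPT-2 training: it runs plain gradient descent directly on four made-up logits, assuming a softmax output and a negative log-likelihood loss, for which the gradient with respect to the logits is the probability vector minus the one-hot target. The names nll_step and target are hypothetical.

```python
import math


def softmax(logits):
    """Turn raw scores into a categorical distribution that sums to 1."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def nll_step(logits, target, lr):
    """One gradient-descent step on -log softmax(logits)[target].

    Returns the updated logits and the loss measured before the update.
    """
    probs = softmax(logits)
    loss = -math.log(probs[target])
    new_logits = [
        z - lr * (p - (1.0 if i == target else 0.0))
        for i, (z, p) in enumerate(zip(logits, probs))
    ]
    return new_logits, loss


logits = [0.0, 0.0, 0.0, 0.0]  # toy 4-token vocabulary, uniform start
target = 0                     # index standing in for " City"
losses = []
for _ in range(200):
    logits, loss = nll_step(logits, target, lr=1.0)
    losses.append(loss)

final_probs = softmax(logits)
print(f"first loss {losses[0]:.4f}, last loss {losses[-1]:.4f}")
print("final distribution:", [round(p, 4) for p in final_probs])
```

The loss keeps shrinking but stays positive, in line with the small non-zero loss reported in the update to the question. The probability of every other token is pushed toward zero at the same time, which is the toy version of the concern about breaking other language modeling capabilities.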
