Ale*_*nen 2 pytorch huggingface-transformers gpt-2
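(Also posted at https://discuss.huggingface.co/t/newbie-understanding-gpt2-loss/33590)
I am stuck trying to understand the GPT2 loss. I want to give the model labels whose target is exactly what it will generate, so that I can see the loss go to zero.
I have an input text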
input_text = "Welcome to New York"
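The model currently predicts the next word as City
If I give input_text as the labels, the loss is never zero. How can I simulate giving "Welcome to New York City" as the label, so that the network internally (whatever the model) gives a zero or near-zero loss?
To explain what I mean in more detail, here is the snippet.
Note - I have read in the forum and in the documentation that the labels can be the same as the input text, that the model shifts the labels to the left, and that the loss is not calculated for the last token. Even so, the loss should be zero, and it is not.
"Labels for language modeling. Note that the labels are shifted inside the model, i.e. you can set labels = input_ids ..."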
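The shift described in that documentation sentence can be sketched in plain Python. This is a toy illustration, not the library's code: the function name shifted_mean_nll, the 3-token vocabulary and its probabilities are all made up. It assumes the loss is the mean negative log-likelihood over the scored positions, where the prediction made at position i is compared with labels[i + 1] and labels equal to -100 are skipped.

```python
import math

def shifted_mean_nll(probs, labels, ignore_index=-100):
    """Mean negative log-likelihood with the labels shifted by one.

    probs[i][t] is the (toy) probability assigned to token t as the
    token that follows position i.
    """
    losses = []
    for i in range(len(labels) - 1):      # the last position has no target
        target = labels[i + 1]
        if target == ignore_index:        # e.g. padding replaced by -100
            continue
        losses.append(-math.log(probs[i][target]))
    return sum(losses) / len(losses)

# Made-up predictions over a 3-token vocabulary for a 3-token sequence.
probs = [
    [0.1, 0.8, 0.1],  # prediction after position 0
    [0.2, 0.2, 0.6],  # prediction after position 1
    [0.3, 0.3, 0.4],  # prediction after position 2: never scored
]
labels = [0, 1, 2]

loss = shifted_mean_nll(probs, labels)
masked_loss = shifted_mean_nll(probs, [0, -100, 2])

# The loss is exactly zero only if every scored target gets probability 1.
certain = [[0.0, 1.0, 0.0], [0.0, 0.0, 1.0], [1.0, 0.0, 0.0]]
zero_loss = shifted_mean_nll(certain, labels)

print(loss, masked_loss, zero_loss)
```

In this sketch, labels[0] is never used as a target and the prediction at the last position is never scored, which is what "shifted inside the model" amounts to.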
from transformers import GPT2LMHeadModel, GPT2Tokenizer
model_name = 'gpt2'
tokenizer = GPT2Tokenizer.from_pretrained(model_name,model_max_length=1024,padding_side='left')
tokenizer.pad_token = tokenizer.eos_token # == <|endoftext|> = 50256
model = GPT2LMHeadModel.from_pretrained(model_name)
batch_size=5
input_text = "<|endoftext|> Welcome to New York"
target_text = "Welcome to New York City"
# encode the inputs
encoding = tokenizer(input_text,padding=True,max_length=batch_size,truncation=True,return_tensors="pt",)
input_ids, attention_mask = encoding.input_ids, encoding.attention_mask
# encode the targets
target_encoding = tokenizer(target_text,padding=True, max_length=batch_size, truncation=True,return_tensors="pt",)
labels = target_encoding.input_ids
# replace padding token id's of the labels by -100 so it's ignored by the loss
labels[labels == tokenizer.pad_token_id] = -100 # in our case there is no padding
print(f"input_ids={input_ids}")
print(f"attention_mask={attention_mask}") # all ones
print(f"labels ={labels}")
# forward pass
outputs = model(input_ids=input_ids,labels=labels)
print(f"Model Loss {outputs.loss}")
# Test the model to check what it predicts next
outputs = model.generate(input_ids=input_ids, attention_mask=attention_mask,max_new_tokens=1)
answer = tokenizer.decode(outputs[0], skip_special_tokens=False)
print(f"Result '{answer}'")
Output
input_ids=tensor([[50256, 19134, 284, 968, 1971]]) # not sure what eostoken (50256) in input does to model
attention_mask=tensor([[1, 1, 1, 1, 1]])
labels =tensor([[14618, 284, 968, 1971, 2254]]) # 2254 = City; which is what the model should predict
Model Loss 8.248174667358398
Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.
Result '<|endoftext|> Welcome to New York City'
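One way to read the tensors above, assuming the internal shift compares the prediction at position i with labels[i + 1], is to list which (context token, target token) pairs actually enter the loss. The helper scored_pairs below is hypothetical and only does this bookkeeping on the token ids printed in the post. It does not call the model.

```python
def scored_pairs(input_ids, labels, ignore_index=-100):
    """Return the (context_token, target_token) pairs that enter the loss.

    Assumes the prediction made at position i is compared with
    labels[i + 1], so the last position has no target.
    """
    pairs = []
    for i in range(len(input_ids) - 1):
        target = labels[i + 1]
        if target != ignore_index:
            pairs.append((input_ids[i], target))
    return pairs

# Tensors printed in the post: inputs and labels are different sequences.
input_ids = [50256, 19134, 284, 968, 1971]
labels = [14618, 284, 968, 1971, 2254]
mismatched = scored_pairs(input_ids, labels)

# Labels identical to the inputs, with City (2254) included in both.
full = [14618, 284, 968, 1971, 2254]
aligned = scored_pairs(full, full)

for name, pairs in (("mismatched", mismatched), ("aligned", aligned)):
    print(name)
    for context, target in pairs:
        print(f"  after token {context} the target is {target}")
```

Under that assumption, the mismatched run asks for 284 (to) right after 50256 (<|endoftext|>) and for 968 (New) right after 19134 (Welcome), so every target sits one step ahead of its context, which would be consistent with the high loss. The prediction after 1971 (York), where City would appear, is not scored at all. With identical inputs and labels, the pair (1971, 2254) scores City after York.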
When I try to do it the proper way, as is done everywhere, I get a loss of about 3.26
input_text = "Welcome to New York"
target_text = input_text
Model Loss 3.2614505290985107
Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.
Result 'Welcome to New York City'
The tensors in this run are
input_ids=tensor([[14618, 284, 968, 1971]]) # 1971 = York
attention_mask=tensor([[1, 1, 1, 1]])
labels =tensor([[14618, 284, 968, 1971]])
Is it that the model is generating more than 1 token?
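Based on Jindfitch's answer - I am putting this here because the SO moderators deleted it when I tried to add it as an answer.
"You can try to fine-tune the model to be absolutely sure that City will follow with 100% probability"
I trained GPT2 on this specific text (training only the last 2 layers and freezing the others), took the checkpoint with the lowest loss and tested with it again. Sure enough, the loss is much lower - Model Loss 0.01076329406350851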
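Freezing everything except the last two layers could look roughly like the sketch below. It is not taken from the linked training script. It assumes that "last 2 layers" means the last two transformer blocks (model.transformer.h[-2:]), and it uses a tiny, randomly initialised GPT2Config with made-up sizes so that it runs without downloading weights. For the real experiment one would presumably load GPT2LMHeadModel.from_pretrained('gpt2') instead.

```python
from transformers import GPT2Config, GPT2LMHeadModel

# Tiny random model for illustration only; all sizes here are made up.
config = GPT2Config(vocab_size=100, n_positions=32, n_embd=32, n_layer=4, n_head=2)
model = GPT2LMHeadModel(config)

# Freeze every parameter first.
for param in model.parameters():
    param.requires_grad = False

# Then unfreeze only the last two transformer blocks (an assumption about
# what "the last 2 layers" refers to).
for block in model.transformer.h[-2:]:
    for param in block.parameters():
        param.requires_grad = True

total = sum(p.numel() for p in model.parameters())
trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
print(f"trainable parameters: {trainable} of {total}")
```

An optimizer would then be given only the parameters with requires_grad set to True, so the embeddings and the earlier blocks stay as they were.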
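For anyone else who wants to follow along, the training code is below.
Note that with training on such a small text, and given the way I did it, I am not entirely sure it is correct, because the training loss seems to jump around a bit (it increases after some epochs, in this case epoch 8)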
2023-03-12 16:03:20,579 [INFO] Epoch 7 complete. Loss: 0.18975284695625305 saving ./test/gpt2-epoch-8-2023-03-12 16:02:19.289492
2023-03-12 16:03:20,985 [INFO] Epoch 9 of 10
2023-03-12 16:03:27,655 [INFO] Epoch 8 complete. Loss: 0.3775772750377655 saving ./test/gpt2-epoch-9-2023-03-12 16:02:19.289492
2023-03-12 16:03:27,655 [INFO] Epoch 10 of 10
2023-03-12 16:03:34,140 [INFO] Epoch 9 complete. Loss: 6.827305332990363e-05 saving ./test/gpt2-epoch-10-2023-03-12 16:02:19.289492
Training script - https://github.com/alexcpn/tranformer_learn/blob/gpt-loss-learn/gpt2_train_model.py
Training data
Welcome to New York City (with a trailing space)
https://github.com/alexcpn/tranformer_learn/blob/gpt-loss-learn/data/small.txt
Evaluation script - https://github.com/alexcpn/tranformer_learn/blob/gpt-loss-learn/older/gpt2_loss_learn.py
When giving the model the input to generate from, I removed the token corresponding to "City" from the input_ids
# remove the last token off for input-id's as well as attention Mask
input_ids = input_ids[:,:-1] # input_text = "Welcome to New York"
attention_mask = attention_mask[:,:-1]
print(f"input_ids={input_ids}")
outputs = model.generate(input_ids=input_ids, attention_mask=attention_mask,max_new_tokens=1)
Evaluation script output
python3 ./older/gpt2_loss_learn.py
input_ids=tensor([[14618, 284, 968, 1971, 2254]])
attention_mask=tensor([[1, 1, 1, 1, 1]])
labels =tensor([[14618, 284, 968, 1971, 2254]])
Model Loss 0.01076329406350851
input_ids=tensor([[14618, 284, 968, 1971]])
Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.
Result 'Welcome to New York City'
A more illustrative example: https://github.com/alexcpn/tranformer_learn/blob/gpt-loss-learn/LLM_Loss_Understand.ipynb
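The default loss function is the negative log-likelihood. The actual model output is not the token City but a categorical distribution over the entire 50k vocabulary. Depending on the generation strategy, you either sample from these distributions or take the most probable token.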
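The difference between the distribution and the chosen token can be sketched with a toy example. The four-entry vocabulary and the logits below are made up for illustration and are not GPT-2 outputs. The sketch only shows a softmax followed by greedy selection, by sampling, and by the negative log-likelihood of one target.

```python
import math
import random

def softmax(logits):
    """Turn raw scores into a categorical distribution."""
    peak = max(logits)                       # subtract the max for stability
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical toy vocabulary and made-up scores for the next token.
vocab = [" City", " State", " Times", ","]
logits = [3.0, 1.5, 1.0, 0.5]
probs = softmax(logits)

# Greedy decoding: take the most probable token.
greedy = vocab[probs.index(max(probs))]

# Sampling: draw tokens in proportion to their probability.
rng = random.Random(0)
sampled = rng.choices(vocab, weights=probs, k=5)

# Negative log-likelihood of the target " City" under this distribution.
nll_city = -math.log(probs[0])

print(greedy, sampled, round(nll_city, 3))
```

Greedy decoding returns " City" here even though its probability is well below 1, so its negative log-likelihood stays above zero.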
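The token City, apparently the most probable one, gets some probability, and the loss is the negative logarithm of that probability. A loss close to zero would mean that the token's probability is close to 1. However, the token distribution also accounts for many plausible but less likely continuations. A loss of 3.26 corresponds to a probability of exp(-3.26), which is about 3.8%. That looks small, but in a 50k vocabulary it is about 2000 times more probable than a random guess.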
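The arithmetic can be checked directly. The sketch below converts the two losses reported in the post into probabilities and compares the first one with a uniform guess. It assumes the GPT-2 vocabulary size of 50257, and loss_to_prob is just an illustrative helper name. Because the reported loss is a mean over the scored positions, the resulting value is best read as a geometric-mean probability per scored token.

```python
import math

VOCAB_SIZE = 50257  # GPT-2 vocabulary size, the "50k" mentioned above

def loss_to_prob(loss):
    """Invert loss = -log(p) for a mean negative log-likelihood."""
    return math.exp(-loss)

p_pretrained = loss_to_prob(3.2614505290985107)   # loss of the original model
p_finetuned = loss_to_prob(0.01076329406350851)   # loss after fine-tuning

uniform = 1 / VOCAB_SIZE              # probability of a random guess
ratio = p_pretrained / uniform        # how much better than guessing
uniform_loss = math.log(VOCAB_SIZE)   # loss of a uniform distribution

print(f"pretrained: {p_pretrained:.4f}, finetuned: {p_finetuned:.4f}")
print(f"ratio to uniform: {ratio:.0f}, uniform loss: {uniform_loss:.2f}")
```

For comparison, a model that guessed uniformly would have a loss of log(50257), so both reported losses are far below that baseline.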
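You can try to fine-tune the model to be absolutely sure that City follows with 100% probability, but that would probably break its other language modeling capabilities.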