pil*_*ilu 6 python nlp huggingface-transformers large-language-model huggingface-evaluate
I have a large collection of documents, each consisting of roughly 10 sentences. For each document, I want to find the sentence that maximizes perplexity, or equivalently the loss of a fine-tuned causal LM. I decided to use Hugging Face and the distilgpt2 model for this. When trying to do it efficiently (vectorized), I run into two problems:
The tokenizer needs padding to work in batch mode, but when computing the loss on padded input_ids, the pad tokens contribute to the loss. As a result, the loss of a given sentence depends on the length of the longest sentence in the batch, which is clearly wrong.
When I pass a batch of input IDs to the model and compute the loss, I get a scalar, because it (mean-?)pools across the batch. What I need instead is the loss per item, not the pooled loss.
I wrote a version that runs sentence by sentence; it is correct but extremely slow (I want to process ~25 million sentences in total). Any suggestions?
Minimal example below:
import spacy
from transformers import AutoModelForCausalLM, AutoTokenizer

# Init
tokenizer = AutoTokenizer.from_pretrained("distilgpt2")
tokenizer.pad_token = tokenizer.eos_token
model = AutoModelForCausalLM.from_pretrained("clm-gpu/checkpoint-138000")
segmenter = spacy.load('en_core_web_sm')

# That's the part I need to vectorise, surely within a document (bsize ~ 10)
# and ideally across documents (bsize as big as my GPU can handle)
def select_sentence(sentences):
    """We pick the sentence that maximizes perplexity"""
    max_loss, best_index = 0, 0
    for i, sentence in enumerate(sentences):
        encodings = tokenizer(sentence, return_tensors="pt")
        input_ids = encodings.input_ids
        loss = model(input_ids, labels=input_ids).loss.item()
        if loss > max_loss:
            max_loss = loss
            best_index = i
    return sentences[best_index]

for document in documents:
    sentences = [sentence.text.strip() for sentence in segmenter(document).sents]
    best_sentence = select_sentence(sentences)
    write(best_sentence)
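For reference, the per-item loss asked for above can also be computed manually: take the token-level cross-entropy with no reduction and average it only over non-pad positions, so padding no longer affects a sentence's loss. A minimal sketch with random logits standing in for the model output (the helper name `per_sentence_loss` is made up for illustration):

```python
import torch
import torch.nn.functional as F

def per_sentence_loss(logits, input_ids, attention_mask):
    # Shift so that tokens < n predict token n, as causal LMs do internally.
    shift_logits = logits[:, :-1, :]
    shift_labels = input_ids[:, 1:]
    shift_mask = attention_mask[:, 1:].float()
    # Token-level cross-entropy with no pooling (reduction="none").
    token_loss = F.cross_entropy(
        shift_logits.transpose(1, 2), shift_labels, reduction="none"
    )
    # Average only over real (non-pad) tokens, separately per sequence.
    return (token_loss * shift_mask).sum(dim=1) / shift_mask.sum(dim=1)

# Tiny fake batch: vocab of 5, two sequences, second one padded at the end.
logits = torch.randn(2, 4, 5)
input_ids = torch.tensor([[1, 2, 3, 4], [1, 2, 0, 0]])
attention_mask = torch.tensor([[1, 1, 1, 1], [1, 1, 0, 0]])
losses = per_sentence_loss(logits, input_ids, attention_mask)
print(losses.shape)  # one loss per sentence: torch.Size([2])
```

In a real run, `logits` would come from `model(input_ids, attention_mask=attention_mask).logits`.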
alv*_*vas 10
If the goal is to compute perplexity and then select sentences, there is a better way to do the perplexity computation without messing around with tokenizers/models yourself.
Install https://huggingface.co/spaces/evaluate-metric/perplexity:
pip install -U evaluate
Then:
import evaluate

perplexity = evaluate.load("perplexity", module_type="metric")
input_texts = ["lorem ipsum", "Happy Birthday!", "Bienvenue"]
results = perplexity.compute(model_id='gpt2',
                             add_start_token=False,
                             predictions=input_texts)
print(list(results.keys()))
[Out]:
>>>['perplexities', 'mean_perplexity']
print(round(results["mean_perplexity"], 2))
>>>646.75
print(round(results["perplexities"][0], 2))
>>>32.25
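Since `results["perplexities"]` holds one value per input string, the question's `select_sentence` reduces to an argmax over that list. A sketch (the perplexity values here are made-up example numbers, and the `compute` call is shown as a comment because it loads a model):

```python
def select_sentence(sentences, perplexities):
    """Pick the sentence with the highest per-sentence perplexity."""
    best_index = max(range(len(sentences)), key=lambda i: perplexities[i])
    return sentences[best_index]

# With the metric (loads/downloads a model, so shown as a comment):
# results = perplexity.compute(model_id="distilgpt2",
#                              add_start_token=False,
#                              predictions=sentences)
# best = select_sentence(sentences, results["perplexities"])

print(select_sentence(["lorem ipsum", "Happy Birthday!", "Bienvenue"],
                      [32.25, 646.75, 5.0]))  # → Happy Birthday!
```

Batching then happens inside `perplexity.compute` (it has a `batch_size` argument), so you can pass all sentences of a document, or of many documents, in one call.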
What about model_id=...? A: For that, let's take a look under the hood at https://huggingface.co/spaces/evaluate-metric/perplexity/blob/main/perplexity.py
This is how the code initializes the model:
class Perplexity(evaluate.Metric):
    def _info(self):
        return evaluate.MetricInfo(
            module_type="metric",
            description=_DESCRIPTION,
            citation=_CITATION,
            inputs_description=_KWARGS_DESCRIPTION,
            features=datasets.Features(
                {
                    "predictions": datasets.Value("string"),
                }
            ),
            reference_urls=["https://huggingface.co/docs/transformers/perplexity"],
        )

    def _compute(
        self, predictions, model_id, batch_size: int = 16, add_start_token: bool = True, device=None, max_length=None
    ):
        ...
        model = AutoModelForCausalLM.from_pretrained(model_id)
        model = model.to(device)
        tokenizer = AutoTokenizer.from_pretrained(model_id)
        ...
What if we make a simple change to that code =)
See Load a pre-trained model from disk with Huggingface Transformers
class Perplexity(evaluate.Metric):
    def _info(self):
        return evaluate.MetricInfo(
            module_type="metric",
            description=_DESCRIPTION,
            citation=_CITATION,
            inputs_description=_KWARGS_DESCRIPTION,
            features=datasets.Features(
                {
                    "predictions": datasets.Value("string"),
                }
            ),
            reference_urls=["https://huggingface.co/docs/transformers/perplexity"],
        )

    def _compute(
        self, predictions, model_id, batch_size: int = 16, add_start_token: bool = True, device=None, max_length=None, local_file_only: bool = False
    ):
        ...
        model = AutoModelForCausalLM.from_pretrained(model_id, local_files_only=local_file_only)
        model = model.to(device)
        tokenizer = AutoTokenizer.from_pretrained(model_id, local_files_only=local_file_only)
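The reason this one-line change to `_compute` is enough is that extra keyword arguments passed to the metric's `compute()` are forwarded through to `_compute()`. A minimal stub (these classes are hypothetical, not the real evaluate API) illustrating that forwarding:

```python
class Metric:
    def compute(self, **kwargs):
        # Extra keyword arguments are passed straight through to _compute.
        return self._compute(**kwargs)

class Perplexity(Metric):
    def _compute(self, predictions, model_id, local_file_only=False):
        # A real metric would load the model here; we just echo the kwargs.
        return {"model_id": model_id, "local": local_file_only}

result = Perplexity().compute(model_id="clm-gpu/checkpoint-138000",
                              predictions=["hello"],
                              local_file_only=True)
print(result["local"])  # → True
```

So once `_compute` accepts `local_file_only`, callers can supply it directly in `perplexity.compute(...)`.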
Technically speaking, if you can load a local model, you can load it with:
AutoModelForCausalLM.from_pretrained("clm-gpu/checkpoint-138000", local_files_only=True)
After that code change, you should be able to pass the checkpoint path as model_id:
perplexity.compute(model_id="clm-gpu/checkpoint-138000",
                   add_start_token=False,
                   predictions=input_texts,
                   local_file_only=True)
Opened a pull request for this: https://huggingface.co/spaces/evaluate-metric/perplexity/discussions/4