How to compute sentence-level perplexity with a Hugging Face language model?

pil*_*ilu 6 python nlp huggingface-transformers large-language-model huggingface-evaluate

I have a large collection of documents, each consisting of roughly 10 sentences. For each document I want to find the sentence that maximizes perplexity, or equivalently the loss of a fine-tuned causal LM. I decided to use Hugging Face and the distilgpt2 model for this. I run into two problems when trying to do it efficiently (vectorized):

  1. The tokenizer needs padding to work in batch mode, but when computing the loss on the padded input_ids, the pad tokens contribute to the loss. So the loss of a given sentence depends on the length of the longest sentence in the batch, which is clearly wrong.

  2. When I pass a batch of input IDs to the model and compute the loss, I get back a single scalar, because it is pooled (the mean?) across the batch. What I need instead is the loss per item, not the pooled loss (see the demonstration right below).
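
A quick demonstration of problem 2 (a sketch, using the stock distilgpt2 weights rather than my fine-tuned checkpoint): with labels= the model returns one scalar loss, mean-pooled over every token in the batch, padding included:

from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("distilgpt2")
tokenizer.pad_token = tokenizer.eos_token
model = AutoModelForCausalLM.from_pretrained("distilgpt2")

# Two sentences of very different lengths, padded to the same length
batch = tokenizer(["One short sentence.",
                   "A much, much longer sentence than the first one."],
                  padding=True, return_tensors="pt")
out = model(batch.input_ids, labels=batch.input_ids)
print(out.loss.shape)  # torch.Size([]) -- one scalar for the whole batch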

I wrote a version that runs sentence by sentence; it is correct, but painfully slow (I want to process roughly 25 million sentences in total). Any suggestions?

Minimal example below:

# Init
import spacy
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("distilgpt2")
tokenizer.pad_token = tokenizer.eos_token
model = AutoModelForCausalLM.from_pretrained("clm-gpu/checkpoint-138000")
segmenter = spacy.load('en_core_web_sm')

# That's the part I need to vectorise, surely within a document (bsize ~ 10)
# and ideally across documents (bsize as big as my GPU can handle)
def select_sentence(sentences):
    """We pick the sentence that maximizes perplexity"""
    max_loss, best_index = 0, 0
    for i, sentence in enumerate(sentences):
        encodings = tokenizer(sentence, return_tensors="pt")
        input_ids = encodings.input_ids
        loss = model(input_ids, labels=input_ids).loss.item()
        if loss > max_loss:
            max_loss = loss
            best_index = i

    return sentences[best_index]

for document in documents:
    sentences = [sentence.text.strip() for sentence in segmenter(document).sents]
    best_sentence = select_sentence(sentences)
    write(best_sentence)


alv*_*vas 10

If the goal is to compute perplexity and then select the sentences, there's a better way to do the perplexity computation without messing around with the tokenizer/model yourself.

Install https://huggingface.co/spaces/evaluate-metric/perplexity:

pip install -U evaluate

Then:

import evaluate

perplexity = evaluate.load("perplexity", module_type="metric")
input_texts = ["lorem ipsum", "Happy Birthday!", "Bienvenue"]

results = perplexity.compute(model_id='gpt2',
                             add_start_token=False,
                             predictions=input_texts)
print(list(results.keys()))


[out]:

>>>['perplexities', 'mean_perplexity']
print(round(results["mean_perplexity"], 2))
>>>646.75
print(round(results["perplexities"][0], 2))
>>>32.25
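Since results["perplexities"] holds one value per input string, picking the highest-perplexity sentence of a document becomes a one-liner (a sketch reusing the input_texts from above):

scores = results["perplexities"]
best_sentence = input_texts[scores.index(max(scores))]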

Q: That's nice, but how do I use it with a custom model that cannot be fetched with model_id=...?

A: For that, let's take a closer look at https://huggingface.co/spaces/evaluate-metric/perplexity/blob/main/perplexity.py

This is how the code initializes the model:

class Perplexity(evaluate.Metric):
    def _info(self):
        return evaluate.MetricInfo(
            module_type="metric",
            description=_DESCRIPTION,
            citation=_CITATION,
            inputs_description=_KWARGS_DESCRIPTION,
            features=datasets.Features(
                {
                    "predictions": datasets.Value("string"),
                }
            ),
            reference_urls=["https://huggingface.co/docs/transformers/perplexity"],
        )

    def _compute(
        self, predictions, model_id, batch_size: int = 16, add_start_token: bool = True, device=None, max_length=None
    ):
        ...
        model = AutoModelForCausalLM.from_pretrained(model_id)
        model = model.to(device)

        tokenizer = AutoTokenizer.from_pretrained(model_id)
        ...

Oops, local models are not supported!

What if we make some simple changes to the code =)

See Load a pre-trained model from disk with Huggingface Transformers


class Perplexity(evaluate.Metric):
    def _info(self):
        return evaluate.MetricInfo(
            module_type="metric",
            description=_DESCRIPTION,
            citation=_CITATION,
            inputs_description=_KWARGS_DESCRIPTION,
            features=datasets.Features(
                {
                    "predictions": datasets.Value("string"),
                }
            ),
            reference_urls=["https://huggingface.co/docs/transformers/perplexity"],
        )

    def _compute(
        self, predictions, model_id, batch_size: int = 16, add_start_token: bool = True, device=None, max_length=None, local_file_only: bool = False
    ):
        ...
        model = AutoModelForCausalLM.from_pretrained(model_id, local_files_only=local_file_only)
        model = model.to(device)

        tokenizer = AutoTokenizer.from_pretrained(model_id, local_files_only=local_file_only)

Technically, if you can load the local model with:

AutoModelForCausalLM.from_pretrained("clm-gpu/checkpoint-138000", local_file_only=True)

then after the code change, you should be able to pass the same local path as the model_id:

perplexity.compute(model_id="clm-gpu/checkpoint-138000",
                   add_start_token=False,
                   predictions=input_texts,
                   local_file_only=True)
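Note that until this change is merged upstream, evaluate.load("perplexity", ...) will still fetch the unpatched script from the Hub. One workaround is to save the patched perplexity.py next to your code and load it from disk, since evaluate.load also accepts a local path to a metric script (the filename/location here is just an assumption):

import evaluate

# Load the locally patched script instead of the Hub version
perplexity = evaluate.load("./perplexity.py", module_type="metric")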

Opened a pull request: https://huggingface.co/spaces/evaluate-metric/perplexity/discussions/4
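
Alternatively, if you would rather not touch the metric at all, the same per-sentence losses can be computed directly with transformers. A minimal sketch (mine, not code from the evaluate source): take per-token losses with reduction="none" and mask out the pad positions with the attention mask, which addresses both problems from the question:

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("distilgpt2")
tokenizer.pad_token = tokenizer.eos_token
model = AutoModelForCausalLM.from_pretrained("distilgpt2")
model.eval()

def per_sentence_loss(sentences):
    enc = tokenizer(sentences, padding=True, return_tensors="pt")
    input_ids, attn = enc.input_ids, enc.attention_mask
    with torch.no_grad():
        logits = model(input_ids, attention_mask=attn).logits
    # Shift so that each position predicts the *next* token
    shift_logits = logits[:, :-1, :]
    shift_labels = input_ids[:, 1:]
    shift_mask = attn[:, 1:].float()
    # Per-token cross entropy, shape (batch, seq_len - 1)
    loss_fct = torch.nn.CrossEntropyLoss(reduction="none")
    token_loss = loss_fct(shift_logits.transpose(1, 2), shift_labels)
    # Average per sentence over real (non-pad) tokens only
    return (token_loss * shift_mask).sum(dim=1) / shift_mask.sum(dim=1)

sentences = ["A short sentence.", "A noticeably longer example sentence."]
losses = per_sentence_loss(sentences)
print(sentences[int(losses.argmax())])  # the max-perplexity sentence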