How to compute sentence-level perplexity with a Hugging Face language model?

pil*_*ilu 6 python nlp huggingface-transformers large-language-model huggingface-evaluate

I have a large collection of documents, each consisting of roughly 10 sentences. For each document I want to find the sentence that maximizes perplexity, or equivalently the loss of a fine-tuned causal LM. I decided to use Hugging Face and the distilgpt2 model for this. I run into two problems when trying to do it efficiently (vectorized):

  1. The tokenizer needs padding to work in batch mode, but when computing the loss on the padded input_ids, the pad tokens contribute to the loss. So the loss of a given sentence depends on the length of the longest sentence in the batch, which is clearly wrong.

  2. When I pass a batch of input IDs to the model and compute the loss, I get back a single scalar, because it is pooled (the mean?) across the batch. What I need instead is the loss per item, not the pooled loss (see the demonstration right below).
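
A quick demonstration of problem 2 (a sketch, using the stock distilgpt2 weights rather than my fine-tuned checkpoint): with labels= the model returns one scalar loss, mean-pooled over every token in the batch, padding included:

from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("distilgpt2")
tokenizer.pad_token = tokenizer.eos_token
model = AutoModelForCausalLM.from_pretrained("distilgpt2")

# Two sentences of very different lengths, padded to the same length
batch = tokenizer(["One short sentence.",
                   "A much, much longer sentence than the first one."],
                  padding=True, return_tensors="pt")
out = model(batch.input_ids, labels=batch.input_ids)
print(out.loss.shape)  # torch.Size([]) -- one scalar for the whole batch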

I wrote a version that runs sentence by sentence; it is correct, but painfully slow (I want to process roughly 25 million sentences in total). Any suggestions?

Minimal example below:

# Init
import spacy
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("distilgpt2")
tokenizer.pad_token = tokenizer.eos_token
model = AutoModelForCausalLM.from_pretrained("clm-gpu/checkpoint-138000")
segmenter = spacy.load('en_core_web_sm')

# That's the part I need to vectorise, surely within a document (bsize ~ 10)
# and ideally across documents (bsize as big as my GPU can handle)
def select_sentence(sentences):
    """We pick the sentence that maximizes perplexity"""
    max_loss, best_index = 0, 0
    for i, sentence in enumerate(sentences):
        encodings = tokenizer(sentence, return_tensors="pt")
        input_ids = encodings.input_ids
        loss = model(input_ids, labels=input_ids).loss.item()
        if loss > max_loss:
            max_loss = loss
            best_index = i

    return sentences[best_index]

for document in documents:
    sentences = [sentence.text.strip() for sentence in segmenter(document).sents]
    best_sentence = select_sentence(sentences)
    write(best_sentence)


alv*_*vas 10

If the goal is to compute perplexity and then select the sentences, there's a better way to do the perplexity computation without messing around with the tokenizer/model yourself.

Install https://huggingface.co/spaces/evaluate-metric/perplexity:

pip install -U evaluate

Then:

import evaluate

perplexity = evaluate.load("perplexity", module_type="metric")
input_texts = ["lorem ipsum", "Happy Birthday!", "Bienvenue"]

results = perplexity.compute(model_id='gpt2',
                             add_start_token=False,
                             predictions=input_texts)
print(list(results.keys()))


[out]:

>>>['perplexities', 'mean_perplexity']
print(round(results["mean_perplexity"], 2))
>>>646.75
print(round(results["perplexities"][0], 2))
>>>32.25
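Since results["perplexities"] holds one value per input string, picking the highest-perplexity sentence of a document becomes a one-liner (a sketch reusing the input_texts from above):

scores = results["perplexities"]
best_sentence = input_texts[scores.index(max(scores))]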

Q: That's nice, but how do I use it with a custom model that cannot be fetched with model_id=...?

A: For that, let's take a closer look at https://huggingface.co/spaces/evaluate-metric/perplexity/blob/main/perplexity.py

This is how the code initializes the model:

class Perplexity(evaluate.Metric):
    def _info(self):
        return evaluate.MetricInfo(
            module_type="metric",
            description=_DESCRIPTION,
            citation=_CITATION,
            inputs_description=_KWARGS_DESCRIPTION,
            features=datasets.Features(
                {
                    "predictions": datasets.Value("string"),
                }
            ),
            reference_urls=["https://huggingface.co/docs/transformers/perplexity"],
        )

    def _compute(
        self, predictions, model_id, batch_size: int = 16, add_start_token: bool = True, device=None, max_length=None
    ):
        ...
        model = AutoModelForCausalLM.from_pretrained(model_id)
        model = model.to(device)

        tokenizer = AutoTokenizer.from_pretrained(model_id)
        ...

Oops, local models are not supported!

What if we make some simple changes to the code =)

See Load a pre-trained model from disk with Huggingface Transformers


class Perplexity(evaluate.Metric):
    def _info(self):
        return evaluate.MetricInfo(
            module_type="metric",
            description=_DESCRIPTION,
            citation=_CITATION,
            inputs_description=_KWARGS_DESCRIPTION,
            features=datasets.Features(
                {
                    "predictions": datasets.Value("string"),
                }
            ),
            reference_urls=["https://huggingface.co/docs/transformers/perplexity"],
        )

    def _compute(
        self, predictions, model_id, batch_size: int = 16, add_start_token: bool = True, device=None, max_length=None, local_file_only: bool = False
    ):
        ...
        model = AutoModelForCausalLM.from_pretrained(model_id, local_files_only=local_file_only)
        model = model.to(device)

        tokenizer = AutoTokenizer.from_pretrained(model_id, local_files_only=local_file_only)

Technically, if you can load the local model with:

AutoModelForCausalLM.from_pretrained("clm-gpu/checkpoint-138000", local_file_only=True)

then after the code change, you should be able to pass the same local path as the model_id:

perplexity.compute(model_id="clm-gpu/checkpoint-138000",
                   add_start_token=False,
                   predictions=input_texts,
                   local_file_only=True)
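Note that until this change is merged upstream, evaluate.load("perplexity", ...) will still fetch the unpatched script from the Hub. One workaround is to save the patched perplexity.py next to your code and load it from disk, since evaluate.load also accepts a local path to a metric script (the filename/location here is just an assumption):

import evaluate

# Load the locally patched script instead of the Hub version
perplexity = evaluate.load("./perplexity.py", module_type="metric")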

Opened a pull request: https://huggingface.co/spaces/evaluate-metric/perplexity/discussions/4
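
Alternatively, if you would rather not touch the metric at all, the same per-sentence losses can be computed directly with transformers. A minimal sketch (mine, not code from the evaluate source): take per-token losses with reduction="none" and mask out the pad positions with the attention mask, which addresses both problems from the question:

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("distilgpt2")
tokenizer.pad_token = tokenizer.eos_token
model = AutoModelForCausalLM.from_pretrained("distilgpt2")
model.eval()

def per_sentence_loss(sentences):
    enc = tokenizer(sentences, padding=True, return_tensors="pt")
    input_ids, attn = enc.input_ids, enc.attention_mask
    with torch.no_grad():
        logits = model(input_ids, attention_mask=attn).logits
    # Shift so that each position predicts the *next* token
    shift_logits = logits[:, :-1, :]
    shift_labels = input_ids[:, 1:]
    shift_mask = attn[:, 1:].float()
    # Per-token cross entropy, shape (batch, seq_len - 1)
    loss_fct = torch.nn.CrossEntropyLoss(reduction="none")
    token_loss = loss_fct(shift_logits.transpose(1, 2), shift_labels)
    # Average per sentence over real (non-pad) tokens only
    return (token_loss * shift_mask).sum(dim=1) / shift_mask.sum(dim=1)

sentences = ["A short sentence.", "A noticeably longer example sentence."]
losses = per_sentence_loss(sentences)
print(sentences[int(losses.argmax())])  # the max-perplexity sentence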