Generate the probabilities of all possible next words for a given text

dat*_*ger 7 text pytorch huggingface-transformers gpt-2

I have the following code:

import transformers
from transformers import pipeline

# Load the language model pipeline
model = pipeline("text-generation", model="gpt2")

# Input sentence for generating next word predictions
input_sentence = "I enjoy walking in the"

I only want to generate the next word for the given input sentence, but I want to see a list of all possible next words together with their probabilities. Any other LLM would work as well; I am just using GPT-2 as an example.

In the code, I would like to select only the top 500 or top 1000 suggestions for the next word, along with the probability of each suggested word. How can I do that?

Rua*_*uan 8

We have to go a bit lower level here, because the pipeline function is not suited to what you want to do.

After passing the sequence to AutoModelForCausalLM, the last tensor in the output contains a score (logit) for every token in the vocabulary as the next token; applying softmax turns these scores into probabilities. In the code below I call this tensor next_token_candidates_tensor. After that, you only need to pick the indices of the top k candidates and decode them back into words.

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

class LMHeadModel:

    def __init__(self, model_name):
        # Initialize the model and the tokenizer.
        self.model = AutoModelForCausalLM.from_pretrained(model_name)
        self.tokenizer = AutoTokenizer.from_pretrained(model_name)
    
    def get_predictions(self, sentence):
        # Encode the sentence using the tokenizer and return the model predictions.
        inputs = self.tokenizer.encode(sentence, return_tensors="pt")
        with torch.no_grad():
            outputs = self.model(inputs)
            predictions = outputs[0]
        return predictions
    
    def get_next_word_probabilities(self, sentence, top_k=500):

        # Get the model predictions for the sentence.
        predictions = self.get_predictions(sentence)
        
        # Get the next token candidates.
        next_token_candidates_tensor = predictions[0, -1, :]

        # Get the top k next token candidates.
        topk_candidates_indexes = torch.topk(
            next_token_candidates_tensor, top_k).indices.tolist()

        # Get the token probabilities for all candidates.
        all_candidates_probabilities = torch.nn.functional.softmax(
            next_token_candidates_tensor, dim=-1)
        
        # Filter the token probabilities for the top k candidates.
        topk_candidates_probabilities = \
            all_candidates_probabilities[topk_candidates_indexes].tolist()

        # Decode the top k candidates back to words.
        topk_candidates_tokens = \
            [self.tokenizer.decode([idx]).strip() for idx in topk_candidates_indexes]

        # Return the top k candidates and their probabilities.
        return list(zip(topk_candidates_tokens, topk_candidates_probabilities))


sentence = "I enjoy walking in the"
model = LMHeadModel("gpt2")
model.get_next_word_probabilities(sentence, top_k=500)

# [('park', 0.15904344618320465),
# ('woods', 0.10028065741062164),
# ('streets', 0.0418376550078392),
# ('dark', 0.03117542900145054),
# ('door', 0.029618268832564354),
# ('street', 0.02388935722410679),
# ('rain', 0.021733922883868217),
# ...
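
One small note on the code above: torch.topk is taken over the raw logits while the softmax is computed over the full vocabulary; since softmax is monotonic, both orderings pick the same tokens. A minimal variant (just a sketch, not part of the answer's original code; get_next_word_probabilities_alt is a hypothetical helper built on the LMHeadModel class above) applies softmax first, so torch.topk returns the probabilities and indices together:

import torch

def get_next_word_probabilities_alt(lm, sentence, top_k=500):
    # lm is an LMHeadModel instance from the class defined above (assumption).
    predictions = lm.get_predictions(sentence)
    next_token_logits = predictions[0, -1, :]

    # Softmax first, then topk: the returned .values are already probabilities.
    probs = torch.nn.functional.softmax(next_token_logits, dim=-1)
    topk = torch.topk(probs, top_k)

    # Decode each candidate token id back to a (stripped) string.
    tokens = [lm.tokenizer.decode([idx]).strip() for idx in topk.indices.tolist()]
    return list(zip(tokens, topk.values.tolist()))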


cro*_*oik 2

I think you do yourself a favor by avoiding the pipeline and using the corresponding language-modeling class directly. All you need to do is:

  1. get the logits for the next token (GPT-2 works with tokens, which are not necessarily whole words),
  2. apply softmax to turn the logits into probabilities, and
  3. apply topk to retrieve the k most likely tokens.
import torch
from transformers import GPT2TokenizerFast, GPT2LMHeadModel

t = GPT2TokenizerFast.from_pretrained("gpt2")
m = GPT2LMHeadModel.from_pretrained("gpt2")

encoded_text = t("I enjoy walking in the", return_tensors="pt")

#1. step to get the logits of the next token
with torch.inference_mode():
  outputs = m(**encoded_text)

next_token_logits = outputs.logits[0, -1, :]
print(next_token_logits.shape)
print(next_token_logits)

# 2. step to convert the logits to probabilities
next_token_probs = torch.softmax(next_token_logits, -1)

# 3. step to get the top 10
topk_next_tokens = torch.topk(next_token_probs, 10)

#putting it together
print(*[(t.decode(idx), prob) for idx, prob in zip(topk_next_tokens.indices, topk_next_tokens.values)], sep="\n")

Output:

torch.Size([50257])
tensor([ -95.1139,  -93.7291,  -97.5711,  ...,  -98.0303, -100.2803,
         -96.1145])
(' park', tensor(0.1590))
(' woods', tensor(0.1003))
(' streets', tensor(0.0418))
(' dark', tensor(0.0312))
(' door', tensor(0.0296))
(' street', tensor(0.0239))
(' rain', tensor(0.0217))
(' city', tensor(0.0189))
(' same', tensor(0.0150))
(' halls', tensor(0.0135))
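
Since the question asks for the top 500 candidates rather than the top 10, that is only a matter of changing k and decoding every index. A short sketch reusing the t, m and next_token_probs variables from the snippet above:

# Take the top 500 candidates instead of 10 and decode each token id.
top500 = torch.topk(next_token_probs, 500)
candidates = [(t.decode([idx]), prob.item())
              for idx, prob in zip(top500.indices.tolist(), top500.values)]

# First few entries, e.g. [(' park', 0.159...), (' woods', 0.100...), ...]
print(candidates[:5])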