How can I get a model's logits from a HuggingFace text classification pipeline?

Luc*_*edo 5 python sentiment-analysis huggingface-transformers huggingface large-language-model

I need to use pipeline to handle both tokenization and inference with the distilbert-base-uncased-finetuned-sst-2-english model over my dataset.

My data is a list of sentences; for illustration, let's assume it is:

texts = ["this is the first sentence", "of my data.", "In fact, thats not true,", "but we are going to assume it", "is"]

Before using pipeline, I was getting the logits from the model output like this:

import torch

with torch.no_grad():
    logits = model(**tokenized_test).logits

Now I have to use a pipeline, so this is how I get the model's output:

from transformers import AutoTokenizer, AutoModelForSequenceClassification, pipeline

selected_model = "distilbert-base-uncased-finetuned-sst-2-english"
tokenizer = AutoTokenizer.from_pretrained(selected_model)
model = AutoModelForSequenceClassification.from_pretrained(selected_model, num_labels=2)
classifier = pipeline('sentiment-analysis', model=model, tokenizer=tokenizer)
print(classifier(texts))

This gives me:

[{'label': 'POSITIVE', 'score': 0.9746173024177551}, {'label': 'NEGATIVE', 'score': 0.5020197629928589}, {'label': 'NEGATIVE', 'score': 0.9995120763778687}, {'label': 'NEGATIVE', 'score': 0.9802979826927185}, {'label': 'POSITIVE', 'score': 0.9274746775627136}]

The "logits" field is no longer available in this output.

Is there a way to get the logits instead of the label and score? Would a custom pipeline be the best and/or easiest way to do it?

alv*_*vas 6

When you use the default pipeline, the postprocess function usually applies a softmax to the logits, e.g.

from transformers import AutoTokenizer, AutoModelForSequenceClassification, pipeline

tokenizer = AutoTokenizer.from_pretrained('distilbert-base-uncased-finetuned-sst-2-english')
model = AutoModelForSequenceClassification.from_pretrained('distilbert-base-uncased-finetuned-sst-2-english')

# Build the default sentiment-analysis pipeline from the loaded model and tokenizer.
classifier = pipeline('sentiment-analysis', model=model, tokenizer=tokenizer)

text = ['hello this is a test',
 'that transforms a list of sentences',
 'into a list of list of sentences',
 'in order to emulate, in this case, two batches of the same lenght',
 'to be tokenized by the hf tokenizer for the defined model']

classifier(text, batch_size=2, truncation="only_first")

[out]:

[{'label': 'NEGATIVE', 'score': 0.9379090666770935},
 {'label': 'POSITIVE', 'score': 0.9990271329879761},
 {'label': 'NEGATIVE', 'score': 0.9726701378822327},
 {'label': 'NEGATIVE', 'score': 0.9965035915374756},
 {'label': 'NEGATIVE', 'score': 0.9913086891174316}]
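To make the softmax step concrete, the sketch below applies it by hand to the first two logit rows that the subclassed pipeline later in this answer returns, and recovers the labels and scores shown above. It uses only the standard library; with tensors, `torch.softmax(logits, dim=-1)` should be the equivalent call. The `ID2LABEL` mapping and the helper names are assumptions made for illustration (the mapping is chosen to match this checkpoint's outputs), not the pipeline's actual implementation.

```python
import math

# Assumed index -> label mapping for this SST-2 checkpoint.
ID2LABEL = {0: "NEGATIVE", 1: "POSITIVE"}


def softmax(logits):
    """Numerically stable softmax over a flat list of floats."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def to_label_and_score(logits):
    """Mimic the default post-processing: softmax, then keep the top class."""
    probs = softmax(logits)
    best = max(range(len(probs)), key=probs.__getitem__)
    return {"label": ID2LABEL[best], "score": probs[best]}


# First two logit rows from the subclassed pipeline output in this answer.
for row in ([1.5094, -1.2056], [-3.4114, 3.5229]):
    print(to_label_and_score(row))
```

The printed scores should agree with the first two pipeline results to about four decimal places, since the logits are themselves rounded.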

So what you want is to override the postprocess logic by inheriting from the pipeline.

To check which pipeline class the classifier is an instance of, do this:

classifier = pipeline('sentiment-analysis', model=model, tokenizer=tokenizer)
type(classifier)

[out]:

transformers.pipelines.text_classification.TextClassificationPipeline

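Now that you know the parent class of the task pipeline you want to use, you can subclass it as follows and still enjoy the benefits of the batching already built into TextClassificationPipeline: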

from transformers import TextClassificationPipeline

class MarioThePlumber(TextClassificationPipeline):
    def postprocess(self, model_outputs, **postprocess_kwargs):
        # Skip the default softmax/argmax step and return the raw logits.
        logits = model_outputs["logits"]
        return logits

pipe = MarioThePlumber(model=model, tokenizer=tokenizer)

pipe(text, batch_size=2, truncation="only_first")

[out]:

[tensor([[ 1.5094, -1.2056]]),
 tensor([[-3.4114,  3.5229]]),
 tensor([[ 1.8835, -1.6886]]),
 tensor([[ 3.0780, -2.5745]]),
 tensor([[ 2.5383, -2.1984]])]
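As a possible alternative to subclassing, recent versions of transformers accept `function_to_apply="none"` and `top_k=None` when calling a text classification pipeline, in which case the `score` fields should hold the raw logits for every class. The following is a minimal sketch under that assumption, so check it against your installed version. `rows_in_label_order` and `get_logits_via_pipeline` are hypothetical helper names, not part of the library. Because the pipeline returns each text's results sorted by score, the first helper puts every row back into class-index order.

```python
def rows_in_label_order(results, label2id):
    """Turn per-text lists of {'label', 'score'} dicts into rows of scores
    ordered by class index (the pipeline sorts them by score instead)."""
    rows = []
    for per_text in results:
        row = [0.0] * len(label2id)
        for item in per_text:
            row[label2id[item["label"]]] = item["score"]
        rows.append(row)
    return rows


def get_logits_via_pipeline(texts, model_name="distilbert-base-uncased-finetuned-sst-2-english"):
    """Run the stock pipeline with score post-processing disabled (hypothetical helper)."""
    # Imported lazily; the model is downloaded on first use.
    from transformers import pipeline

    classifier = pipeline("sentiment-analysis", model=model_name)
    # function_to_apply="none" skips the softmax; top_k=None returns every class.
    results = classifier(texts, function_to_apply="none", top_k=None)
    return rows_in_label_order(results, classifier.model.config.label2id)


# Offline illustration: a hand-written result in the assumed output shape,
# reusing the first two logit rows from the answer above.
fake_results = [
    [{"label": "NEGATIVE", "score": 1.5094}, {"label": "POSITIVE", "score": -1.2056}],
    [{"label": "POSITIVE", "score": 3.5229}, {"label": "NEGATIVE", "score": -3.4114}],
]
print(rows_in_label_order(fake_results, {"NEGATIVE": 0, "POSITIVE": 1}))
```

With the model available, `get_logits_via_pipeline(text)` would return one list of floats per input sentence, while the subclass above returns tensors.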