Luc*_*edo 5 python sentiment-analysis huggingface-transformers huggingface large-language-model
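I need to use pipeline to get tokenization and inference from the distilbert-base-uncased-finetuned-sst-2-english model over my dataset.
My data is a list of sentences; for the sake of illustration, let's assume it is: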
texts = ["this is the first sentence", "of my data.", "In fact, thats not true,", "but we are going to assume it", "is"]
Before using pipeline, I was getting the logits from the model output like this:
with torch.no_grad():
    logits = model(**tokenized_test).logits
Now I have to use pipeline, so this is how I get the model output:
from transformers import AutoModelForSequenceClassification, AutoTokenizer, pipeline

selected_model = "distilbert-base-uncased-finetuned-sst-2-english"
tokenizer = AutoTokenizer.from_pretrained(selected_model)
model = AutoModelForSequenceClassification.from_pretrained(selected_model, num_labels=2)
classifier = pipeline('sentiment-analysis', model=model, tokenizer=tokenizer)
print(classifier(texts))
This gives me:
[{'label': 'POSITIVE', 'score': 0.9746173024177551}, {'label': 'NEGATIVE', 'score': 0.5020197629928589}, {'label': 'NEGATIVE', 'score': 0.9995120763778687}, {'label': 'NEGATIVE', 'score': 0.9802979826927185}, {'label': 'POSITIVE', 'score': 0.9274746775627136}]
I can no longer find the logits field.
Is there a way to get the logits instead of label and score? Would a custom pipeline be the best and/or easiest way to do it?
When you use the default pipeline, the postprocess function will usually apply a softmax, e.g.
from transformers import AutoTokenizer, AutoModelForSequenceClassification, pipeline

tokenizer = AutoTokenizer.from_pretrained('distilbert-base-uncased-finetuned-sst-2-english')
model = AutoModelForSequenceClassification.from_pretrained('distilbert-base-uncased-finetuned-sst-2-english')
# Build the default sentiment-analysis pipeline around the loaded model and tokenizer.
classifier = pipeline('sentiment-analysis', model=model, tokenizer=tokenizer)

text = ['hello this is a test',
        'that transforms a list of sentences',
        'into a list of list of sentences',
        'in order to emulate, in this case, two batches of the same lenght',
        'to be tokenized by the hf tokenizer for the defined model']

classifier(text, batch_size=2, truncation="only_first")
[out]:
[{'label': 'NEGATIVE', 'score': 0.9379090666770935},
{'label': 'POSITIVE', 'score': 0.9990271329879761},
{'label': 'NEGATIVE', 'score': 0.9726701378822327},
{'label': 'NEGATIVE', 'score': 0.9965035915374756},
{'label': 'NEGATIVE', 'score': 0.9913086891174316}]
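To make the softmax step concrete, here is a minimal sketch in plain Python (no transformers needed) that turns a pair of logits into the label/score dict the default pipeline reports. The logit values are the ones shown for the first two sentences further down in this answer; the ID2LABEL mapping and the helper names softmax and to_label_and_score are assumptions made for illustration, not part of the library.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a flat list of floats."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


# Assumption: for this checkpoint, index 0 is NEGATIVE and index 1 is POSITIVE.
ID2LABEL = {0: "NEGATIVE", 1: "POSITIVE"}


def to_label_and_score(logits):
    """Mimic the default post-processing: softmax, then keep the top class."""
    probs = softmax(logits)
    best = max(range(len(probs)), key=probs.__getitem__)
    return {"label": ID2LABEL[best], "score": probs[best]}


# Logits for the first two sentences, rounded to four decimals.
print(to_label_and_score([1.5094, -1.2056]))
print(to_label_and_score([-3.4114, 3.5229]))
```

The two scores should come out at roughly 0.938 (NEGATIVE) and 0.999 (POSITIVE), in line with the first two entries of the pipeline output above, up to the rounding of the logits.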
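So what you want is to overload the postprocess logic by inheriting from the pipeline.
To check which pipeline class the classifier is an instance of, do this: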
classifier = pipeline('sentiment-analysis', model=model, tokenizer=tokenizer)
type(classifier)
[out]:
transformers.pipelines.text_classification.TextClassificationPipeline
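Now that you know the parent class of the task pipeline you want to use, you can do this and still enjoy the perks of the precoded batching from TextClassificationPipeline: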
from transformers import TextClassificationPipeline

class MarioThePlumber(TextClassificationPipeline):
    def postprocess(self, model_outputs):
        # Return the raw logits instead of the softmaxed label/score dicts.
        return model_outputs["logits"]

# Reuses the model and tokenizer loaded earlier; batching is still handled by the parent class.
pipe = MarioThePlumber(model=model, tokenizer=tokenizer)
pipe(text, batch_size=2, truncation="only_first")
[out]:
[tensor([[ 1.5094, -1.2056]]),
tensor([[-3.4114, 3.5229]]),
tensor([[ 1.8835, -1.6886]]),
tensor([[ 3.0780, -2.5745]]),
tensor([[ 2.5383, -2.1984]])]
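Depending on the installed transformers version, subclassing may not be strictly necessary: as far as I know, recent releases of TextClassificationPipeline accept function_to_apply="none" to skip the softmax and top_k=None to return every label, in which case the score field carries the raw logit. The sketch below assumes such a version; rows_from_all_scores and raw_logits are hypothetical helper names introduced here, and the per-text results are reordered by label because the pipeline may sort them by score.

```python
from typing import Dict, List


def rows_from_all_scores(outputs: List[List[Dict]], label_order: List[str]) -> List[List[float]]:
    """Reorder per-text [{'label', 'score'}, ...] lists into rows that follow label_order."""
    rows = []
    for per_text in outputs:
        by_label = {item["label"]: item["score"] for item in per_text}
        rows.append([by_label[label] for label in label_order])
    return rows


def raw_logits(texts: List[str],
               model_name: str = "distilbert-base-uncased-finetuned-sst-2-english") -> List[List[float]]:
    """Hypothetical helper: fetch un-normalized scores through the stock pipeline."""
    # Imported lazily so the pure-Python helper above works without transformers installed.
    from transformers import pipeline

    classifier = pipeline("sentiment-analysis", model=model_name)
    # Assumption: this version supports function_to_apply="none" and top_k=None.
    outputs = classifier(texts, function_to_apply="none", top_k=None, batch_size=2)
    id2label = classifier.model.config.id2label
    label_order = [id2label[i] for i in sorted(id2label)]
    return rows_from_all_scores(outputs, label_order)
```

Calling raw_logits(text) should then give one [NEGATIVE, POSITIVE] row of plain floats per sentence. The subclass above is still the more direct route if you want the logits back as tensors, or if your version lacks these arguments.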