How to truncate input in the Huggingface pipeline?

Question

Eti*_*neT 21 huggingface-transformers huggingface-tokenizers

I currently use a Huggingface pipeline for sentiment analysis, like so:

from transformers import pipeline
classifier = pipeline('sentiment-analysis', device=0)

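The problem is that when I pass texts longer than 512 tokens, it crashes with an error saying the input is too long. Is there a way to pass the tokenizer's max_length and truncation arguments directly to the pipeline?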

My workaround is:

from transformers import AutoTokenizer, AutoModelForSequenceClassification, pipeline

model_name = "nlptown/bert-base-multilingual-uncased-sentiment"
model = AutoModelForSequenceClassification.from_pretrained(model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name)
classifier = pipeline('sentiment-analysis', model=model, tokenizer=tokenizer, device=0)

And then, when I call the tokenizer:

pt_batch = tokenizer(text, padding=True, truncation=True, max_length=512, return_tensors="pt")

But it would be much nicer to be able to call the pipeline directly, like this:

classifier(text, padding=True, truncation=True, max_length=512)

Answer 1

小智 25

You can pass tokenizer_kwargs at inference time:

model_pipeline = pipeline("text-classification", model=model, tokenizer=tokenizer, device=0, return_all_scores=True)

tokenizer_kwargs = {'padding': True, 'truncation': True, 'max_length': 512, 'return_tensors': 'pt'}

prediction = model_pipeline('sample text to predict', **tokenizer_kwargs)

For more details, you can check this link.
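If the same tokenizer arguments should apply to every call, one option is to bind them once with functools.partial. The sketch below is only an illustration: `with_truncation` and `fake_classifier` are hypothetical names, and `fake_classifier` is a stand-in that echoes its keyword arguments so the example runs without downloading a model. `return_tensors` is left out on the assumption that the pipeline picks the tensor type itself, which may vary between transformers versions.

```python
from functools import partial

# Tokenizer arguments forwarded on every call. `return_tensors` is omitted
# on the assumption that the pipeline sets the tensor type internally.
TOKENIZER_KWARGS = {"padding": True, "truncation": True, "max_length": 512}


def with_truncation(classifier, **overrides):
    """Return a callable that always passes the truncation kwargs to `classifier`."""
    return partial(classifier, **{**TOKENIZER_KWARGS, **overrides})


def fake_classifier(texts, **kwargs):
    """Hypothetical stand-in for a pipeline object: echoes what it received."""
    return {"texts": texts, "kwargs": kwargs}


# Every call through the wrapper now carries the truncation settings.
result = with_truncation(fake_classifier)("some very long review ...")
print(result["kwargs"])
```

With a real pipeline, the same wrapper would presumably be applied as with_truncation(model_pipeline), after which the wrapped object can be called with just the text.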

Answer 2

小智 11

This should work:

classifier(text, padding=True, truncation=True)

If it doesn't, try loading the tokenizer as:

tokenizer = AutoTokenizer.from_pretrained(model_name, model_max_len=512)

I think it should be model_max_length instead of model_max_len. Otherwise it doesn't work for me. (2 upvotes)
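As a rough picture of why truncation=True can work without an explicit max_length: when truncation is requested and no max_length is given, the tokenizer falls back to its model_max_length. The toy function below mimics that fallback for a single sequence. It is not library code; `truncate_ids` is a hypothetical helper, and the `cls_id` / `sep_id` defaults are placeholder special-token ids rather than values read from a real vocabulary.

```python
def truncate_ids(token_ids, max_length=None, model_max_length=512,
                 cls_id=101, sep_id=102):
    """Toy single-sequence truncation: clip the body so the total fits the limit.

    cls_id and sep_id are assumed placeholder ids for the special tokens.
    """
    # Fall back to the tokenizer-level limit when no explicit max_length is given.
    limit = max_length if max_length is not None else model_max_length
    body_budget = max(limit - 2, 0)  # reserve two slots for the special tokens
    return [cls_id] + list(token_ids[:body_budget]) + [sep_id]


long_input = list(range(1000, 1700))            # 700 fake token ids
default_cut = truncate_ids(long_input)           # uses model_max_length
explicit_cut = truncate_ids(long_input, max_length=128)
print(len(default_cut), len(explicit_cut))
```

Under these assumptions the 700-id input is clipped to the fallback limit in the first call and to the explicit limit in the second, while inputs already shorter than the limit pass through with only the special tokens added.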

Archived:	4 years, 8 months ago
Views:	10993
Last recorded:	2 years, 2 months ago