I am following the guide here: https://huggingface.co/docs/transformers/v4.28.1/tasks/summarization. The guide contains a line like this:

labels = tokenizer(text_target=examples["summary"], max_length=128, truncation=True)

I don't understand what the text_target parameter does. I tried the following code, and the last two lines give exactly the same result.

from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained('t5-small')
text = "Weiter Verhandlung in Syrien."
tokenizer(text_target=text, max_length=128, truncation=True)
tokenizer(text, max_length=128, truncation=True)

The documentation only says: text_target (str, List[str], List[List[str]], optional): The sequence or batch of sequences to be encoded as target texts. That doesn't clear it up for me. Is there any case in which setting text_target produces a different result?
Sometimes it helps to look at the code:
if text is None and text_target is None:
    raise ValueError("You need to specify either `text` or `text_target`.")
if text is not None:
    # The context manager will send the inputs as normal texts and not text_target, but we shouldn't change the
    # input mode in this case.
    if not self._in_target_context_manager:
        self._switch_to_input_mode()
    encodings = self._call_one(text=text, text_pair=text_pair, **all_kwargs)
if text_target is not None:
    self._switch_to_target_mode()
    target_encodings = self._call_one(text=text_target, text_pair=text_pair_target, **all_kwargs)
# Leave back tokenizer in input mode
self._switch_to_input_mode()
if text_target is None:
    return encodings
elif text is None:
    return target_encodings
else:
    encodings["labels"] = target_encodings["input_ids"]
    return encodings
As you can see in the snippet above, both text and text_target are passed to self._call_one() to be encoded (note that text_target is passed as the text argument). This means that, as long as _switch_to_target_mode() doesn't do anything special, the same string will be encoded identically whether it is passed as text or as text_target.
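For t5-small these mode switches are, to my knowledge, no-ops, which is why your two calls agree. Here is a minimal sketch of a case where the switch does matter, assuming the MBart tokenizer, which (as I understand it) overrides the mode switches to emit the target-language code instead of the source-language code:

from transformers import AutoTokenizer

# facebook/mbart-large-cc25 appends a language code after </s>;
# target mode swaps en_XX for ro_RO, so the token IDs differ.
mbart_tok = AutoTokenizer.from_pretrained(
    "facebook/mbart-large-cc25", src_lang="en_XX", tgt_lang="ro_RO"
)
text = "UN Chief Says There Is No Military Solution in Syria"
print(mbart_tok(text)["input_ids"])              # [..., </s> id, en_XX id]
print(mbart_tok(text_target=text)["input_ids"])  # [..., </s> id, ro_RO id]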
The conditional at the end of the function answers your question:
- If you pass only text, you retrieve its encodings.
- If you pass only text_target, you retrieve its encodings.
- If you pass both text and text_target, you retrieve the encodings of text, with the token IDs of text_target added under the labels key.

To be honest, I find the implementation a bit unintuitive. I would have expected that passing text_target returns an object containing only a labels key. I assume they wanted to keep the output object and the corresponding documentation simple and therefore chose this implementation. Or there is some model I'm not aware of for which it actually makes sense. The sketch below illustrates the three cases.
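A minimal sketch of the three call modes, reusing the t5-small tokenizer from the question (the summary string here is made up for illustration):

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained('t5-small')
doc = "Weiter Verhandlung in Syrien."
summary = "Verhandlung"  # hypothetical target text

inputs_only = tokenizer(doc)                   # keys: input_ids, attention_mask
targets_only = tokenizer(text_target=summary)  # same keys, no "labels"
both = tokenizer(doc, text_target=summary)     # adds a "labels" key

# The labels are exactly the input_ids of the target-only call.
assert both["labels"] == targets_only["input_ids"]
assert both["input_ids"] == inputs_only["input_ids"]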