What does the text_target parameter do in Huggingface's AutoTokenizer?

Bet*_*tty 4 python huggingface-transformers huggingface

I am following the guide here: https://huggingface.co/docs/transformers/v4.28.1/tasks/summarization
The guide contains the following line:

labels = tokenizer(text_target=examples["summary"], max_length=128, truncation=True)

I don't understand what the `text_target` parameter does.


I tried the following code, and the last two lines gave exactly the same result.

from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained('t5-small')
text = "Weiter Verhandlung in Syrien."
tokenizer(text_target=text, max_length=128, truncation=True)
tokenizer(text, max_length=128, truncation=True)

The documentation only says "text_target (str, List[str], List[List[str]], optional): The sequence or batch of sequences to be encoded as target texts." I don't quite understand this. Are there situations in which setting `text_target` gives a different result?


cro*_*oik 6

Sometimes it is necessary to look at the code:

if text is None and text_target is None:
    raise ValueError("You need to specify either `text` or `text_target`.")
if text is not None:
    # The context manager will send the inputs as normal texts and not text_target, but we shouldn't change the
    # input mode in this case.
    if not self._in_target_context_manager:
        self._switch_to_input_mode()
    encodings = self._call_one(text=text, text_pair=text_pair, **all_kwargs)
if text_target is not None:
    self._switch_to_target_mode()
    target_encodings = self._call_one(text=text_target, text_pair=text_pair_target, **all_kwargs)
# Leave back tokenizer in input mode
self._switch_to_input_mode()

if text_target is None:
    return encodings
elif text is None:
    return target_encodings
else:
    encodings["labels"] = target_encodings["input_ids"]
    return encodings

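As you can see in the snippet above, both `text` and `text_target` are passed to `self._call_one()` to be encoded (note that `text_target` is passed as the `text` argument). This means that the same string gets the same encoding whether it is passed as `text` or as `text_target`, as long as `_switch_to_target_mode()` does not do anything special.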
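To illustrate what "something special" could look like, here is a minimal sketch with a toy stand-in tokenizer. `ToyTokenizer`, its `encode_source`/`encode_target` helpers, and the language codes are all hypothetical and not part of transformers. It assumes a tokenizer whose target mode swaps a language-code special token, which is how multilingual tokenizers such as mBART's are documented to behave. For t5-small the two switch methods appear to be no-ops, which would explain the identical results in the question.

```python
class ToyTokenizer:
    """Hypothetical stand-in that mimics only the mode switching, not a real tokenizer."""

    def __init__(self, src_lang="en_XX", tgt_lang="de_DE"):
        self.src_lang = src_lang
        self.tgt_lang = tgt_lang
        self._switch_to_input_mode()

    def _switch_to_input_mode(self):
        # Input mode: sequences end with the source language code.
        self.suffix_tokens = ["</s>", self.src_lang]

    def _switch_to_target_mode(self):
        # Target mode: sequences end with the target language code.
        self.suffix_tokens = ["</s>", self.tgt_lang]

    def _call_one(self, text):
        # Whitespace "tokenization" plus the mode-dependent special tokens.
        return text.split() + self.suffix_tokens

    def encode_source(self, text):
        self._switch_to_input_mode()
        return self._call_one(text)

    def encode_target(self, text):
        self._switch_to_target_mode()
        tokens = self._call_one(text)
        self._switch_to_input_mode()  # leave the tokenizer in input mode
        return tokens


tok = ToyTokenizer()
sentence = "Weiter Verhandlung in Syrien."
source_tokens = tok.encode_source(sentence)
target_tokens = tok.encode_target(sentence)
print(source_tokens)
print(target_tokens)
```

In this sketch the word tokens are identical in both modes and only the trailing language code differs, so a tokenizer without such mode-dependent special tokens would return the same encoding either way.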

The conditions at the end of the function answer your question:

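  1. When you only provide `text`, you get back its encoding.
  2. When you only provide `text_target`, you get back its encoding.
  3. When you provide both `text` and `text_target`, you get back the encoding of `text`, with the token IDs of `text_target` stored as the value of the `labels` key.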
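The three cases can be reproduced with a small self-contained sketch. `toy_call` and the character-level `_call_one` below are hypothetical stand-ins that copy only the return logic of the snippet above; they are not the library's implementation and they ignore the mode switching entirely.

```python
def _call_one(text):
    # Toy encoder (an assumption for illustration): one integer per character.
    return {"input_ids": [ord(ch) for ch in text]}


def toy_call(text=None, text_target=None):
    if text is None and text_target is None:
        raise ValueError("You need to specify either `text` or `text_target`.")
    if text is not None:
        encodings = _call_one(text)
    if text_target is not None:
        target_encodings = _call_one(text_target)

    if text_target is None:
        return encodings          # case 1: only text
    if text is None:
        return target_encodings   # case 2: only text_target
    # case 3: both, so the target ids are attached under "labels"
    encodings["labels"] = target_encodings["input_ids"]
    return encodings


only_text = toy_call(text="ab")
only_target = toy_call(text_target="cd")
both = toy_call(text="ab", text_target="cd")
print(only_text)    # {'input_ids': [97, 98]}
print(only_target)  # {'input_ids': [99, 100]}
print(both)         # {'input_ids': [97, 98], 'labels': [99, 100]}
```

Note that in case 2 the ids come back under `input_ids`, not `labels`, which is why the guide assigns the result to a variable called `labels` and reads `input_ids` from it.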

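To be honest, I think the implementation is a bit unintuitive. I would expect passing `text_target` to return an object that contains only the `labels` key. I assume they wanted to keep the output object and the corresponding documentation simple and therefore chose this implementation. Or there is a model I am not aware of for which it actually makes sense.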