What is the difference between tokenizer.encode and tokenizer.encode_plus in Hugging Face?

Question

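This is an example of using a model for sequence classification to determine whether two sequences are paraphrases of each other. The two examples below give two different results. Can you help me understand why tokenizer.encode and tokenizer.encode_plus give different results?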

Example 1 (with .encode_plus()):

paraphrase = tokenizer.encode_plus(sequence_0, sequence_2, return_tensors="pt")
not_paraphrase = tokenizer.encode_plus(sequence_0, sequence_1, return_tensors="pt")

paraphrase_classification_logits = model(**paraphrase)[0]
not_paraphrase_classification_logits = model(**not_paraphrase)[0]

Example 2 (with .encode()):

paraphrase = tokenizer.encode(sequence_0, sequence_2, return_tensors="pt")
not_paraphrase = tokenizer.encode(sequence_0, sequence_1, return_tensors="pt")

paraphrase_classification_logits = model(paraphrase)[0]
not_paraphrase_classification_logits = model(not_paraphrase)[0]

Answer 1

den*_*ger 28

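The main difference comes from the additional information that encode_plus provides. If you read the documentation for the two functions, the description of encode() reads:

Converts a string into a sequence of ids (integers), using the tokenizer and vocabulary. Same as doing self.convert_tokens_to_ids(self.tokenize(text)).

and the description of encode_plus():

Returns a dictionary containing the encoded sequence or sequence pair and additional information: the mask for sequence classification and the overflowing elements if a max_length is specified.

Depending on the model you specified and the input sentences, the difference lies in the additionally encoded information, specifically the input mask. Because you are feeding in two sentences at a time, BERT (and likely other model variants) expects some form of masking that lets the model distinguish between the two sequences, see here. Since encode_plus provides this information but encode does not, you get different output results.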
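To make the difference concrete, here is a minimal pure-Python sketch. It is a toy illustration and not the Hugging Face implementation: the vocabulary, the whitespace splitting, and the helper names toy_encode and toy_encode_plus are all assumptions made up for this example. It mimics how, for a BERT-style sentence pair, encode returns only the ids while encode_plus also returns token_type_ids (segment ids) and an attention mask.

```python
# Toy illustration only: a hand-rolled vocabulary and whitespace splitting,
# NOT the real Hugging Face tokenizer. All names here are hypothetical.
VOCAB = {"[PAD]": 0, "[UNK]": 1, "[CLS]": 2, "[SEP]": 3,
         "the": 4, "cat": 5, "sat": 6, "a": 7, "dog": 8, "ran": 9}


def to_ids(text):
    """Split on whitespace and map each token to its id ([UNK] if missing)."""
    return [VOCAB.get(tok, VOCAB["[UNK]"]) for tok in text.lower().split()]


def toy_encode(text_a, text_b):
    """Mimics encode(): only the ids of [CLS] A [SEP] B [SEP]."""
    return ([VOCAB["[CLS]"]] + to_ids(text_a) + [VOCAB["[SEP]"]]
            + to_ids(text_b) + [VOCAB["[SEP]"]])


def toy_encode_plus(text_a, text_b):
    """Mimics encode_plus(): the same ids plus the extra per-token information."""
    ids_a, ids_b = to_ids(text_a), to_ids(text_b)
    input_ids = toy_encode(text_a, text_b)
    # Segment 0 covers [CLS] A [SEP]; segment 1 covers B [SEP].
    token_type_ids = [0] * (len(ids_a) + 2) + [1] * (len(ids_b) + 1)
    # No padding in this sketch, so every position is a real token.
    attention_mask = [1] * len(input_ids)
    return {"input_ids": input_ids,
            "token_type_ids": token_type_ids,
            "attention_mask": attention_mask}


ids_only = toy_encode("the cat sat", "a dog ran")
full = toy_encode_plus("the cat sat", "a dog ran")
print(ids_only)
print(full["token_type_ids"])
```

The ids are identical in both cases; only the dictionary carries the segment boundary. As far as I know, when a BERT model in transformers receives input_ids alone it falls back to all-zero token_type_ids, so the second sentence is treated as part of the first segment, which would explain the different logits in the question.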

Answer 2

Osc*_*gel 9

The tokenizer.encode_plus function combines multiple steps for us:

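1. Split the sentence into tokens.
2. Add the special [CLS] and [SEP] tokens.
3. Map the tokens to their IDs.
4. Pad or truncate all sentences to the same length.
5. Create the attention masks, which explicitly differentiate real tokens from [PAD] tokens.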
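The steps listed above can be sketched in a few lines of plain Python. This is a toy model of the pipeline and not the library's code: the tiny vocabulary, the whitespace tokenization, and the function name toy_encode_single are assumptions invented for illustration, and max_length is assumed to be at least 2 so the special tokens fit.

```python
# Toy sketch of the five steps; NOT the real Hugging Face implementation.
# The vocabulary and the function name are hypothetical.
VOCAB = {"[PAD]": 0, "[UNK]": 1, "[CLS]": 2, "[SEP]": 3, "hello": 4, "world": 5}


def toy_encode_single(sentence, max_length):
    # Step 1: split the sentence into tokens (whitespace split for simplicity).
    tokens = sentence.lower().split()
    # Step 4 (truncation part): leave room for [CLS] and [SEP].
    tokens = tokens[: max_length - 2]
    # Step 2: add the special tokens.
    tokens = ["[CLS]"] + tokens + ["[SEP]"]
    # Step 3: map tokens to ids, using [UNK] for out-of-vocabulary tokens.
    input_ids = [VOCAB.get(tok, VOCAB["[UNK]"]) for tok in tokens]
    # Step 5: real tokens get 1 in the attention mask, padding gets 0.
    n_pad = max_length - len(input_ids)
    attention_mask = [1] * len(input_ids) + [0] * n_pad
    # Step 4 (padding part): fill up to max_length with [PAD].
    input_ids = input_ids + [VOCAB["[PAD]"]] * n_pad
    return {"input_ids": input_ids, "attention_mask": attention_mask}


padded = toy_encode_single("hello world", max_length=6)
truncated = toy_encode_single("hello world hello world", max_length=4)
print(padded)
print(truncated)
```

Every output has exactly max_length entries, and the attention mask marks which of them are real tokens, so sentences of different lengths can be stacked into one batch.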

The documentation is here.

# Tokenize all of the sentences and map the tokens to their word IDs.
input_ids = []
attention_masks = []

# For every sentence...
for sent in sentences:
    # `encode_plus` will:
    #   (1) Tokenize the sentence.
    #   (2) Prepend the `[CLS]` token to the start.
    #   (3) Append the `[SEP]` token to the end.
    #   (4) Map tokens to their IDs.
    #   (5) Pad or truncate the sentence to `max_length`.
    #   (6) Create attention masks for [PAD] tokens.
    encoded_dict = tokenizer.encode_plus(
        sent,                          # Sentence to encode.
        add_special_tokens=True,       # Add '[CLS]' and '[SEP]'.
        max_length=64,                 # Pad & truncate all sentences.
        padding='max_length',          # Replaces the deprecated pad_to_max_length=True.
        truncation=True,               # Cut sentences longer than max_length.
        return_attention_mask=True,    # Construct attn. masks.
        return_tensors='pt',           # Return pytorch tensors.
    )

    # Collect the encoded ids and the matching attention mask.
    input_ids.append(encoded_dict['input_ids'])
    attention_masks.append(encoded_dict['attention_mask'])

