BertTokenizer - extra spaces appear when encoding and decoding a sequence

Question

Hen*_*ski 7 python tokenize torch pytorch bert-language-model

While using HuggingFace's Transformers, I ran into a problem with the encode and decode methods.

I have the following string:

test_string = 'text with percentage%'

Then I run the following code:

import torch
from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained('bert-base-cased')

test_string = 'text with percentage%'

# encode() converts a string into a sequence of integer ids, using the tokenizer and its vocabulary.
input_ids = tokenizer.encode(test_string)
output = tokenizer.decode(input_ids)

The output looks like this:

'text with percentage %'

There is an extra space before the %. I tried the extra argument clean_up_tokenization_spaces, but that is meant for something different.

What should I use when encoding and decoding to get exactly the same text before and after? This also happens with other special characters.

Answer 1

ver*_*uth 3

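If you are trying to use BERT for token classification in order to find a span in your original string, then one workaround is to use BertTokenizerFast with the option return_offsets_mapping=True.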

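from transformers import BertTokenizerFast

test_string = 'text with percentage%'

tokenizer = BertTokenizerFast.from_pretrained('bert-base-cased')
# return_offsets_mapping=True also returns each token's (start, end) character offsets in test_string
tokens = tokenizer(test_string, return_offsets_mapping=True)
input_ids = tokens.data["input_ids"]

# some_model is a placeholder for your own token classification model
span_start_index, span_stop_index = some_model(input_ids)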

Then, once you have the token classification results, you can do something like this:

predicted_span = test_string[tokens.encodings[0].offsets[span_start_index][0]:tokens.encodings[0].offsets[span_stop_index][1]]
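
To see why slicing the original string by offsets avoids the extra space, here is a rough, library-free sketch. The names toy_tokenize and toy_decode are hypothetical and are not part of Transformers. The regex is only a stand-in for the way a BERT-style tokenizer turns punctuation such as % into its own token, and the hard-coded span indices stand in for a model's prediction. The point is that joining tokens with spaces cannot know whether a space was there originally, while character offsets into the original string can.

```python
import re

def toy_tokenize(text):
    """Toy stand-in for a BERT-style tokenizer (an assumption, not the real one):
    split on whitespace, make every punctuation character its own token,
    and record the (start, end) character offsets of each token."""
    tokens, offsets = [], []
    # \w+ matches a run of word characters; [^\w\s] matches one punctuation character.
    for match in re.finditer(r"\w+|[^\w\s]", text):
        tokens.append(match.group())
        offsets.append(match.span())
    return tokens, offsets

def toy_decode(tokens):
    """Join tokens with single spaces; the original spacing is not stored anywhere."""
    return " ".join(tokens)

test_string = "text with percentage%"
tokens, offsets = toy_tokenize(test_string)

# Decoding re-inserts a space before '%', because '%' is a separate token.
decoded = toy_decode(tokens)

# Pretend a model predicted a span covering tokens 2..3 ("percentage" and "%").
span_start_index, span_stop_index = 2, 3
start_char = offsets[span_start_index][0]
stop_char = offsets[span_stop_index][1]
predicted_span = test_string[start_char:stop_char]

print(tokens)          # ['text', 'with', 'percentage', '%']
print(decoded)         # text with percentage %
print(predicted_span)  # percentage%
```

With the real BertTokenizerFast the idea is the same, except that the offsets come from the tokenizer and the token list also contains special tokens such as [CLS] and [SEP], so the indices returned by your model need to be in that same token numbering.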

Archived:	5 years, 11 months ago
Views:	4265
Last activity:	4 years, 8 months ago