How to encode multiple sentences using transformers.BertTokenizer?

Lei*_*Hao 5 word-embedding huggingface-transformers huggingface-tokenizers

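I want to create a mini-batch by encoding multiple sentences with transformers.BertTokenizer. It seems to work for a single sentence. How can I make it work for several sentences?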

from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')

# tokenizing a single sentence seems to work
tokenizer.encode('this is the first sentence')
>>> [2023, 2003, 1996, 2034, 6251]

# tokenize two sentences
tokenizer.encode(['this is the first sentence', 'another setence'])
>>> [100, 100] # expecting 7 tokens

cro*_*oik 7

Use tokenizer.batch_encode_plus (documentation). It returns a dictionary containing the input_ids, token_type_ids and attention_mask as lists for each input sentence:

tokenizer.batch_encode_plus(['this is the first sentence', 'another setence'])

Output:

{'input_ids': [[101, 2023, 2003, 1996, 2034, 6251, 102], [101, 2178, 2275, 10127, 102]], 'token_type_ids': [[0, 0, 0, 0, 0, 0, 0], [0, 0, 0, 0, 0]], 'attention_mask': [[1, 1, 1, 1, 1, 1, 1], [1, 1, 1, 1, 1]]}
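Since the goal is a mini-batch: the two input_ids lists above have different lengths (7 and 5), so they cannot be stacked into one tensor as they are. Here is a minimal sketch of right-padding them by hand. The helper name pad_batch is made up for illustration, and PAD_ID = 0 assumes the [PAD] id of the bert-base-uncased vocabulary; check tokenizer.pad_token_id for your own model.

```python
# input_ids copied from the batch_encode_plus output above
input_ids = [
    [101, 2023, 2003, 1996, 2034, 6251, 102],
    [101, 2178, 2275, 10127, 102],
]

PAD_ID = 0  # assumption: id of [PAD] in the bert-base-uncased vocabulary


def pad_batch(sequences, pad_id=PAD_ID):
    """Right-pad every sequence to the longest length and build a matching attention mask.

    Hypothetical helper for illustration; it is not part of transformers.
    """
    max_len = max(len(seq) for seq in sequences)
    padded, mask = [], []
    for seq in sequences:
        n_pad = max_len - len(seq)
        padded.append(seq + [pad_id] * n_pad)      # real ids, then padding ids
        mask.append([1] * len(seq) + [0] * n_pad)  # 1 = real token, 0 = padding
    return padded, mask


padded_ids, attention_mask = pad_batch(input_ids)
print(padded_ids)
print(attention_mask)
```

Newer transformers releases can, as far as I know, do this for you through a padding argument on the tokenizer call, so check the documentation for your installed version before rolling your own.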

If you only want to generate the input_ids, you have to set return_token_type_ids and return_attention_mask to False:

tokenizer.batch_encode_plus(['this is the first sentence', 'another setence'], return_token_type_ids=False, return_attention_mask=False)

Output:

{'input_ids': [[101, 2023, 2003, 1996, 2034, 6251, 102], [101, 2178, 2275, 10127, 102]]}
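As for why encode returned [100, 100] in the question: as far as I can tell, encode treats a list of strings as a sequence that has already been split into tokens, so each whole sentence is looked up as if it were a single token. Neither sentence is in the vocabulary, so both map to [UNK], whose id is 100 in bert-base-uncased. Here is a toy illustration of that lookup. The mini vocabulary and the function below are stand-ins written for this example, not the real BERT vocabulary or the library's implementation.

```python
# Hypothetical mini vocabulary, for illustration only (not the real BERT vocab file)
toy_vocab = {
    "[UNK]": 100,
    "this": 2023,
    "is": 2003,
    "the": 1996,
    "first": 2034,
    "sentence": 6251,
}


def convert_tokens_to_ids(tokens, vocab=toy_vocab):
    """Look up each token as-is; anything missing from the vocabulary maps to [UNK]."""
    return [vocab.get(token, vocab["[UNK]"]) for token in tokens]


# A sentence that was split into words first: every token is found
split_ids = convert_tokens_to_ids(["this", "is", "the", "first", "sentence"])
print(split_ids)

# Two whole sentences passed as if they were two tokens: neither is found
unsplit_ids = convert_tokens_to_ids(["this is the first sentence", "another setence"])
print(unsplit_ids)
```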