Should we lowercase input data for (pre)training a BERT uncased model using Hugging Face?

Question

CAR*_*man 3 nlp deep-learning pytorch huggingface-transformers

Should we lowercase input data for (pre)training a BERT uncased model using Hugging Face? I looked at Thomas Wolf's reply ( https://github.com/huggingface/transformers/issues/92#issuecomment-444677920 ), but I'm not entirely sure whether that is what he meant.

What happens if we lowercase the text?

Answer 1

Zab*_*azi 5

The tokenizer takes care of this, so you don't need to lowercase the text yourself.

A simple example:

import torch
from transformers import BertTokenizer

# Uncased checkpoint: the tokenizer lowercases text before looking up IDs.
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased', padding_side='right')

# padding='max_length' replaces the deprecated pad_to_max_length=True.
input_ids = torch.tensor(tokenizer.encode('this is a cat', add_special_tokens=True, max_length=10, padding='max_length', truncation=True)).unsqueeze(0)
print(input_ids)

input_ids = torch.tensor(tokenizer.encode('This is a Cat', add_special_tokens=True, max_length=10, padding='max_length', truncation=True)).unsqueeze(0)
print(input_ids)

Output:

tensor([[ 101, 2023, 2003, 1037, 4937,  102,    0,    0,    0,    0]])
tensor([[ 101, 2023, 2003, 1037, 4937,  102,    0,    0,    0,    0]])
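
Both inputs map to the same IDs because an uncased BERT tokenizer normalizes the text before WordPiece splitting. The following is a minimal standard-library sketch of that normalization (lowercasing, then stripping accents). The function name `uncased_normalize` is hypothetical, and this only approximates the library's behavior; it is not the Hugging Face implementation.

```python
import unicodedata


def uncased_normalize(text: str) -> str:
    """Approximate the normalization of an uncased BERT tokenizer (hypothetical helper)."""
    text = text.lower()
    # Decompose accented characters, then drop the combining marks (category "Mn").
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if unicodedata.category(ch) != "Mn")


# Both spellings collapse to the same string, hence the same token IDs.
print(uncased_normalize("this is a cat"))
print(uncased_normalize("This is a Cat"))
```

Lowercasing the text yourself before passing it to an uncased tokenizer is therefore redundant rather than harmful.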

With the cased model, however:

# Cased checkpoint: case is preserved, so the two inputs get different IDs.
tokenizer = BertTokenizer.from_pretrained('bert-base-cased', padding_side='right')

input_ids = torch.tensor(tokenizer.encode('this is a cat', add_special_tokens=True, max_length=10, padding='max_length', truncation=True)).unsqueeze(0)
print(input_ids)

input_ids = torch.tensor(tokenizer.encode('This is a Cat', add_special_tokens=True, max_length=10, padding='max_length', truncation=True)).unsqueeze(0)
print(input_ids)

Output:

tensor([[ 101, 1142, 1110,  170, 5855,  102,    0,    0,    0,    0]])

tensor([[ 101, 1188, 1110,  170, 8572,  102,    0,    0,    0,    0]])
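
To make the difference concrete, here is a toy sketch that reproduces the four outputs above with plain dictionaries. The vocabularies contain only the IDs printed in those outputs, and `toy_encode` is a hypothetical whitespace-splitting stand-in; a real BERT tokenizer uses WordPiece and a full vocabulary.

```python
# Toy vocabularies built only from the IDs printed in the outputs above.
UNCASED_VOCAB = {"this": 2023, "is": 2003, "a": 1037, "cat": 4937}
CASED_VOCAB = {"this": 1142, "This": 1188, "is": 1110, "a": 170,
               "cat": 5855, "Cat": 8572}
CLS_ID, SEP_ID, PAD_ID = 101, 102, 0  # [CLS], [SEP], [PAD]


def toy_encode(text, vocab, do_lower_case, max_length=10):
    """Whitespace-split stand-in for a real tokenizer (hypothetical, no WordPiece)."""
    if do_lower_case:
        text = text.lower()
    ids = [CLS_ID] + [vocab[word] for word in text.split()] + [SEP_ID]
    # Right-pad with the padding ID up to max_length.
    return ids + [PAD_ID] * (max_length - len(ids))


print(toy_encode("This is a Cat", UNCASED_VOCAB, do_lower_case=True))
print(toy_encode("This is a Cat", CASED_VOCAB, do_lower_case=False))
```

In the cased vocabulary, "This" and "this" are separate entries, so lowercasing the text by hand would change which IDs a cased model sees.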

In general, the tokenizer's case-handling behavior is specified by the do_lower_case property in the model's tokenizer_config.json. (2 upvotes)
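
As a rough illustration of checking that flag, the sketch below parses tokenizer_config.json text with the standard library. The inline JSON string is a hypothetical stand-in for the real file (which usually contains more keys), and `read_do_lower_case` is an illustrative helper, not part of any library.

```python
import json
from typing import Optional

# Hypothetical file contents; a real tokenizer_config.json usually has more keys.
UNCASED_CONFIG_TEXT = '{"do_lower_case": true}'
CASED_CONFIG_TEXT = '{"do_lower_case": false}'


def read_do_lower_case(config_json: str) -> Optional[bool]:
    """Return the do_lower_case flag, or None if the config does not set it."""
    return json.loads(config_json).get("do_lower_case")


print(read_do_lower_case(UNCASED_CONFIG_TEXT))
print(read_do_lower_case(CASED_CONFIG_TEXT))
```

Returning None when the key is absent avoids guessing; in that case the tokenizer class's own default applies.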
