Is there a way to save a pre-compiled AutoTokenizer?

alv*_*vas 5 python nlp tokenize huggingface-tokenizers huggingface

Sometimes we have to do something like this to extend a pre-trained tokenizer:

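from datasets import load_dataset
from transformers import AutoTokenizer

# Load the base tokenizer first; it is needed to train the new tokenizers.
tokenizer = AutoTokenizer.from_pretrained(
    'moussaKam/frugalscore_tiny_bert-base_bert-score'
)

# Request a concrete split so that the 'text' column can be indexed directly.
ds_de = load_dataset("mc4", 'de', split='train')
ds_fr = load_dataset("mc4", 'fr', split='train')

# Train language-specific tokenizers with the same algorithm as the base one.
de_tokenizer = tokenizer.train_new_from_iterator(
    ds_de['text'], vocab_size=50_000
)
fr_tokenizer = tokenizer.train_new_from_iterator(
    ds_fr['text'], vocab_size=50_000
)

# Collect the tokens that the base vocabulary does not contain yet.
new_tokens_de = set(de_tokenizer.vocab).difference(tokenizer.vocab)
new_tokens_fr = set(fr_tokenizer.vocab).difference(tokenizer.vocab)
new_tokens = new_tokens_de.union(new_tokens_fr)

# Extend the base tokenizer with the new tokens and save it.
tokenizer.add_tokens(list(new_tokens))
tokenizer.save_pretrained('frugalscore_tiny_bert-de-fr')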
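The set arithmetic used above to collect the new tokens can be tried on toy data without downloading anything. This is only a minimal sketch: `collect_new_tokens` and the three small vocabularies are hypothetical names and values made up for illustration, standing in for `tokenizer.vocab`, `de_tokenizer.vocab` and `fr_tokenizer.vocab`.

```python
def collect_new_tokens(base_vocab, *trained_vocabs):
    """Return tokens found in any trained vocab but missing from the base vocab."""
    new_tokens = set()
    for vocab in trained_vocabs:
        # Iterating over a dict yields its keys, i.e. the token strings.
        new_tokens |= set(vocab).difference(base_vocab)
    return new_tokens


# Toy vocabularies (token -> id), invented for this example only.
base_vocab = {"[UNK]": 0, "the": 1, "haus": 2}
de_vocab = {"haus": 0, "straße": 1, "und": 2}
fr_vocab = {"the": 0, "maison": 1, "und": 2}

new_tokens = collect_new_tokens(base_vocab, de_vocab, fr_vocab)
print(sorted(new_tokens))  # ['maison', 'straße', 'und']
```

Tokens shared by both trained vocabularies (here "und") appear only once, because the result is a set.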

Then, when loading the tokenizer,

tokenizer = AutoTokenizer.from_pretrained(
  'frugalscore_tiny_bert-de-fr', local_files_only=True
)

loading takes a very long time according to %%time in a Jupyter cell:

CPU times: user 34min 20s
Wall time: 34min 22s
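Outside of Jupyter, a comparable measurement can be taken with the standard library. The following is a rough sketch, not what `%%time` does internally: the `timed` helper and the `timings` dictionary are hypothetical names, and the workload inside the `with` block is a stand-in for the `AutoTokenizer.from_pretrained(...)` call from the question.

```python
import time
from contextlib import contextmanager


@contextmanager
def timed(label, results=None):
    """Measure wall-clock and CPU time of the enclosed block."""
    wall_start = time.perf_counter()
    cpu_start = time.process_time()
    try:
        yield
    finally:
        wall = time.perf_counter() - wall_start
        cpu = time.process_time() - cpu_start
        if results is not None:
            results[label] = {"wall": wall, "cpu": cpu}
        print(f"{label}: CPU {cpu:.2f}s, wall {wall:.2f}s")


timings = {}
with timed("load", timings):
    # Stand-in workload; replace with the tokenizer loading call to time it.
    sum(i * i for i in range(100_000))
```

If CPU time and wall time are nearly equal, as in the numbers reported above, the load is bound by computation rather than by disk or network access.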

I guess this is due to the regex compilation for the added tokens, which was also raised in https://github.com/huggingface/tokenizers/issues/914

I think that is acceptable, since the tokenizer is loaded once and the work can then be done without repeating the regex compilation.

But is there a way to save the tokenizer in binary form and avoid the whole regex compilation the next time it is loaded?

小智 1

from transformers import AutoTokenizer

# Instantiate the tokenizer
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

# Save the tokenizer files to a directory
# (save_pretrained has no documented save_tokenizer argument, so it is dropped here;
# the files are written as JSON/text, not in a binary format)
tokenizer.save_pretrained("/path/to/save_directory")

# Load the tokenizer back from the saved directory
loaded_tokenizer = AutoTokenizer.from_pretrained("/path/to/save_directory")