AutoTokenizer.from_pretrained fails to load a locally saved pretrained tokenizer (PyTorch)

fer*_*567 7 python deep-learning pytorch huggingface-transformers huggingface-tokenizers

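I am new to PyTorch and have recently been experimenting with Transformers. I am using the pretrained tokenizers provided by HuggingFace.
I can download and run them successfully, but when I save one and try to load it again, I get an error.
If I use AutoTokenizer.from_pretrained to download a tokenizer, it works.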

[1]:    tokenizer = AutoTokenizer.from_pretrained('distilroberta-base')
        text = "Hello there"
        enc = tokenizer.encode_plus(text)
        enc.keys()

Out[1]: dict_keys(['input_ids', 'attention_mask'])

However, if I save it with tokenizer.save_pretrained("distilroberta-tokenizer") and then try to load it locally, it fails.

[2]:    tmp = AutoTokenizer.from_pretrained('distilroberta-tokenizer')


---------------------------------------------------------------------------
OSError                                   Traceback (most recent call last)
/opt/conda/lib/python3.7/site-packages/transformers/configuration_utils.py in get_config_dict(cls, pretrained_model_name_or_path, **kwargs)
    238                 resume_download=resume_download,
--> 239                 local_files_only=local_files_only,
    240             )

/opt/conda/lib/python3.7/site-packages/transformers/file_utils.py in cached_path(url_or_filename, cache_dir, force_download, proxies, resume_download, user_agent, extract_compressed_file, force_extract, local_files_only)
    266         # File, but it doesn't exist.
--> 267         raise EnvironmentError("file {} not found".format(url_or_filename))
    268     else:

OSError: file distilroberta-tokenizer/config.json not found

During handling of the above exception, another exception occurred:

OSError                                   Traceback (most recent call last)
<ipython-input-25-3bd2f7a79271> in <module>
----> 1 tmp = AutoTokenizer.from_pretrained("distilroberta-tokenizer")

/opt/conda/lib/python3.7/site-packages/transformers/tokenization_auto.py in from_pretrained(cls, pretrained_model_name_or_path, *inputs, **kwargs)
    193         config = kwargs.pop("config", None)
    194         if not isinstance(config, PretrainedConfig):
--> 195             config = AutoConfig.from_pretrained(pretrained_model_name_or_path, **kwargs)
    196 
    197         if "bert-base-japanese" in pretrained_model_name_or_path:

/opt/conda/lib/python3.7/site-packages/transformers/configuration_auto.py in from_pretrained(cls, pretrained_model_name_or_path, **kwargs)
    194 
    195         """
--> 196         config_dict, _ = PretrainedConfig.get_config_dict(pretrained_model_name_or_path, **kwargs)
    197 
    198         if "model_type" in config_dict:

/opt/conda/lib/python3.7/site-packages/transformers/configuration_utils.py in get_config_dict(cls, pretrained_model_name_or_path, **kwargs)
    250                 f"- or '{pretrained_model_name_or_path}' is the correct path to a directory containing a {CONFIG_NAME} file\n\n"
    251             )
--> 252             raise EnvironmentError(msg)
    253 
    254         except json.JSONDecodeError:

OSError: Can't load config for 'distilroberta-tokenizer'. Make sure that:

- 'distilroberta-tokenizer' is a correct model identifier listed on 'https://huggingface.co/models'

- or 'distilroberta-tokenizer' is the correct path to a directory containing a config.json file


The file config.json is missing from the directory. Listing the directory shows these files:

[3]:    !ls distilroberta-tokenizer

Out[3]: merges.txt  special_tokens_map.json  tokenizer_config.json  vocab.json

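I know this problem has been posted before, but none of the suggested solutions seem to work. I also tried following the documentation, but I still could not get it to work.
Any help would be appreciated.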

cro*_*oik 7

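There is an issue currently under investigation that affects only AutoTokenizer, not the underlying tokenizer classes such as RobertaTokenizer. For example, the following should work: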

from transformers import RobertaTokenizer

tokenizer = RobertaTokenizer.from_pretrained('YOURPATH')
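
If you want to keep trying AutoTokenizer first and only fall back to the concrete class when it fails, the workaround above could be wrapped roughly as follows. This is only a sketch: load_with_fallback is a hypothetical helper, not part of transformers, and it assumes the failure surfaces as an OSError, as in the traceback in the question.

```python
from typing import Any, Callable, Sequence


def load_with_fallback(path: str, loaders: Sequence[Callable[[str], Any]]) -> Any:
    """Try each loader in order and return the first result that does not raise OSError.

    Hypothetical helper for illustration; the loaders are plain callables that
    take a path, e.g. AutoTokenizer.from_pretrained.
    """
    last_error = None
    for loader in loaders:
        try:
            return loader(path)
        except OSError as err:  # the error type shown in the question's traceback
            last_error = err
    if last_error is None:
        raise ValueError("no loaders were given")
    # Every loader failed: re-raise the most recent error so the cause stays visible.
    raise last_error


# Intended usage (requires transformers to be installed):
# from transformers import AutoTokenizer, RobertaTokenizer
# tokenizer = load_with_fallback(
#     "distilroberta-tokenizer",
#     [AutoTokenizer.from_pretrained, RobertaTokenizer.from_pretrained],
# )
```

The fallback only hides the symptom; saving the config next to the tokenizer is still needed if AutoTokenizer itself has to load the directory.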

To use AutoTokenizer, you also need to save the config so that it can be loaded offline:

from transformers import AutoTokenizer, AutoConfig

tokenizer = AutoTokenizer.from_pretrained('distilroberta-base')
config = AutoConfig.from_pretrained('distilroberta-base')

tokenizer.save_pretrained('YOURPATH')
config.save_pretrained('YOURPATH')

tokenizer = AutoTokenizer.from_pretrained('YOURPATH')

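I recommend either using different paths for the tokenizer and the model, or keeping a copy of your model's config.json. Some modifications you apply to a model are stored in the config.json written by model.save_pretrained(), and that file will be overwritten if you save the tokenizer as described above after saving the model (i.e. you will not be able to load your modified model with the tokenizer's config.json).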
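
One way to guard against that overwrite is to check for an existing config.json before saving anything into the directory. The sketch below uses only the standard library; both function names and the "-tokenizer" suffix are hypothetical choices for illustration, not part of transformers.

```python
from pathlib import Path


def config_would_be_overwritten(save_dir: str) -> bool:
    """Return True if save_dir already holds a config.json,
    for example one written earlier by model.save_pretrained()."""
    return (Path(save_dir) / "config.json").is_file()


def pick_tokenizer_dir(model_dir: str, suffix: str = "-tokenizer") -> str:
    """Hypothetical helper: return a sibling directory name for the tokenizer
    files when model_dir already contains a config.json; otherwise reuse model_dir."""
    if config_would_be_overwritten(model_dir):
        return model_dir.rstrip("/\\") + suffix
    return model_dir


# Intended usage with the objects from the snippet above:
# target = pick_tokenizer_dir('YOURPATH')
# tokenizer.save_pretrained(target)
# config.save_pretrained(target)
```

With this, a model saved to a directory keeps its own config.json, and the tokenizer plus its copy of the config end up in a separate directory.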

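  • Here is the [link](https://huggingface.co/transformers/internal/tokenization_utils.html#transformers.tokenization_utils_base.PreTrainedTokenizerBase.save_pretrained) to the documentation. added_tokens.json is only saved when you have added tokens manually. I think they did not check what they were doing and just print which files can be created. @SrikarManthatti (2 upvotes)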

use*_*533 5

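I see a couple of issues in your code, listed below:

  1. distilroberta-tokenizer is a directory that contains the vocab, config, and other files. Please make sure this directory is created first.

  2. AutoTokenizer works if this directory contains config.json rather than tokenizer_config.json. So, please rename that file.

I modified the code as shown below and it works.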

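import os

dir_name = "distilroberta-tokenizer"

# Create the target directory if it does not exist yet
if not os.path.isdir(dir_name):
    os.mkdir(dir_name)

# tokenizer is the object loaded in the question
tokenizer.save_pretrained(dir_name)

# Rename the config file now (tokenizer_config.json -> config.json)
os.rename(os.path.join(dir_name, "tokenizer_config.json"),
          os.path.join(dir_name, "config.json"))

# tmp = AutoTokenizer.from_pretrained(dir_name)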

I hope this helps!

Thanks!