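I am trying to run the language model fine-tuning script (run_language_modeling.py) from the Huggingface examples with my own tokenizer (I just added a few tokens, see the comments). I am having trouble loading the tokenizer. I think the problem is with AutoTokenizer.from_pretrained('local/path/to/directory').
Code: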
from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained('bert-base-uncased')
# special_tokens = ['<HASHTAG>', '<URL>', '<AT_USER>', '<EMOTICON-HAPPY>', '<EMOTICON-SAD>']
# tokenizer.add_tokens(special_tokens)
tokenizer.save_pretrained('../twitter/twittertokenizer/')
tmp = AutoTokenizer.from_pretrained('../twitter/twittertokenizer/')
Error message:
OSError Traceback (most recent call last)
/z/huggingface_venv/lib/python3.7/site-packages/transformers/configuration_utils.py in get_config_dict(cls, pretrained_model_name_or_path, pretrained_config_archive_map, **kwargs)
248 resume_download=resume_download,
--> 249 local_files_only=local_files_only,
250 )
/z/huggingface_venv/lib/python3.7/site-packages/transformers/file_utils.py in cached_path(url_or_filename, cache_dir, force_download, proxies, resume_download, user_agent, extract_compressed_file, force_extract, local_files_only)
265 # File, but it doesn't exist.
--> 266 raise EnvironmentError("file {} not found".format(url_or_filename))
267 else:
OSError: file ../twitter/twittertokenizer/config.json not found
During …

I am trying to read and process a large JSON file (~16 GB), but even when I read it in small chunks by specifying chunksize=500, I still get a memory error. My code:
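import pandas as pd

i = 0
header = True
for chunk in pd.read_json('filename.json.tsv', lines=True, chunksize=500):
    print("Processing chunk ", i)
    process_chunk(chunk, header, i)  # pass header so the call matches the function signature
    i += 1
    header = False  # write the CSV header only for the first chunk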
def process_chunk(chunk, header, i):
    pk_file = 'data/pk_files/500_chunk_' + str(i) + '.pk'
    get_data_pk(chunk, pk_file)  # load and process some columns and save into a pk file for future processing
    preds = get_preds(pk_file)  # SVM prediction
    chunk['prediction'] = preds  # append result column
    chunk.to_csv('result.csv', header=header, mode='a')
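The process_chunk function basically takes each chunk and appends a new column to it.
It works fine when I use a smaller file, and it also works if I specify nrows=5000 in the read_json call. It seems that, for some reason, it still needs memory for the full file size despite the chunksize parameter.
Any ideas? Thanks!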
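One way to narrow this down, assuming the file really holds one JSON object per line (as lines=True implies), is to chunk the file by hand with the standard library and check whether memory still grows. The sketch below is not how pandas implements chunksize; iter_json_line_chunks, process_file, and the predict callback are hypothetical names standing in for the question's process_chunk and get_preds, and it assumes every record has the same keys.

```python
import csv
import itertools
import json
import os
import tempfile


def iter_json_line_chunks(path, chunk_size):
    """Yield lists of at most chunk_size parsed records from a JSON Lines file."""
    with open(path, "r", encoding="utf-8") as handle:
        while True:
            # islice pulls at most chunk_size lines; the rest of the file stays on disk.
            lines = list(itertools.islice(handle, chunk_size))
            if not lines:
                break
            yield [json.loads(line) for line in lines if line.strip()]


def process_file(src_path, dst_path, chunk_size, predict):
    """Add a 'prediction' field to every record and write the CSV header once."""
    chunks_seen = 0
    writer = None
    with open(dst_path, "w", newline="", encoding="utf-8") as out:
        for records in iter_json_line_chunks(src_path, chunk_size):
            if not records:
                continue  # chunk contained only blank lines
            for record in records:
                record["prediction"] = predict(record)
            if writer is None:
                # Assumption: all records share the keys of the first one.
                writer = csv.DictWriter(out, fieldnames=list(records[0].keys()))
                writer.writeheader()
            writer.writerows(records)
            chunks_seen += 1
    return chunks_seen


def demo():
    """Run the sketch on a small generated file and return (chunk count, CSV rows)."""
    with tempfile.TemporaryDirectory() as tmp:
        src = os.path.join(tmp, "sample.jsonl")
        dst = os.path.join(tmp, "result.csv")
        with open(src, "w", encoding="utf-8") as handle:
            for n in range(12):
                handle.write(json.dumps({"id": n, "text": "row %d" % n}) + "\n")
        chunks = process_file(src, dst, chunk_size=5, predict=lambda r: r["id"] % 2)
        with open(dst, newline="", encoding="utf-8") as handle:
            rows = list(csv.DictReader(handle))
    return chunks, rows
```

If memory stays flat with this manual loop but not with pd.read_json, that points at the reader rather than at the per-chunk processing; if it still grows, the processing step (here the predict callback) is the more likely place to look.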