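I am reproducing the code from this page. I have downloaded the BERT model to my local system and am computing sentence embeddings with it.
I have about 500,000 sentences that need embeddings, and this takes a very long time.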
#!pip install transformers
import torch
import transformers
from transformers import BertTokenizer, BertModel
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
model = BertModel.from_pretrained('bert-base-uncased',
                                  output_hidden_states = True, # Whether the model returns all hidden-states.
                                  )
# Put the model in "evaluation" mode, meaning feed-forward operation.
model.eval()
corpa=["i am a boy","i live in a city"]
storage=[]#list to store all embeddings
for text in corpa:
    # Add the special tokens.
    marked_text = "[CLS] " + text + …
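One common reason such a loop is slow is that it runs the model once per sentence. A minimal sketch of splitting the corpus into fixed-size batches follows; the helper name `iter_batches` and the batch size are assumptions for illustration, not part of the original code. The comments indicate where the tokenizer and model calls would go, but they are left out so that the example runs without downloading a model.

```python
from typing import Iterator, List, Sequence


def iter_batches(sentences: Sequence[str], batch_size: int) -> Iterator[List[str]]:
    """Yield consecutive slices of at most `batch_size` sentences."""
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    for start in range(0, len(sentences), batch_size):
        yield list(sentences[start:start + batch_size])


# Tiny stand-in for the 500,000-sentence corpus.
corpus = ["i am a boy", "i live in a city", "i like tea"]
batches = list(iter_batches(corpus, batch_size=2))
print(batches)

# Each batch could then be encoded in a single call, for example:
#   enc = tokenizer(batch, padding=True, truncation=True, return_tensors="pt")
#   with torch.no_grad():
#       out = model(**enc)
# Calling the tokenizer this way adds [CLS] and [SEP] itself, so the
# manual "[CLS] " + text concatenation would not be needed.
```

Larger batches usually help most on a GPU; the right batch size depends on the available memory and the sentence lengths.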
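python nlp bert-language-model huggingface-transformers huggingface-tokenizers
I am using the HuggingFace Transformers AutoTokenizer to tokenize small snippets of text. However, the tokenization splits in the middle of words and introduces # characters into the tokens. I have tried several different models with the same result.
Below is an example of a piece of text and the tokens created from it.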
CTO at TLR Communications Pty Ltd
['[CLS]', 'CT', '##O', 'at', 'T', '##LR', 'Communications', 'P', '##ty', 'Ltd', '[SEP]']
Here is the code I use to generate the tokens:
tokenizer = AutoTokenizer.from_pretrained("tokenizer_bert.json")
tokens = tokenizer.tokenize(tokenizer.decode(tokenizer.encode(sequence)))
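The `##` prefix is how WordPiece tokenizers mark a piece that continues the previous token, so the split is expected behaviour rather than an error. If whole words are needed afterwards, the pieces can be glued back together. The sketch below is pure Python; the function name `merge_wordpieces` and the set of special tokens are assumptions for illustration.

```python
from typing import List

# Special tokens to drop; an assumed set, adjust it to the tokenizer in use.
SPECIAL_TOKENS = {"[CLS]", "[SEP]", "[PAD]"}


def merge_wordpieces(tokens: List[str]) -> List[str]:
    """Join WordPiece continuation pieces ('##xyz') onto the preceding token."""
    words: List[str] = []
    for token in tokens:
        if token in SPECIAL_TOKENS:
            continue
        if token.startswith("##") and words:
            # Continuation piece: strip the marker and append to the last word.
            words[-1] += token[2:]
        else:
            words.append(token)
    return words


tokens = ['[CLS]', 'CT', '##O', 'at', 'T', '##LR', 'Communications',
          'P', '##ty', 'Ltd', '[SEP]']
words = merge_wordpieces(tokens)
print(words)  # ['CTO', 'at', 'TLR', 'Communications', 'Pty', 'Ltd']
```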
I am practicing text summarization with Transformers, following this tutorial: https://huggingface.co/transformers/usage.html#summarization
from transformers import pipeline
summarizer = pipeline("summarization")
ARTICLE = """ New York (CNN)When Liana Barrientos was 23 years old, she got married in Westchester County, New York.
A year later, she got married again in Westchester County, but to a different man and without divorcing her first husband.
Only 18 days after that marriage, she got hitched yet again. Then, Barrientos declared "I do" five more times, sometimes only within two weeks of each other.
In …
I am new to Hugging Face Transformers. When I try to load the xlm-roberta-base model from a given path, I run into the following problem:
>> tokenizer = AutoTokenizer.from_pretrained(model_path)
>> Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/home/user/anaconda3/lib/python3.7/site-packages/transformers/tokenization_auto.py", line 182, in from_pretrained
return tokenizer_class.from_pretrained(pretrained_model_name_or_path, *inputs, **kwargs)
File "/home/user/anaconda3/lib/python3.7/site-packages/transformers/tokenization_utils.py", line 309, in from_pretrained
return cls._from_pretrained(*inputs, **kwargs)
File "/home/user/anaconda3/lib/python3.7/site-packages/transformers/tokenization_utils.py", line 458, in _from_pretrained
tokenizer = cls(*init_inputs, **init_kwargs)
File "/home/user/anaconda3/lib/python3.7/site-packages/transformers/tokenization_roberta.py", line 98, in __init__
**kwargs,
File "/home/user/anaconda3/lib/python3.7/site-packages/transformers/tokenization_gpt2.py", line 133, in __init__
with open(vocab_file, encoding="utf-8") as vocab_handle:
TypeError: expected str, bytes or os.PathLike object, not NoneType
However, if I load it by its name, there is no problem:
>> tokenizer …
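I want to use HuggingFace's Transformers with the pretrained "xlm-mlm-xnli15-1024" model to translate Chinese to English. This tutorial shows how to do it from English to German.
I tried following the tutorial, but it does not explain in detail how to change the language manually or how to decode the result. I don't know where to start. Sorry that this question cannot be more specific.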
Here is what I tried:
from transformers import AutoModelWithLMHead, AutoTokenizer
base_model = "xlm-mlm-xnli15-1024"
model = AutoModelWithLMHead.from_pretrained(base_model)
tokenizer = AutoTokenizer.from_pretrained(base_model)

inputs = tokenizer.encode("translate English to Chinese: Hugging Face is a technology company based in New York and Paris", return_tensors="pt")
outputs = model.generate(inputs, max_length=40, num_beams=4, early_stopping=True)

print(tokenizer.decode(outputs.tolist()[0]))

'<s>translate english to chinese : hugging face is a technology company based in new york and paris </s>china hug ™ ™ ™ ™ ™ ™ ™ ™ ™ ™ ™ ™ ™ ™ …
translation nlp machine-translation huggingface-transformers huggingface-tokenizers
I am writing an inference script for an already trained NER model, but I am having trouble converting the encoded tokens (their ids) back into the original words.
# example input
df = pd.DataFrame({'_id': [1], 'body': ['Amazon and Tesla are currently the best picks out there!']})
# calling method that handles inference:
ner_model = NER()
ner_model.recognize_from_df(df, 'body')
# here is only part of larger NER class that handles the inference:
def recognize_from_df(self, df: pd.DataFrame, input_col: str):
    predictions = []
    df = df[['_id', input_col]].copy()
    dataset = Dataset.from_pandas(df)
    # tokenization, padding, truncation:
    encoded_dataset = dataset.map(lambda examples: self.bert_tokenizer(examples[input_col],
                                  padding='max_length', truncation=True, max_length=512), batched=True)
    encoded_dataset.set_format(type='torch', columns=['input_ids', 'attention_mask'], device=device)
    dataloader …
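One possible way to get from token-level predictions back to words is to use the token-to-word alignment. With fast tokenizers in Transformers, the returned encoding exposes `word_ids()`, which gives the index of the source word for each token and `None` for special and padding tokens. The sketch below only shows the aggregation step in pure Python: the helper name `labels_per_word`, the example alignment, and the labels are assumptions, and it assumes the list of words is already known.

```python
from typing import List, Optional, Tuple


def labels_per_word(
    words: List[str],
    word_ids: List[Optional[int]],
    token_labels: List[str],
) -> List[Tuple[str, str]]:
    """Keep the label of the first sub-token of every word."""
    result: List[Tuple[str, str]] = []
    seen = set()
    for word_id, label in zip(word_ids, token_labels):
        # None marks special or padding tokens; later pieces of a word are skipped.
        if word_id is None or word_id in seen:
            continue
        seen.add(word_id)
        result.append((words[word_id], label))
    return result


# Hypothetical alignment: [CLS], Amazon, and, Tes, ##la, [SEP], [PAD]
words = ["Amazon", "and", "Tesla"]
word_ids = [None, 0, 1, 2, 2, None, None]
token_labels = ["O", "B-ORG", "O", "B-ORG", "I-ORG", "O", "O"]

tagged = labels_per_word(words, word_ids, token_labels)
print(tagged)  # [('Amazon', 'B-ORG'), ('and', 'O'), ('Tesla', 'B-ORG')]
```

Taking the first sub-token's label is only one convention; majority voting over the pieces of a word is another.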
python pytorch huggingface-transformers huggingface-tokenizers huggingface-datasets
Goal: modify this notebook to work with the albert-base-v2 model.
An error occurs in Section 1.3.
Kernel: conda_pytorch_p36. I did Restart & Run All and refreshed the file view in the working directory.
The error message lists 3 possible causes. I am not sure which one applies to my case.
Section 1.3:
# define the tokenizer
tokenizer = AutoTokenizer.from_pretrained(
    configs.output_dir, do_lower_case=configs.do_lower_case)
Traceback:
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
<ipython-input-25-1f864e3046eb> in <module>
140 # define the tokenizer
141 tokenizer = AutoTokenizer.from_pretrained(
--> 142 configs.output_dir, do_lower_case=configs.do_lower_case)
143
144 # Evaluate the original FP32 BERT model
~/anaconda3/envs/pytorch_p36/lib/python3.6/site-packages/transformers/models/auto/tokenization_auto.py in from_pretrained(cls, pretrained_model_name_or_path, *inputs, **kwargs)
548 tokenizer_class_py, tokenizer_class_fast = TOKENIZER_MAPPING[type(config)]
549 if tokenizer_class_fast and (use_fast or tokenizer_class_py is None):
--> 550 …
python tensorflow onnx huggingface-transformers huggingface-tokenizers
How can I handle sequences longer than 512 tokens? I do not want to use truncation=True; I actually want to process the longer sequences.
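One common approach is to split the token ids into overlapping windows and run the model on each window. A minimal sketch in pure Python follows; the function name `sliding_windows` and its parameters are assumptions for illustration, and `max_len` should leave room for any special tokens the model needs. Transformers tokenizers also offer `return_overflowing_tokens=True` together with `stride`, which produces similar overlapping chunks.

```python
from typing import List


def sliding_windows(ids: List[int], max_len: int, stride: int) -> List[List[int]]:
    """Split `ids` into windows of at most `max_len` ids.

    `stride` is the number of ids shared by two consecutive windows.
    """
    if max_len <= 0:
        raise ValueError("max_len must be positive")
    if not 0 <= stride < max_len:
        raise ValueError("stride must satisfy 0 <= stride < max_len")
    if not ids:
        return []
    step = max_len - stride
    windows: List[List[int]] = []
    start = 0
    while True:
        windows.append(ids[start:start + max_len])
        if start + max_len >= len(ids):
            break  # the last window reached the end of the sequence
        start += step
    return windows


ids = list(range(10))  # stand-in for real token ids
windows = sliding_windows(ids, max_len=4, stride=1)
print(windows)  # [[0, 1, 2, 3], [3, 4, 5, 6], [6, 7, 8, 9]]
```

How the per-window outputs are combined (averaging, taking the maximum, keeping the prediction from the window where a token is most central) depends on the task.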
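I want to train the "flax-community/t5-large-wikisplit" model with the "dxiao/requirements-ner-id" dataset (just for some experiments).
I think my general procedure is not correct, but I don't know how to go further.
My code:
Load the tokenizer and model: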
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM, AutoModel
checkpoint = "flax-community/t5-large-wikisplit"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForSeq2SeqLM.from_pretrained(checkpoint).cuda()
Load the dataset I want to train on:
from datasets import load_dataset
raw_dataset = load_dataset("dxiao/requirements-ner-id")
raw_dataset looks like this: ['id', 'tokens', 'tags', 'ner_tags']
I want the sentences as sentences rather than as lists of tokens.
def tokenToString(tokenarray):
    string = tokenarray[0]
    for x in tokenarray[1:]:
        string += " " + x
    return string

def sentence_function(example):
    return {"sentence" : tokenToString(example["tokens"]),
            "simplefiedSentence" : tokenToString(example["tokens"]).replace("The", "XXXXXXXXXXX")}

wikisplit_req_set = raw_dataset.map(sentence_function)
wikisplit_req_set
I tried to restructure the dataset so that it looks like the wikisplit dataset:
simple1dataset = wikisplit_req_set.remove_columns(['id', 'tags', 'ner_tags', 'tokens']);
complexdataset = …
python nlp huggingface-transformers huggingface-tokenizers huggingface
Sometimes we have to do something like this to extend a pretrained tokenizer:
from transformers import AutoTokenizer
from datasets import load_dataset

# Load the base tokenizer first; it is used below to train the new tokenizers.
tokenizer = AutoTokenizer.from_pretrained(
    'moussaKam/frugalscore_tiny_bert-base_bert-score'
)

# Assumption: split='train' is requested so that ds['text'] is the text column.
ds_de = load_dataset("mc4", 'de', split='train')
ds_fr = load_dataset("mc4", 'fr', split='train')

de_tokenizer = tokenizer.train_new_from_iterator(
    ds_de['text'],vocab_size=50_000
)
fr_tokenizer = tokenizer.train_new_from_iterator(
    ds_fr['text'],vocab_size=50_000
)

# Tokens learned by the new tokenizers that the base tokenizer does not have.
new_tokens_de = set(de_tokenizer.vocab).difference(tokenizer.vocab)
new_tokens_fr = set(fr_tokenizer.vocab).difference(tokenizer.vocab)
new_tokens = set(new_tokens_de).union(new_tokens_fr)

tokenizer.add_tokens(list(new_tokens))
tokenizer.save_pretrained('frugalscore_tiny_bert-de-fr')
Then, when loading the tokenizer,
tokenizer = AutoTokenizer.from_pretrained(
'frugalscore_tiny_bert-de-fr', local_files_only=True
)
it takes a very long time to load, as measured with %%time in a Jupyter cell:
CPU times: user 34min 20s
Wall time: 34min 22s
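I guess this is due to the regex compilation for the added tokens, which was also raised in https://github.com/huggingface/tokenizers/issues/914
I think that is acceptable, since the tokenizer is loaded once and the work can then be done without redoing the regex compilation.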