Tag: huggingface-tokenizers

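BERT get sentence embedding

I am replicating code from this page. I have downloaded the BERT model to my local system and am getting sentence embeddings.

I have around 500,000 sentences that need sentence embeddings, and it is taking a lot of time.

  1. Is there a way to speed up the process?
  2. Would sending a batch of sentences, instead of one sentence at a time, help?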

#!pip install transformers
import torch
import transformers
from transformers import BertTokenizer, BertModel
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
model = BertModel.from_pretrained('bert-base-uncased',
                                  output_hidden_states = True, # Whether the model returns all hidden-states.
                                  )

# Put the model in "evaluation" mode, meaning feed-forward operation.
model.eval()

corpa=["i am a boy","i live in a city"]



storage=[]#list to store all embeddings

for text in corpa:
    # Add the special tokens.
    marked_text = "[CLS] " + text + …
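On the batching question: a minimal sketch of encoding several sentences per forward pass instead of one at a time. The helper names `batches` and `embed_sentences`, the batch size of 64, and the mean pooling over the last hidden layer are all assumptions made for illustration; the truncated snippet above may combine hidden states differently.

```python
from typing import Iterator, List, Sequence


def batches(items: Sequence[str], batch_size: int) -> Iterator[List[str]]:
    """Yield consecutive slices of `items` with at most `batch_size` elements."""
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    for start in range(0, len(items), batch_size):
        yield list(items[start:start + batch_size])


def embed_sentences(sentences: Sequence[str], batch_size: int = 64):
    """Return one mean-pooled vector per sentence (pooling choice is an assumption)."""
    # Imported lazily so the batching helper works without these packages.
    import torch
    from transformers import BertModel, BertTokenizer

    device = "cuda" if torch.cuda.is_available() else "cpu"
    tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
    model = BertModel.from_pretrained("bert-base-uncased").to(device).eval()

    pooled_batches = []
    with torch.no_grad():  # inference only, no gradients needed
        for batch in batches(sentences, batch_size):
            # The tokenizer adds [CLS]/[SEP] itself and pads to the longest sentence.
            enc = tokenizer(batch, padding=True, truncation=True,
                            return_tensors="pt").to(device)
            hidden = model(**enc).last_hidden_state  # (batch, seq_len, hidden)
            # Zero out padding positions before averaging over the sequence.
            mask = enc["attention_mask"].unsqueeze(-1).to(hidden.dtype)
            pooled = (hidden * mask).sum(dim=1) / mask.sum(dim=1)
            pooled_batches.append(pooled.cpu())
    return torch.cat(pooled_batches)
```

Calling `embed_sentences(["i am a boy", "i live in a city"])` would return a tensor with one row per sentence. Sorting sentences by length before batching may reduce padding further, and the sketch assumes at least one input sentence.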

python nlp bert-language-model huggingface-transformers huggingface-tokenizers

Score: 6 | Answers: 2 | Views: 10k

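Transformers AutoTokenizer.tokenize introducing extra characters

I am using the HuggingFace Transformers AutoTokenizer to tokenize small segments of text. However, this tokenization splits incorrectly in the middle of words and introduces # characters into the tokens. I have tried several different models with the same results.

Here is an example of a piece of text and the tokens that were created from it.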

CTO at TLR Communications Pty Ltd
['[CLS]', 'CT', '##O', 'at', 'T', '##LR', 'Communications', 'P', '##ty', 'Ltd', '[SEP]']

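Here is the code I am using to generate the tokens:

from transformers import AutoTokenizer

sequence = "CTO at TLR Communications Pty Ltd"  # the example text shown above
tokenizer = AutoTokenizer.from_pretrained("tokenizer_bert.json")
tokens = tokenizer.tokenize(tokenizer.decode(tokenizer.encode(sequence)))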
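In BERT-style WordPiece vocabularies, a `##` prefix marks a piece that continues the previous token, so the output above is sub-word splitting rather than corruption. A minimal sketch of gluing the pieces back into words follows; the helper `merge_wordpieces` and the `SPECIAL_TOKENS` set are hypothetical names, and the sketch assumes the `##` convention applies to the tokenizer in use.

```python
from typing import Iterable, List

# Special tokens that BERT-style tokenizers add around the text.
SPECIAL_TOKENS = {"[CLS]", "[SEP]", "[PAD]"}


def merge_wordpieces(tokens: Iterable[str]) -> List[str]:
    """Join '##' continuation pieces onto the preceding token."""
    words: List[str] = []
    for token in tokens:
        if token in SPECIAL_TOKENS:
            continue  # drop markers that are not part of the text
        if token.startswith("##") and words:
            words[-1] += token[2:]  # continuation: append without the prefix
        else:
            words.append(token)  # start of a new word
    return words


tokens = ['[CLS]', 'CT', '##O', 'at', 'T', '##LR', 'Communications',
          'P', '##ty', 'Ltd', '[SEP]']
print(merge_wordpieces(tokens))
# ['CTO', 'at', 'TLR', 'Communications', 'Pty', 'Ltd']
```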

python huggingface-transformers huggingface-tokenizers

Score: 6 | Answers: 1 | Views: 1940

Huggingface summarization

I am practicing text summarization with Transformers, following this tutorial: https://huggingface.co/transformers/usage.html#summarization

from transformers import pipeline

summarizer = pipeline("summarization")

ARTICLE = """ New York (CNN)When Liana Barrientos was 23 years old, she got married in Westchester County, New York.
A year later, she got married again in Westchester County, but to a different man and without divorcing her first husband.
Only 18 days after that marriage, she got hitched yet again. Then, Barrientos declared "I do" five more times, sometimes only within two weeks of each other.
In …

huggingface-transformers huggingface-tokenizers

Score: 5 | Answers: 1 | Views: 2271

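Hugging-Face Transformers: loading a model from a path raises an error

I am quite new to Hugging-Face Transformers. When I try to load the xlm-roberta-base model from a given path, I run into the following problem: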

>> tokenizer = AutoTokenizer.from_pretrained(model_path)
>> Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/user/anaconda3/lib/python3.7/site-packages/transformers/tokenization_auto.py", line 182, in from_pretrained
    return tokenizer_class.from_pretrained(pretrained_model_name_or_path, *inputs, **kwargs)
  File "/home/user/anaconda3/lib/python3.7/site-packages/transformers/tokenization_utils.py", line 309, in from_pretrained
    return cls._from_pretrained(*inputs, **kwargs)
  File "/home/user/anaconda3/lib/python3.7/site-packages/transformers/tokenization_utils.py", line 458, in _from_pretrained
    tokenizer = cls(*init_inputs, **init_kwargs)
  File "/home/user/anaconda3/lib/python3.7/site-packages/transformers/tokenization_roberta.py", line 98, in __init__
    **kwargs,
  File "/home/user/anaconda3/lib/python3.7/site-packages/transformers/tokenization_gpt2.py", line 133, in __init__
    with open(vocab_file, encoding="utf-8") as vocab_handle:
TypeError: expected str, bytes or os.PathLike object, not NoneType

However, if I load it by its name, there is no problem:

>> tokenizer …
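The last frame of the traceback shows that `vocab_file` is `None`, which suggests the tokenizer class did not find a vocabulary file under `model_path`. One way to check is to list what the directory actually contains and compare it with the files published for the model on the Hub. The helper `list_model_files` below is a hypothetical diagnostic, not part of the transformers API.

```python
from pathlib import Path
from typing import List


def list_model_files(model_path: str) -> List[str]:
    """Return the sorted names of the files directly inside `model_path`."""
    path = Path(model_path)
    if not path.is_dir():
        raise NotADirectoryError(f"{model_path!r} is not a directory")
    return sorted(p.name for p in path.iterdir() if p.is_file())
```

For example, `print(list_model_files(model_path))` shows whether the directory holds only model weights and a config, or tokenizer files as well.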

huggingface-transformers huggingface-tokenizers

Score: 5 | Answers: 1 | Views: 1621

How to translate Chinese to English using HuggingFace?

I want to translate from Chinese to English using HuggingFace's transformers with the pretrained "xlm-mlm-xnli15-1024" model. This tutorial shows how to do it from English to German.

I tried following the tutorial, but it does not detail how to manually change the language or decode the result. I do not know where to start. Sorry that this question cannot be more specific.

Here is what I tried:

from transformers import AutoModelWithLMHead, AutoTokenizer
base_model = "xlm-mlm-xnli15-1024"
model = AutoModelWithLMHead.from_pretrained(base_model)
tokenizer = AutoTokenizer.from_pretrained(base_model)

inputs = tokenizer.encode("translate English to Chinese: Hugging Face is a technology company based in New York and Paris", return_tensors="pt")
outputs = model.generate(inputs, max_length=40, num_beams=4, early_stopping=True)

print(tokenizer.decode(outputs.tolist()[0]))

'<s>translate english to chinese : hugging face is a technology company based in new york and paris </s>china hug ™ ™ ™ ™ ™ ™ ™ ™ ™ ™ ™ ™ ™ ™ …
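The checkpoint name suggests a masked-language-model checkpoint rather than a translation model, which may explain the repeated symbols in the output. A minimal sketch using a dedicated translation checkpoint instead follows. The checkpoint `Helsinki-NLP/opus-mt-zh-en` and the helper names `load_zh_en_translator` and `translate_zh_to_en` are assumptions that do not come from the question.

```python
from typing import Callable, List, Optional, Sequence


def load_zh_en_translator() -> Callable:
    """Build a translation pipeline (checkpoint name is an assumption)."""
    # Imported lazily so the rest of the module works without transformers.
    from transformers import pipeline
    return pipeline("translation", model="Helsinki-NLP/opus-mt-zh-en")


def translate_zh_to_en(texts: Sequence[str],
                       translator: Optional[Callable] = None) -> List[str]:
    """Translate each text and return only the translated strings."""
    if translator is None:
        translator = load_zh_en_translator()
    # The pipeline returns one dict per input with a "translation_text" key.
    results = translator(list(texts))
    return [result["translation_text"] for result in results]
```

Passing the translator in as an argument keeps the model loaded once across calls, for example `translate_zh_to_en(["我喜欢学习。"], translator=load_zh_en_translator())`.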

translation nlp machine-translation huggingface-transformers huggingface-tokenizers

Score: 5 | Answers: 2 | Views: 6738

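How to convert tokenized words back to the original ones after inference?

I am writing an inference script for an already trained NER model, but I have trouble converting the encoded tokens (their ids) back into the original words.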

# example input
df = pd.DataFrame({'_id': [1], 'body': ['Amazon and Tesla are currently the best picks out there!']})

# calling method that handles inference:
ner_model = NER()
ner_model.recognize_from_df(df, 'body')

# here is only part of larger NER class that handles the inference:
def recognize_from_df(self, df: pd.DataFrame, input_col: str):
    predictions = []
    df = df[['_id', input_col]].copy()
    dataset = Dataset.from_pandas(df)
    # tokenization, padding, truncation:
    encoded_dataset = dataset.map(lambda examples: self.bert_tokenizer(examples[input_col], 
                                      padding='max_length', truncation=True, max_length=512), batched=True)
    encoded_dataset.set_format(type='torch', columns=['input_ids', 'attention_mask'], device=device)
    dataloader …
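One common approach, sketched here under assumptions, relies on the `word_ids()` mapping that fast tokenizers expose on their encoding: each token position maps to the index of the word it came from, or `None` for special and padding tokens. The helper `labels_per_word` is hypothetical, the example word ids and labels are made up for illustration, and keeping the label of the first sub-token is only one possible convention.

```python
from typing import List, Optional, Sequence, Tuple


def labels_per_word(words: Sequence[str],
                    word_ids: Sequence[Optional[int]],
                    token_labels: Sequence[str]) -> List[Tuple[str, str]]:
    """Pair every original word with the label of its first sub-token."""
    result: List[Tuple[str, str]] = []
    seen = set()
    for word_id, label in zip(word_ids, token_labels):
        if word_id is None or word_id in seen:
            continue  # skip special/padding tokens and later sub-tokens
        seen.add(word_id)
        result.append((words[word_id], label))
    return result


words = ["Amazon", "and", "Tesla"]
# Illustrative mapping for: [CLS] Amazon and Tes ##la [SEP]
word_ids = [None, 0, 1, 2, 2, None]
token_labels = ["O", "B-ORG", "O", "B-ORG", "I-ORG", "O"]
print(labels_per_word(words, word_ids, token_labels))
# [('Amazon', 'B-ORG'), ('and', 'O'), ('Tesla', 'B-ORG')]
```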

python pytorch huggingface-transformers huggingface-tokenizers huggingface-datasets

Score: 5 | Answers: 1 | Views: 1624

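HuggingFace AutoTokenizer | ValueError: Couldn't instantiate the backend tokenizer

Goal: amend this notebook to work with the albert-base-v2 model.

The error occurs in Section 1.3.

Kernel: conda_pytorch_p36. I did Restart & Run All and refreshed the file view in the working directory.

Three possible causes of this error are listed. I am not sure which one my case falls under.

Section 1.3: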

# define the tokenizer
tokenizer = AutoTokenizer.from_pretrained(
        configs.output_dir, do_lower_case=configs.do_lower_case)

Traceback:

---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-25-1f864e3046eb> in <module>
    140 # define the tokenizer
    141 tokenizer = AutoTokenizer.from_pretrained(
--> 142         configs.output_dir, do_lower_case=configs.do_lower_case)
    143 
    144 # Evaluate the original FP32 BERT model

~/anaconda3/envs/pytorch_p36/lib/python3.6/site-packages/transformers/models/auto/tokenization_auto.py in from_pretrained(cls, pretrained_model_name_or_path, *inputs, **kwargs)
    548             tokenizer_class_py, tokenizer_class_fast = TOKENIZER_MAPPING[type(config)]
    549             if tokenizer_class_fast and (use_fast or tokenizer_class_py is None):
--> 550 …
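The traceback is cut off, so the actual cause cannot be read from it. One commonly reported cause for SentencePiece-based models such as ALBERT is that the `sentencepiece` package, which is needed to convert a slow tokenizer into a fast one, is not installed in the active kernel. A minimal sketch for checking this follows; the helper `is_installed` is a hypothetical name, and the check is only a guess at the cause here.

```python
import importlib.util


def is_installed(package: str) -> bool:
    """Return True if a top-level package can be found in this environment."""
    return importlib.util.find_spec(package) is not None


# Packages that fast-tokenizer conversion may depend on (assumed relevant here).
for name in ("sentencepiece", "tokenizers"):
    status = "installed" if is_installed(name) else "missing"
    print(f"{name}: {status}")
```

If a package is reported as missing, installing it and restarting the kernel would be the next thing to try.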

python tensorflow onnx huggingface-transformers huggingface-tokenizers

Score: 5 | Answers: 1 | Views: 9832

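How to handle sequences longer than 512 tokens in LayoutLMv3?

How can I handle sequences longer than 512 tokens? I do not want to use truncation=True; I actually want to process the longer sequences.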
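One common way to keep the whole input is to split it into overlapping windows and run the model once per window. A minimal sketch on plain token ids follows. The helper `sliding_windows` and the overlap of 128 are assumptions; a real LayoutLMv3 input would also need its bounding boxes sliced the same way and room left for special tokens. Hugging Face tokenizers offer `return_overflowing_tokens=True` together with `stride` for a similar effect.

```python
from typing import List, Sequence


def sliding_windows(ids: Sequence[int], max_len: int = 512,
                    stride: int = 128) -> List[List[int]]:
    """Split `ids` into windows of at most `max_len` items.

    Consecutive windows overlap by `stride` items.
    """
    if max_len <= 0 or not 0 <= stride < max_len:
        raise ValueError("need max_len > 0 and 0 <= stride < max_len")
    step = max_len - stride
    windows: List[List[int]] = []
    start = 0
    while True:
        windows.append(list(ids[start:start + max_len]))
        if start + max_len >= len(ids):
            break  # this window already reaches the end of the sequence
        start += step
    return windows


print(sliding_windows(list(range(10)), max_len=4, stride=2))
# [[0, 1, 2, 3], [2, 3, 4, 5], [4, 5, 6, 7], [6, 7, 8, 9]]
```

Predictions for tokens that appear in two windows then have to be merged, for example by keeping the prediction from the window in which the token has more context.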

transformer-model huggingface-tokenizers huggingface

Score: 5 | Answers: 1 | Views: 2161

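How to fine-tune a Huggingface Seq2Seq model with a dataset from the Hub?

I want to train the "flax-community/t5-large-wikisplit" model with the "dxiao/requirements-ner-id" dataset (just for some experiments).

I think my general procedure is not correct, but I do not know how to go further.

My code:

Load the tokenizer and model: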

from transformers import AutoTokenizer, AutoModelForSeq2SeqLM, AutoModel
checkpoint = "flax-community/t5-large-wikisplit"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForSeq2SeqLM.from_pretrained(checkpoint).cuda()

Load the dataset I want to train on:

from datasets import load_dataset
raw_dataset = load_dataset("dxiao/requirements-ner-id")

raw_dataset has these columns: ['id', 'tokens', 'tags', 'ner_tags']

I want to get the sentences as sentences rather than as tokens.

def tokenToString(tokenarray):
  string = tokenarray[0]
  for x in tokenarray[1:]:
    string += " " + x
  return string

def sentence_function(example):
  return {"sentence" :  tokenToString(example["tokens"]),
          "simplefiedSentence" : tokenToString(example["tokens"]).replace("The", "XXXXXXXXXXX")}

wikisplit_req_set = raw_dataset.map(sentence_function)
wikisplit_req_set
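The manual loop in `tokenToString` can be expressed with `str.join`, which also copes with an empty token list instead of raising an `IndexError`. A minimal sketch follows; `tokens_to_sentence` and `add_sentence` are hypothetical names, and the example tokens are made up rather than taken from the dataset.

```python
from typing import Dict, List


def tokens_to_sentence(tokens: List[str]) -> str:
    """Join tokens with single spaces; an empty list gives an empty string."""
    return " ".join(tokens)


def add_sentence(example: Dict[str, List[str]]) -> Dict[str, str]:
    """Build the extra column for one dataset row, suitable for Dataset.map."""
    return {"sentence": tokens_to_sentence(example["tokens"])}


print(add_sentence({"tokens": ["The", "system", "shall", "respond"]}))
# {'sentence': 'The system shall respond'}
```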

I tried to restructure the dataset so that it looks like the wikisplit dataset:

simple1dataset = wikisplit_req_set.remove_columns(['id', 'tags', 'ner_tags', 'tokens']);
complexdataset = …

python nlp huggingface-transformers huggingface-tokenizers huggingface

Score: 5 | Answers: 1 | Views: 3617

Is there a way to save a pre-compiled AutoTokenizer?

Sometimes we have to do something like this to extend a pre-trained tokenizer:

from transformers import AutoTokenizer

from datasets import load_dataset


# Load the base tokenizer first; it is needed to train the new tokenizers below.
tokenizer = AutoTokenizer.from_pretrained(
    'moussaKam/frugalscore_tiny_bert-base_bert-score'
)

# split="train" returns a Dataset, so ['text'] selects the text column.
ds_de = load_dataset("mc4", 'de', split="train")
ds_fr = load_dataset("mc4", 'fr', split="train")

de_tokenizer = tokenizer.train_new_from_iterator(
    ds_de['text'],vocab_size=50_000
)

fr_tokenizer = tokenizer.train_new_from_iterator(
    ds_fr['text'],vocab_size=50_000
)

# Keep only the tokens that the base vocabulary does not already contain.
new_tokens_de = set(de_tokenizer.vocab).difference(tokenizer.vocab)
new_tokens_fr = set(fr_tokenizer.vocab).difference(tokenizer.vocab)
new_tokens = set(new_tokens_de).union(new_tokens_fr)

tokenizer.add_tokens(list(new_tokens))

tokenizer.save_pretrained('frugalscore_tiny_bert-de-fr')

Then, when loading the tokenizer,

tokenizer = AutoTokenizer.from_pretrained(
  'frugalscore_tiny_bert-de-fr', local_files_only=True
)

it takes a very long time to load, as measured with %%time in a Jupyter cell:

CPU times: user 34min 20s
Wall time: 34min 22s

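I guess this is due to the regex compilation for the added tokens, which was also raised in https://github.com/huggingface/tokenizers/issues/914

I think that is acceptable, since the tokenizer is loaded once and the work can then be done without compiling the regexes again.

But is there a way to save the tokenizer in binary form and avoid the whole regex compilation the next time?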

python nlp tokenize huggingface-tokenizers huggingface

Score: 5 | Answers: 1 | Views: 206