In HuggingFace, I get the following warning every time I call a pipeline() object:
"Setting `pad_token_id` to `eos_token_id`:{eos_token_id} for open-end generation."
How do I suppress this warning without suppressing all logging warnings? I want the other warnings, just not this one.
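One commonly suggested workaround is to pass pad_token_id explicitly, so that generate() never has to fall back to eos_token_id on its own, which is what triggers the warning. A minimal sketch, assuming a text-generation pipeline with gpt2 (the model name is illustrative; generation kwargs passed at call time are forwarded to generate()):

from transformers import pipeline

generator = pipeline("text-generation", model="gpt2")

# Supplying pad_token_id up front means generate() no longer has to
# substitute eos_token_id itself, so the warning is never emitted.
output = generator(
    "Hello, I'm a language model,",
    max_new_tokens=20,
    pad_token_id=generator.tokenizer.eos_token_id,
)
print(output[0]["generated_text"])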
I am working on a text-classification problem where I want to use a BERT model as the base, followed by dense layers. I want to know how these 3 parameters work. For example, if I have 3 sentences:
'My name is slim shade and I am an aspiring AI Engineer',
'I am an aspiring AI Engineer',
'My name is Slim'
What will these 3 parameters do? My thinking is as follows:

- max_length=5 will keep all sentences strictly to a length of 5
- padding='max_length' will add a padding of 1 to the third sentence
- truncation=True will truncate the first and second sentences so that their length is strictly 5

Please correct me if I am wrong.
Below is the code I have used.
! pip install transformers==3.5.1
import torch
from transformers import BertTokenizerFast

text = ['My name is slim shade and I am an aspiring AI Engineer',
        'I am an aspiring AI Engineer',
        'My name is Slim']

tokenizer = BertTokenizerFast.from_pretrained('bert-base-uncased')
# Pad every sequence to max_length=5 and truncate anything longer.
tokens = tokenizer.batch_encode_plus(text, max_length=5, padding='max_length', truncation=True)

text_seq = torch.tensor(tokens['input_ids'])
text_mask = torch.tensor(tokens['attention_mask'])
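Note that max_length counts the special tokens [CLS] and [SEP] that BERT wraps around every sequence, so with max_length=5 only three word pieces of each sentence actually survive, and padding='max_length' only pads sentences that come out shorter than 5. A quick way to verify the actual behaviour is to map the IDs back to tokens (continuing from the code above):

for ids in tokens['input_ids']:
    print(tokenizer.convert_ids_to_tokens(ids))
# Every row is exactly 5 tokens, e.g.:
# ['[CLS]', 'my', 'name', 'is', '[SEP]']
# ['[CLS]', 'i', 'am', 'an', '[SEP]']
# ['[CLS]', 'my', 'name', 'is', '[SEP]']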
I am currently using the Huggingface pipeline for sentiment analysis as follows:
from transformers import pipeline
classifier = pipeline('sentiment-analysis', device=0)
The problem is that it crashes when I pass texts larger than 512 tokens, saying the input is too long. Is there any way to pass the max_length and truncation parameters from the tokenizer directly to the pipeline?
My workaround is to do:
from transformers import AutoTokenizer, AutoModelForSequenceClassification
model_name = "nlptown/bert-base-multilingual-uncased-sentiment"
model = AutoModelForSequenceClassification.from_pretrained(model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name)
classifier = pipeline('sentiment-analysis', model=model, tokenizer=tokenizer, device=0)
Then when I call the tokenizer:
pt_batch = tokenizer(text, padding=True, truncation=True, max_length=512, return_tensors="pt")
But it would be nicer to be able to call the pipeline directly like so:
classifier(text, padding=True, truncation=True, max_length=512)
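In recent transformers releases, the text-classification pipeline forwards extra keyword arguments from the call straight to the tokenizer, so the exact call above may already work depending on your version. A minimal sketch (version-dependent behaviour):

from transformers import pipeline

classifier = pipeline(
    "sentiment-analysis",
    model="nlptown/bert-base-multilingual-uncased-sentiment",
    device=0,
)

long_text = "This is a very long review. " * 200  # well over 512 tokens

# These kwargs are forwarded to the tokenizer at preprocessing time,
# so the input is truncated to 512 tokens instead of crashing the model.
print(classifier(long_text, padding=True, truncation=True, max_length=512))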
I am facing the below issue when loading a pretrained model from HuggingFace.
HTTPSConnectionPool(host='huggingface.co', port=443): Max retries exceeded with url: /roberta-base/resolve/main/config.json (Caused by SSLError(SSLCertVerificationError(1, '[SSL: CERTIFICATE_VERIFY_FAILED] certificate verify failed: unable to get local issuer certificate (_ssl.c:1125)')))
The line causing the problem is
tokenizer = AutoTokenizer.from_pretrained('roberta-base')
I have never faced this issue before, and it used to work perfectly earlier. I am clueless about what changed.
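This error means Python could not validate huggingface.co's certificate chain, which is often caused by a stale local CA bundle or a proxy intercepting TLS. A common first step (an assumption about the environment, not a guaranteed fix) is to refresh certifi and point the standard environment variables at its bundle before importing transformers:

# pip install --upgrade certifi
import os
import certifi

# Point both requests and the ssl module at certifi's up-to-date CA bundle.
os.environ["REQUESTS_CA_BUNDLE"] = certifi.where()
os.environ["SSL_CERT_FILE"] = certifi.where()

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("roberta-base")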
I am using the SentenceTransformers library (here: https://pypi.org/project/sentence-transformers/#pretrained-models) to create sentence embeddings with the pretrained model bert-base-nli-mean-tokens. I have an application that will be deployed on a device with no internet access. How to save the model and download the pretrained BERT model locally has already been answered elsewhere. However, I am stuck on loading the saved model back from the locally saved path.
When I try to save the model using the above technique, these are the output files:
('/bert-base-nli-mean-tokens/tokenizer_config.json',
 '/bert-base-nli-mean-tokens/special_tokens_map.json',
 '/bert-base-nli-mean-tokens/vocab.txt',
 '/bert-base-nli-mean-tokens/added_tokens.json')
When I try to load it into memory, using
tokenizer = AutoTokenizer.from_pretrained(to_save_path)
I am getting
Can't load config for '/bert-base-nli-mean-tokens'. Make sure that:
- '/bert-base-nli-mean-tokens' is a correct model identifier listed on 'https://huggingface.co/models'
- or '/bert-base-nli-mean-tokens' is the correct path to a directory containing a config.json 
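The error is saying that the directory holds only the four tokenizer files listed above and no config.json, because tokenizer.save_pretrained() saves nothing about the model itself. A minimal sketch of a full save/load round trip through plain transformers (the local path and the sentence-transformers/bert-base-nli-mean-tokens hub id are illustrative):

from transformers import AutoModel, AutoTokenizer

save_path = "/bert-base-nli-mean-tokens"  # illustrative local path

# Save *both* the model (which writes config.json plus the weights)
# and the tokenizer into the same directory.
model = AutoModel.from_pretrained("sentence-transformers/bert-base-nli-mean-tokens")
tokenizer = AutoTokenizer.from_pretrained("sentence-transformers/bert-base-nli-mean-tokens")
model.save_pretrained(save_path)
tokenizer.save_pretrained(save_path)

# Later, on the offline device, both load from that same directory.
model = AutoModel.from_pretrained(save_path)
tokenizer = AutoTokenizer.from_pretrained(save_path)

If you embed through the sentence-transformers API itself, SentenceTransformer('bert-base-nli-mean-tokens').save(save_path) followed by SentenceTransformer(save_path) performs the equivalent round trip.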
I am trying to get the sentiment of reviews with the help of a Hugging Face pretrained sentiment-analysis model. It returns an error like: Token indices sequence length is longer than the specified maximum sequence length for this model (651 > 512).
I have attached the code below, please take a look:
from transformers import AutoTokenizer, AutoModelForSequenceClassification, pipeline
import transformers
import pandas as pd
model = AutoModelForSequenceClassification.from_pretrained('/content/drive/MyDrive/Huggingface-Sentiment-Pipeline')
token = AutoTokenizer.from_pretrained('/content/drive/MyDrive/Huggingface-Sentiment-Pipeline')
classifier = pipeline(task='sentiment-analysis', model=model, tokenizer=token)
data = pd.read_csv('/content/drive/MyDrive/DisneylandReviews.csv', encoding='latin-1')
data.head()
The output is
    Review
0   If you've ever been to Disneyland anywhere you...
1   Its been a while since d last time we visit HK...
2   Thanks God it wasn t too hot …
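The model's position embeddings stop at 512 tokens, so longer reviews have to be truncated (or split into chunks and aggregated). A minimal sketch that continues from the code above and simply truncates each review, assuming losing the tail of long reviews is acceptable for your use case:

# Truncate each review to the model's 512-token budget before classifying.
# Decoding and re-encoding is not perfectly invertible for every tokenizer,
# but it keeps the input safely at or near the limit.
def classify_truncated(text):
    ids = token(text, truncation=True, max_length=512)["input_ids"]
    return classifier(token.decode(ids, skip_special_tokens=True))[0]

data["sentiment"] = data["Review"].apply(classify_truncated)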
I am using AutoModelForCausalLM and AutoTokenizer to produce text output with DialoGPT.
For whatever reason, even when using the examples provided by Huggingface, I get this warning:

A decoder-only architecture is being used, but right-padding was detected! For correct generation results, please set `padding_side='left'` when initializing the tokenizer.
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch
tokenizer = AutoTokenizer.from_pretrained("microsoft/DialoGPT-medium")
model = AutoModelForCausalLM.from_pretrained("microsoft/DialoGPT-medium")
# Let's chat for 5 lines
for step in range(5):
    # encode the new user input, add the eos_token and return a tensor in Pytorch
    new_user_input_ids = tokenizer.encode(input(">> User:") + tokenizer.eos_token, return_tensors='pt')
    # append the new user input tokens to the chat history
    bot_input_ids = torch.cat([chat_history_ids, new_user_input_ids], dim=-1) if step > 0 else new_user_input_ids
…
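The warning names its own fix: initialize the tokenizer with left padding. Padding only comes into play once inputs are batched, but setting it at construction time silences the warning (a minimal sketch):

from transformers import AutoTokenizer

# Left padding keeps the most recent tokens adjacent to the position where a
# decoder-only model continues generating, which is why right padding is wrong here.
tokenizer = AutoTokenizer.from_pretrained("microsoft/DialoGPT-medium", padding_side="left")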
I am using pytorch to train a huggingface-transformers model, but every epoch it always outputs this warning:
The current process just got forked. Disabling parallelism to avoid deadlocks... To disable this warning, please explicitly set TOKENIZERS_PARALLELISM=(true | false)
How can I disable this warning?
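The warning spells out the switch itself: the TOKENIZERS_PARALLELISM environment variable. Set it before the fast tokenizer does any work in the parent process, otherwise the fork-time check has already fired:

import os

# Must run before the tokenizer is first used (ideally at the top of the script).
os.environ["TOKENIZERS_PARALLELISM"] = "false"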
What exactly is the difference between a "token" and a "special token"?

I understand the following:

What I don't understand is: in what kind of capacity would you want to create a new special token? Are there any examples where we need one, and when would we want to create a special token other than the default ones? If an example uses a special token, why can't a normal token achieve the same objective?

tokenizer.add_tokens(['[EOT]'], special_tokens=True)

I also don't quite understand the following description from the source documentation. What difference does it make to our model if we set add_special_tokens to False?

add_special_tokens (bool, optional, defaults to True) — Whether or not to encode the sequences with the special tokens relative to their model.
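A quick way to see what add_special_tokens controls is to encode the same text both ways and map the IDs back to tokens (a minimal sketch using bert-base-uncased; other models frame sequences with different special tokens):

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

with_specials = tokenizer("hello world")["input_ids"]
without_specials = tokenizer("hello world", add_special_tokens=False)["input_ids"]

print(tokenizer.convert_ids_to_tokens(with_specials))     # ['[CLS]', 'hello', 'world', '[SEP]']
print(tokenizer.convert_ids_to_tokens(without_specials))  # ['hello', 'world']

Because the model was pretrained with [CLS] and [SEP] framing every sequence, dropping them at inference time gives it inputs unlike anything it saw during training, which typically degrades predictions.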
I am facing the below issue when loading a pretrained BERT model from HuggingFace, due to an SSL certificate error.
SSLError: HTTPSConnectionPool(host='huggingface.co', port=443): Max retries exceeded with url: /dslim/bert-base-NER/resolve/main/tokenizer_config.json (Caused by SSLError(SSLCertVerificationError(1, '[SSL: CERTIFICATE_VERIFY_FAILED] certificate verify failed: self signed certificate in certificate chain (_ssl.c:1108)')))
The line that fails is
tokenizer = AutoTokenizer.from_pretrained("dslim/bert-base-NER")
from the following code:
from transformers import AutoTokenizer, AutoModelForTokenClassification
tokenizer = AutoTokenizer.from_pretrained("dslim/bert-base-NER")
model = AutoModelForTokenClassification.from_pretrained("dslim/bert-base-NER")
I expect the pretrained model to be downloaded when I run the code in jupyter lab on Windows.
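The "self signed certificate in certificate chain" wording usually points at a corporate proxy or antivirus re-signing TLS traffic. If that is the case here (an assumption about your network), exporting the proxy's root certificate and pointing Python at it is the usual fix:

import os

# Illustrative Windows path: export your organisation's root CA certificate
# from the OS/browser trust store and adjust the path accordingly.
os.environ["REQUESTS_CA_BUNDLE"] = r"C:\certs\corp-root-ca.pem"

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("dslim/bert-base-NER")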