I use PyTorch to train a huggingface-transformers model, but every epoch it keeps printing this warning:
The current process just got forked. Disabling parallelism to avoid deadlocks... To disable this warning, please explicitly set TOKENIZERS_PARALLELISM=(true | false)
How can I disable this warning?
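For reference, the warning itself points at the fix: set the TOKENIZERS_PARALLELISM environment variable before any worker processes are forked. A minimal sketch (setting it from Python rather than in the shell is just one convenient option):

import os

# Silence the tokenizers parallelism warning, as the message suggests;
# this must run before the tokenizer spawns its worker threads / before forking.
os.environ["TOKENIZERS_PARALLELISM"] = "false"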
What exactly is the difference between a "token" and a "special token"?
I understand the following:
What I don't understand is: in what situations would you want to create a new special token? Are there examples of when we would need one, that is, when we would want to create special tokens beyond the default ones? And if an example uses a special token, why couldn't an ordinary token achieve the same goal?
tokenizer.add_tokens(['[EOT]'], special_tokens=True)
I also don't quite understand the following description from the source documentation. If we set add_special_tokens to False, what difference does it make for our model?
add_special_tokens (bool, optional, defaults to True) — Whether or not to encode the sequences with the special tokens relative to their model.
nlp tokenize bert-language-model huggingface-transformers huggingface-tokenizers
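For what it's worth, a small illustration of what add_special_tokens toggles, assuming bert-base-uncased as a stand-in model: with True the tokenizer wraps the sequence in its model-specific markers ([CLS]/[SEP] for BERT), with False it encodes only the raw tokens.

from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")

# Encode the same text with and without the model's special tokens.
with_special = tokenizer.encode("hello world", add_special_tokens=True)
without_special = tokenizer.encode("hello world", add_special_tokens=False)

print(tokenizer.convert_ids_to_tokens(with_special))     # ['[CLS]', 'hello', 'world', '[SEP]']
print(tokenizer.convert_ids_to_tokens(without_special))  # ['hello', 'world']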
I want to use BertForMaskedLM or BertModel to calculate the perplexity of a sentence, so I wrote code like this:
import numpy as np
import torch
import torch.nn as nn
from transformers import BertTokenizer, BertForMaskedLM
# Load pre-trained model (weights)
with torch.no_grad():
    model = BertForMaskedLM.from_pretrained('hfl/chinese-bert-wwm-ext')
    model.eval()
    # Load pre-trained model tokenizer (vocabulary)
    tokenizer = BertTokenizer.from_pretrained('hfl/chinese-bert-wwm-ext')
    sentence = "我不会忘记和你一起奋斗的时光。"
    tokenize_input = tokenizer.tokenize(sentence)
    tensor_input = torch.tensor([tokenizer.convert_tokens_to_ids(tokenize_input)])
    sen_len = len(tokenize_input)
    sentence_loss = 0.

    for i, word in enumerate(tokenize_input):
        # add mask to i-th character of the sentence
        tokenize_input[i] = '[MASK]'
        mask_input = torch.tensor([tokenizer.convert_tokens_to_ids(tokenize_input)])

        output = model(mask_input)

        prediction_scores = output[0]
        softmax = nn.Softmax(dim=0)
        …
nlp transformer-model pytorch bert-language-model huggingface-transformers
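For reference, a minimal sketch of the mask-one-token-at-a-time pseudo-perplexity the code above appears to be aiming for; averaging over the non-special positions and reading the scores from .logits are assumptions of the sketch, not a quote of the original code.

import torch
import torch.nn.functional as F
from transformers import BertTokenizer, BertForMaskedLM

tokenizer = BertTokenizer.from_pretrained('hfl/chinese-bert-wwm-ext')
model = BertForMaskedLM.from_pretrained('hfl/chinese-bert-wwm-ext')
model.eval()

sentence = "我不会忘记和你一起奋斗的时光。"
input_ids = tokenizer(sentence, return_tensors="pt")["input_ids"]

total_log_prob = 0.0
n_positions = input_ids.size(1) - 2  # skip [CLS] and [SEP]
with torch.no_grad():
    for i in range(1, input_ids.size(1) - 1):
        masked = input_ids.clone()
        true_id = masked[0, i].item()
        masked[0, i] = tokenizer.mask_token_id   # mask exactly one position per pass
        logits = model(masked).logits            # (1, seq_len, vocab_size)
        total_log_prob += F.log_softmax(logits[0, i], dim=-1)[true_id].item()

pseudo_perplexity = torch.exp(torch.tensor(-total_log_prob / n_positions))
print(pseudo_perplexity.item())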
I am following this page. I loaded a dataset and converted it to a Pandas dataframe, then converted it back to a dataset. I cannot match the features because the datasets do not match. How do I set the features of the new dataset so that they match those of the old one?
import pandas as pd
import datasets
from transformers import LongformerTokenizerFast, LongformerForSequenceClassification, Trainer, TrainingArguments, LongformerConfig
import torch.nn as nn
import torch
from torch.utils.data import Dataset, DataLoader
import numpy as np
from sklearn.metrics import accuracy_score, precision_recall_fscore_support
from tqdm import tqdm
#import wandb
import os
train_data_s1, test_data_s1 = datasets.load_dataset('imdb', split =['train[0:500]', 'test[0:500]'],
cache_dir='/media/data_files/github/website_tutorials/data')
print (type (train_data_s1))
#<class 'datasets.arrow_dataset.Dataset'>
#converting to pandas - https://towardsdatascience.com/use-the-datasets-library-of-hugging-face-in-your-next-nlp-project-94e300cca850
print (type(train_data_s1))
df_pandas = pd.DataFrame(train_data_s1)
print (type(df_pandas))
#<class 'datasets.arrow_dataset.Dataset'>
#<class 'pandas.core.frame.DataFrame'>
from datasets import Dataset …
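For reference, one way to keep the two datasets' features aligned, assuming the goal is simply to reuse the original schema, is to pass the old dataset's features when rebuilding from pandas. A minimal sketch (the name train_data_s2 and the preserve_index=False choice are illustrative assumptions):

from datasets import Dataset

# Rebuild a Dataset from the dataframe while reusing the original features,
# so column types (e.g. the ClassLabel for 'label') match the old dataset.
train_data_s2 = Dataset.from_pandas(df_pandas, features=train_data_s1.features, preserve_index=False)
print(train_data_s2.features == train_data_s1.features)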
I have a training dataset of size 4107.
DatasetDict({
train: Dataset({
features: ['input_ids'],
num_rows: 4107
})
valid: Dataset({
features: ['input_ids'],
num_rows: 498
})
})
In my training arguments, the batch size is 8 and the number of epochs is 2.
from transformers import Trainer, TrainingArguments
args = TrainingArguments(
output_dir="code_gen_epoch",
per_device_train_batch_size=8,
per_device_eval_batch_size=8,
evaluation_strategy="epoch",
save_strategy="epoch",
eval_steps=100,
logging_steps=100,
gradient_accumulation_steps=8,
num_train_epochs=2,
weight_decay=0.1,
warmup_steps=1_000,
lr_scheduler_type="cosine",
learning_rate=3.0e-4,
# save_steps=200,
# fp16=True,
load_best_model_at_end = True,
)
trainer = Trainer(
model=model,
tokenizer=tokenizer,
args=args,
data_collator=data_collator,
train_dataset=tokenized_datasets["train"],
eval_dataset=tokenized_datasets["valid"],
)
When I start training, I can see that the number of steps is 128.
My assumption was that the number of steps for 1 epoch should be 4107/8 = 512 (roughly), so for 2 epochs it would be 512 + 512 = 1024.
I don't understand how it ends up being 128.
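One detail visible in the arguments above is gradient_accumulation_steps=8. If the step counter reported by Trainer counts optimizer updates (an assumption about its behaviour, not a quote of its source), each "step" covers 8 batches and the arithmetic lands on 128 rather than 1024:

import math

num_examples = 4107
per_device_train_batch_size = 8
gradient_accumulation_steps = 8
num_train_epochs = 2

batches_per_epoch = math.ceil(num_examples / per_device_train_batch_size)   # 514
updates_per_epoch = batches_per_epoch // gradient_accumulation_steps        # 64
print(updates_per_epoch * num_train_epochs)                                 # 128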
I have the following code:
import transformers
from transformers import pipeline
# Load the language model pipeline
model = pipeline("text-generation", model="gpt2")
# Input sentence for generating next word predictions
input_sentence = "I enjoy walking in the"
I only want to generate the next word given the input sentence, but I want to see a list of all the possible next words together with their probabilities. Any other LLM would be fine; I am just using gpt2 as an example.
In the code, I want to pick only the top 500 or top 1000 suggestions for the next word, together with the probability of each suggested word. How can I do this?
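For reference, a minimal sketch of one way to get such a list, assuming it is acceptable to drop down from pipeline() to the model and tokenizer and read the next-token distribution from the logits directly:

import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")
model.eval()

input_sentence = "I enjoy walking in the"
input_ids = tokenizer(input_sentence, return_tensors="pt").input_ids

with torch.no_grad():
    logits = model(input_ids).logits            # (1, seq_len, vocab_size)

# Probability distribution over the vocabulary for the next token.
next_token_probs = torch.softmax(logits[0, -1], dim=-1)
top = torch.topk(next_token_probs, k=500)       # use k=1000 for the top 1000

for prob, token_id in zip(top.values[:10], top.indices[:10]):
    print(repr(tokenizer.decode(int(token_id))), float(prob))

Note that GPT-2's vocabulary consists of subword tokens, so the entries returned this way are tokens rather than whole dictionary words.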
I have a sentence, and I need to return the text corresponding to the N BERT tokens to the left and right of a specific word.
from transformers import BertTokenizer
tz = BertTokenizer.from_pretrained("bert-base-cased")
sentence = "The Natural Science Museum of Madrid shows the RECONSTRUCTION of a dinosaur"
tokens = tz.tokenize(sentence)
print(tokens)
>>['The', 'Natural', 'Science', 'Museum', 'of', 'Madrid', 'shows', 'the', 'R', '##EC', '##ON', '##ST', '##R', '##UC', '##TI', '##ON', 'of', 'a', 'dinosaur']
What I want is to get the text corresponding to the 4 tokens to the left and to the right of the token Madrid. So I want the tokens ['Natural', 'Science', 'Museum', 'of', 'Madrid', 'shows', 'the', 'R', '##EC'] and then to convert them back to the original text. In this case that would be "Natural Science Museum of Madrid shows the REC".
Is there a way to do this?
python tokenize bert-language-model huggingface-transformers huggingface-tokenizers
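For reference, a sketch of one possible approach, with the assumption that switching to the fast tokenizer (BertTokenizerFast) is acceptable, since its offset mapping ties every token back to a character span in the original string:

from transformers import BertTokenizerFast

tz = BertTokenizerFast.from_pretrained("bert-base-cased")
sentence = "The Natural Science Museum of Madrid shows the RECONSTRUCTION of a dinosaur"

enc = tz(sentence, add_special_tokens=False, return_offsets_mapping=True)
tokens = tz.convert_ids_to_tokens(enc["input_ids"])
offsets = enc["offset_mapping"]               # (start, end) character span of each token

center = tokens.index("Madrid")
left = max(0, center - 4)
right = min(len(tokens) - 1, center + 4)

# Slice the original sentence between the first and last token of the window.
print(sentence[offsets[left][0]:offsets[right][1]])
# Natural Science Museum of Madrid shows the REC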
I am getting an error message saying that the input should be of type Tensor, not tuple. I don't know how to fix this, because I have already applied return_dict=False as described in the migration guide.
My model is as follows:
class XLNetClassifier(torch.nn.Module):
    def __init__(self, dropout_rate=0.1):
        super(XLNetClassifier, self).__init__()
        self.XLNet = XLNetModel.from_pretrained('xlnet-base-cased', return_dict=False)
        self.d1 = torch.nn.Dropout(dropout_rate)
        self.l1 = torch.nn.Linear(768, 64)
        self.bn1 = torch.nn.LayerNorm(64)
        self.d2 = torch.nn.Dropout(dropout_rate)
        self.l2 = torch.nn.Linear(64, 3)

    def forward(self, input_ids, attention_mask):
        x = self.XLNet(input_ids=input_ids, attention_masks = attention_mask)
        x = self.d1(x)
        x = self.l1(x)
        x = self.bn1(x)
        x = torch.nn.Tanh()(x)
        x = self.d2(x)
        x = self.l2(x)
        return x
The error occurs when dropout is called.
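For reference, a small self-contained check of what the base model actually hands back when return_dict=False; the working assumption behind the error is that dropout is receiving this tuple rather than the hidden-state tensor inside it:

import torch
from transformers import XLNetModel, XLNetTokenizer

tokenizer = XLNetTokenizer.from_pretrained('xlnet-base-cased')
model = XLNetModel.from_pretrained('xlnet-base-cased', return_dict=False)
model.eval()

inputs = tokenizer("a short test sentence", return_tensors="pt")
with torch.no_grad():
    outputs = model(input_ids=inputs["input_ids"], attention_mask=inputs["attention_mask"])

print(type(outputs))            # <class 'tuple'>
last_hidden_state = outputs[0]  # (batch, seq_len, hidden_size); a tensor that dropout can accept
print(last_hidden_state.shape)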
I am using Hugging Face's Sentence-BERT in the following way:
from sentence_transformers import SentenceTransformer
model = SentenceTransformer('all-MiniLM-L6-v2')
model.max_seq_length = 512
model.encode(text)
When text is long and contains more than 512 tokens, no exception is thrown. I assume it automatically truncates the input to 512 tokens.
How can I make it throw an exception when the input length is greater than max_seq_length?
Also, what is the maximum possible max_seq_length for all-MiniLM-L6-v2?
nlp bert-language-model huggingface-transformers huggingface-tokenizers sentence-transformers
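For reference, a sketch of one way to get the strict behaviour asked about: count tokens with the model's underlying tokenizer first and raise before encoding. That the tokenizer is reachable as model.tokenizer, and the helper name encode_strict, are assumptions for illustration.

from sentence_transformers import SentenceTransformer

model = SentenceTransformer('all-MiniLM-L6-v2')
model.max_seq_length = 512

def encode_strict(text):
    # Tokenize first; refuse inputs longer than the configured limit instead of
    # letting encode() silently truncate them.
    n_tokens = len(model.tokenizer(text)["input_ids"])
    if n_tokens > model.max_seq_length:
        raise ValueError(
            f"Input is {n_tokens} tokens, which exceeds max_seq_length={model.max_seq_length}"
        )
    return model.encode(text)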
I want to fine-tune Starcoder (https://huggingface.co/bigcode/starcoder) on my own dataset and on a GCP VM instance.
The documentation says that to train the model they used 512 Tesla A100 GPUs and it took 24 days.
I can also see the model (.bin) files in the Files section on Hugging Face (https://huggingface.co/bigcode/starcoder/tree/main).
The total model size is about 64 GB.
Based on all this information,
deep-learning language-model pytorch huggingface large-language-model