Tag: huggingface-transformers

TypeError: setup() got an unexpected keyword argument 'stage'

I am trying to train my question-answering model with pytorch_lightning. However, when I run trainer.fit(model,data_module), I get the following error:

---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-72-b9cdaa88efa7> in <module>()
----> 1 trainer.fit(model,data_module)

4 frames
/usr/local/lib/python3.7/dist-packages/pytorch_lightning/trainer/trainer.py in _call_setup_hook(self)
   1488 
   1489         if self.datamodule is not None:
-> 1490             self.datamodule.setup(stage=fn)
   1491         self._call_callback_hooks("setup", stage=fn)
   1492         self._call_lightning_module_hook("setup", stage=fn)

TypeError: setup() got an unexpected keyword argument 'stage'


I have installed and imported pytorch_lightning.

I have also defined data_module = BioQADataModule(train_df, val_df, tokenizer, batch_size = BATCH_SIZE), where BATCH_SIZE = 2 and N_EPOCHS = 6.

The model I am using is:

model = T5ForConditionalGeneration.from_pretrained(MODEL_NAME, return_dict=True)

I have also defined a class for the model, as follows:

    class BioQAModel(pl.LightningModule):
    
      def __init__(self):
        super().__init__()
        self.model = T5ForConditionalGeneration.from_pretrained(MODEL_NAME, return_dict=True)
    
      def …
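
The traceback shows Lightning calling self.datamodule.setup(stage=fn), so a setup method declared without a stage parameter would fail in exactly this way. The sketch below reproduces that mechanism in plain Python without importing pytorch_lightning. OldStyleDataModule, NewStyleDataModule, and call_setup_hook are hypothetical stand-ins, and because the body of BioQADataModule is not shown in the question, the idea that its setup lacks stage is an assumption.

```python
class OldStyleDataModule:
    """Hypothetical data module whose setup() takes no `stage` argument."""

    def setup(self):
        self.ready = True


class NewStyleDataModule:
    """Hypothetical data module whose setup() accepts the `stage` keyword."""

    def setup(self, stage=None):
        # `stage` is whatever the caller passes, e.g. "fit" or "test".
        self.stage = stage
        self.ready = True


def call_setup_hook(datamodule, fn="fit"):
    """Stand-in for the trainer call seen in the traceback."""
    datamodule.setup(stage=fn)


new_dm = NewStyleDataModule()
call_setup_hook(new_dm)  # works: the keyword is accepted

try:
    call_setup_hook(OldStyleDataModule())
except TypeError as err:
    error_message = str(err)
    print(error_message)
```

If the assumption holds, changing the signature to def setup(self, stage=None) in the data module would be the corresponding fix.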


python pytorch huggingface-transformers pytorch-lightning

Dar*_*tsu

2022 04-19

4
votes

1
answer

4096
views

Loading a Hugging Face model takes too much memory

I am trying to load a large Hugging Face model with code like the following:

model_from_disc = AutoModelForCausalLM.from_pretrained(path_to_model)
tokenizer_from_disc = AutoTokenizer.from_pretrained(path_to_model)
generator = pipeline("text-generation", model=model_from_disc, tokenizer=tokenizer_from_disc)

The program crashes shortly after the first line because it runs out of memory. Is there a way to chunk the model while loading it so that the program does not crash?

Edit
See cronoik's answer for the accepted solution, but here are the relevant pages in the Hugging Face documentation:

Sharded checkpoints: https://huggingface.co/docs/transformers/big_models#sharded-checkpoints:~:text=in%20the%20future.-,Sharded%20checkpoints,-Since%20version%204.18.0
Large model loading: https://huggingface.co/docs/transformers/main_classes/model#:~:text=the%20weights%20instead.-,Large%20model%20loading,-In%20Transformers%204.20.0
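
To make the idea behind sharded checkpoints concrete, here is a toy sketch in plain Python. It is not the transformers implementation: shard_state_dict, the byte sizes, and the shard file names are all made up for illustration. It groups parameters into shards that stay under a size limit and builds an index from parameter name to shard file, which is what lets a loader read one shard at a time instead of the whole checkpoint.

```python
def shard_state_dict(param_sizes, max_shard_size):
    """Greedily group parameters into shards no larger than max_shard_size.

    param_sizes: dict mapping parameter name -> size in bytes (illustrative).
    Returns (shards, index): a list of name lists, and name -> shard file name.
    A single parameter larger than the limit gets a shard of its own.
    """
    shards, current, current_size = [], [], 0
    for name, size in param_sizes.items():
        # Close the current shard if adding this parameter would overflow it.
        if current and current_size + size > max_shard_size:
            shards.append(current)
            current, current_size = [], 0
        current.append(name)
        current_size += size
    if current:
        shards.append(current)

    index = {}
    for i, names in enumerate(shards, start=1):
        file_name = f"shard-{i:05d}-of-{len(shards):05d}.bin"  # made-up naming
        for name in names:
            index[name] = file_name
    return shards, index


sizes = {
    "embed.weight": 400,
    "layer.0.weight": 300,
    "layer.1.weight": 300,
    "head.weight": 200,
}
shards, index = shard_state_dict(sizes, max_shard_size=600)
for name, file_name in index.items():
    print(name, "->", file_name)
```

In the library itself, the linked pages describe the real options: max_shard_size when saving with save_pretrained, and low_cpu_mem_usage=True when loading with from_pretrained.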

python nlp pytorch huggingface-transformers huggingface

Bud*_*lle

2023 03-15

4
votes

1
answer

8048
views

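Does the default "Trainer" class in HuggingFace transformers use PyTorch or TensorFlow under the hood?

Question

According to the official documentation, the Trainer class "provides an API for feature-complete training in PyTorch for most standard use cases".

However, when I actually try to use Trainer in practice, I get the following message, which seems to suggest that TensorFlow is currently being used under the hood.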

tensorflow/core/platform/cpu_feature_guard.cc:193] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations:  AVX2 FMA
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.

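So which one is it? Does the HuggingFace transformers library use PyTorch or TensorFlow for its internal implementation of Trainer? Is it possible to switch to using only PyTorch? I cannot seem to find a relevant parameter in TrainingArguments.

Why does my script keep printing TensorFlow-related errors? Shouldn't Trainer use PyTorch only?

Source code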

from transformers import GPT2Tokenizer
from transformers import GPT2LMHeadModel
from …
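
The quoted message is an informational log that TensorFlow prints when it is initialized. That suggests TensorFlow is installed and being imported somewhere, not necessarily that Trainer trains with it. The sketch below is hedged: it assumes that TF_CPP_MIN_LOG_LEVEL (read by TensorFlow's C++ logging) and USE_TF / USE_TORCH (read by transformers releases that support both backends) are honored by the installed versions. The variables have to be set before the libraries are imported.

```python
import os

# Hide TensorFlow's INFO-level C++ log lines such as the oneDNN notice.
os.environ["TF_CPP_MIN_LOG_LEVEL"] = "1"

# Ask transformers to use only the PyTorch backend (assumption: the installed
# transformers version still honors these variables).
os.environ["USE_TF"] = "0"
os.environ["USE_TORCH"] = "1"

# Import transformers only after the variables are set, for example:
# from transformers import Trainer, TrainingArguments
```

Other packages pulled in during training may still import TensorFlow on their own, so this may reduce the messages without removing every one of them.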


python tensorflow pytorch huggingface-transformers

Ala*_*ACK

2023 03-26

4
votes

1
answer

1669
views

What does the text_target parameter do in Huggingface's AutoTokenizer?

I am following the guide here: https://huggingface.co/docs/transformers/v4.28.1/tasks/summarization
There is a line in the guide like this:

labels = tokenizer(text_target=examples["summary"], max_length=128, truncation=True)

I do not understand what the text_target parameter does.

I tried the following code, and the last two lines gave exactly the same result.

from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained('t5-small')
text = "Weiter Verhandlung in Syrien."
tokenizer(text_target=text, max_length=128, truncation=True)
tokenizer(text, max_length=128, truncation=True)

The documentation only says "text_target (str, List[str], List[List[str]], optional): The sequence or batch of sequences to be encoded as target texts.", which I do not really understand. Are there situations where setting text_target gives a different result?

python huggingface-transformers huggingface

Bet*_*tty

2023 05-04

4
votes

1
answer

3940
views

How to fix "Trainer: evaluation requires an eval_dataset" in Huggingface Transformers?

I'm trying to fine-tune without an evaluation dataset.
For that, I'm using the following code:

training_args = TrainingArguments(
    output_dir=resume_from_checkpoint,
    evaluation_strategy="epoch",
    per_device_train_batch_size=1,
)
def compute_metrics(pred: EvalPrediction):
    labels = pred.label_ids
    preds = pred.predictions.argmax(-1)
    f1 = f1_score(labels, preds, average="weighted")
    acc = accuracy_score(labels, preds, average="weighted")
    return {"accuracy": acc, "f1": f1}
trainer = Trainer(
    model=self.nli_model,
    args=training_args,
    train_dataset=tokenized_datasets,
    compute_metrics=compute_metrics,
)

However, I get

ValueError: Trainer: evaluation requires an eval_dataset

I thought that by default Trainer does no evaluation... at least, that is the impression I got from the docs...
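
The code above sets evaluation_strategy="epoch", which asks for an evaluation pass every epoch, so the error is consistent with no eval_dataset being passed to Trainer. Below is a minimal sketch of the two ways out, written as plain keyword dictionaries so that it runs without transformers installed. "out_dir" is a placeholder, and newer transformers releases may spell the argument eval_strategy, so check the installed version.

```python
# Option 1: train without any evaluation.
no_eval_kwargs = {
    "output_dir": "out_dir",          # placeholder path
    "evaluation_strategy": "no",      # disables evaluation during training
    "per_device_train_batch_size": 1,
}

# Option 2: keep per-epoch evaluation; an eval_dataset must then be given
# to Trainer alongside train_dataset.
per_epoch_eval_kwargs = dict(no_eval_kwargs, evaluation_strategy="epoch")

# Intended use (not executed here):
# training_args = TrainingArguments(**no_eval_kwargs)
print(no_eval_kwargs["evaluation_strategy"],
      per_epoch_eval_kwargs["evaluation_strategy"])
```

With option 1 the compute_metrics function is simply never called during training.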

python pre-trained-model pytorch huggingface-transformers huggingface-trainer

An *_*ea.

2023 05-23

4
votes

1
answer

3638
views

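How to do Tokenizer batch processing? - HuggingFace

In Huggingface's Tokenizer documentation, the call function accepts List[List[str]] and says:

text (str, List[str], List[List[str]], optional): The sequence or batch of sequences to be encoded. Each sequence can be a string or a list of strings (pretokenized string). If the sequences are provided as a list of strings (pretokenized), you must set is_split_into_words=True (to lift the ambiguity with a batch of sequences).

If I run the following, everything works fine: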

 test = ["hello this is a test", "that transforms a list of sentences", "into a list of list of sentences", "in order to emulate, in this case, two batches of the same lenght", "to be tokenized by the hf tokenizer for the defined model"]
 tokenizer = AutoTokenizer.from_pretrained('distilbert-base-uncased-finetuned-sst-2-english')
 tokenized_test = tokenizer(text=test, padding="max_length", is_split_into_words=False, truncation=True, return_tensors="pt")

But if I try to emulate batches of sentences:

…
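
According to the documentation quoted above, a nested list is read as pre-tokenized words unless the call is disambiguated, so handing the tokenizer a list of batches is unlikely to be understood as "several batches". One way to get batches, sketched here in plain Python, is to chunk the flat sentence list and tokenize each chunk on its own. chunked is a hypothetical helper, and the tokenizer call is left as a comment so the sketch runs without transformers.

```python
def chunked(items, batch_size):
    """Yield consecutive slices of `items` with at most `batch_size` elements."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]


sentences = [
    "hello this is a test",
    "that transforms a list of sentences",
    "into a list of list of sentences",
    "in order to emulate, in this case, two batches of the same lenght",
    "to be tokenized by the hf tokenizer for the defined model",
]

batches = list(chunked(sentences, batch_size=2))
for batch in batches:
    # Each batch is a flat List[str], so it could be passed as, e.g.:
    # tokenizer(text=batch, padding=True, truncation=True, return_tensors="pt")
    print(len(batch), batch[0])
```

Each chunk stays a flat list of strings, which matches the List[str] case in the documentation and needs no is_split_into_words flag.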

tokenize batch-processing pytorch huggingface-transformers huggingface-tokenizers

Luc*_*edo

2023 06-07

4
votes

1
answer

6973
views

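ValueError: Tokenizer class LlamaTokenizer does not exist or is not currently imported

I am trying to run the code from this Hugging Face blog. At first I could not access the model, which gave this error: OSError: meta-llama/Llama-2-7b-chat-hf is not a local folder. That is now solved: I created a valid access token from Hugging Face. Now I get a different error when running the following code: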

from transformers import AutoTokenizer
import transformers
import torch

model = "meta-llama/Llama-2-7b-chat-hf"

tokenizer = AutoTokenizer.from_pretrained(model, use_auth_token=True)
pipeline = transformers.pipeline(
    "text-generation",
    model=model,
    torch_dtype=torch.float16,
    device_map="auto",
)

sequences = pipeline(
    'I liked "Breaking Bad" and "Band of Brothers". Do you have any recommendations of other shows I might like?\n',
    do_sample=True,
    top_k=10,
    num_return_sequences=1,
    eos_token_id=tokenizer.eos_token_id,
    max_length=200,
)
for seq in sequences:
    print(f"Result: {seq['generated_text']}")

Error:

ValueError: Tokenizer class LlamaTokenizer does not exist or …
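
An error like this is often a sign that the installed transformers release predates the tokenizer class being requested. Below is a small stdlib-only sketch for checking the installed version. The threshold (4, 28, 0) is my assumption for when the Llama classes first shipped and should be confirmed against the release notes. parse_version is a hypothetical helper.

```python
from importlib import metadata


def parse_version(text):
    """Turn '4.31.0' or '4.28.0.dev0' into a tuple of its leading integers."""
    parts = []
    for piece in text.split("."):
        if not piece.isdigit():
            break
        parts.append(int(piece))
    return tuple(parts)


MIN_VERSION = (4, 28, 0)  # assumed threshold, verify against release notes

try:
    installed = metadata.version("transformers")
except metadata.PackageNotFoundError:
    installed = None

if installed is None:
    print("transformers is not installed")
elif parse_version(installed) < MIN_VERSION:
    print(f"transformers {installed} may be too old; consider upgrading")
else:
    print(f"transformers {installed} meets the assumed minimum")
```

If the version turns out to be recent enough, the cause is probably elsewhere, for example a missing optional dependency of the tokenizer.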


python huggingface-transformers huggingface llama

Qui*_*ten

lucky-day

4
votes

1
answer

6099
views

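Fast and slow tokenizers yield different results

Using HuggingFace's pipeline tool, I was surprised to find a significant difference in output between the fast and the slow tokenizer.

Specifically, when I run the fill-mask pipeline, the probabilities assigned to the words that would fill in the mask differ between the fast and the slow tokenizer. Moreover, while the predictions of the fast tokenizer stay the same regardless of the number and length of the input sentences, that is not the case for the slow tokenizer.

Here is a minimal example: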

from transformers import pipeline

slow = pipeline('fill-mask', model='bert-base-cased', \
                tokenizer=('bert-base-cased', {"use_fast": False}))

fast = pipeline('fill-mask', model='bert-base-cased', \
                tokenizer=('bert-base-cased', {"use_fast": True}))

s1 = "This is a short and sweet [MASK]."  # "example"
s2 = "This is [MASK]."  # "shorter"

slow([s1, s2])
fast([s1, s2])
slow([s2])
fast([s2])

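Each pipeline call yields the top 5 tokens that could fill in for [MASK], along with their probabilities. I have omitted the actual outputs for brevity, but the probabilities assigned to each word that fills in [MASK] for s2 are not the same across all of the examples. The last 3 examples give the same probabilities, but the first one yields different probabilities. The difference is so large that the top 5 do not agree between the two groups.

As far as I can tell, the reason is that the fast and slow tokenizers return different outputs. The fast tokenizer standardizes the sequence length to 512 by padding with 0s, and then creates an attention mask that blocks out the padding. In contrast, the slow tokenizer pads only to the length of the longest sequence and does not create such an attention mask. Instead, it sets the token type id of the padding to 1 (rather than 0, which is the type of the non-padding tokens). By my understanding of HuggingFace's implementation (found here), these are not equivalent.

Does anyone know whether this is intentional?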
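
The explanation above turns on two details: how far the sequences are padded, and whether an attention mask marks the padding. The sketch below shows both padding strategies in plain Python with made-up token ids. pad_batch is a hypothetical helper, not the tokenizer's code, and the sketch makes no claim about what either HuggingFace tokenizer actually does.

```python
def pad_batch(sequences, pad_id=0, max_length=None):
    """Pad token-id lists to a common length and build attention masks.

    If max_length is None, pad to the longest sequence in the batch;
    otherwise pad every sequence to max_length.
    """
    if max_length is not None:
        target = max_length
    else:
        target = max(len(seq) for seq in sequences)

    input_ids, attention_mask = [], []
    for seq in sequences:
        n_pad = target - len(seq)
        if n_pad < 0:
            raise ValueError("sequence longer than max_length")
        input_ids.append(list(seq) + [pad_id] * n_pad)
        # 1 marks a real token, 0 marks padding the model should ignore.
        attention_mask.append([1] * len(seq) + [0] * n_pad)
    return input_ids, attention_mask


# Made-up token ids standing in for the two sentences s1 and s2.
batch = [[101, 11, 12, 13, 14, 102], [101, 11, 102]]

ids_longest, mask_longest = pad_batch(batch)
ids_fixed, mask_fixed = pad_batch(batch, max_length=8)
print(mask_longest)
print(mask_fixed)
```

If a model receives padded ids without the matching mask, the padding positions take part in attention, which is one way the same sentence can get different probabilities depending on what else is in the batch.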

python nlp bert-language-model huggingface-transformers huggingface-tokenizers

Mic*_*ael

2020 06-17

3
votes

1
answer

1510
views

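Add additional layers to the Huggingface transformers

I would like to add an additional Dense layer after the pretrained TFDistilBertModel, TFXLNetModel and TFRobertaModel Huggingface models. I have already seen how to do this with TFBertModel, for example in this notebook: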

output = bert_model([input_ids,attention_masks])
output = output[1]
output = tf.keras.layers.Dense(32,activation='relu')(output)

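So here I need to use the second item (that is, the item with index 1) of the BERT output tuple. According to the docs, TFBertModel has pooler_output at this tuple index. But the other three models do not have pooler_output.

So, how can I add additional layers to the outputs of the other three models?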
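
One model-agnostic option, offered as an assumption and not as the accepted answer, is to skip pooler_output and pool the per-token hidden states (the first item of the output) yourself, then feed the pooled vector to the Dense layer. The sketch below shows masked mean pooling on nested Python lists so that it runs without TensorFlow. masked_mean_pool and the numbers are illustrative only.

```python
def masked_mean_pool(hidden_states, attention_mask):
    """Average token vectors per example, ignoring padded positions.

    hidden_states: [batch][seq_len][hidden] nested lists of floats.
    attention_mask: [batch][seq_len] with 1 for real tokens, 0 for padding.
    Returns [batch][hidden].
    """
    pooled = []
    for token_vectors, mask in zip(hidden_states, attention_mask):
        hidden_size = len(token_vectors[0])
        totals = [0.0] * hidden_size
        n_real = sum(mask)
        for vector, keep in zip(token_vectors, mask):
            if keep:
                for i, value in enumerate(vector):
                    totals[i] += value
        # max(..., 1) avoids dividing by zero for an all-padding row.
        pooled.append([total / max(n_real, 1) for total in totals])
    return pooled


hidden = [
    [[1.0, 2.0], [3.0, 4.0], [9.0, 9.0]],   # last position is padding
    [[2.0, 0.0], [4.0, 2.0], [6.0, 4.0]],
]
mask = [[1, 1, 0], [1, 1, 1]]
pooled = masked_mean_pool(hidden, mask)
print(pooled)
```

In a Keras model the same idea would be applied to output[0] with tensor operations before the Dense layer. Which pooling works best depends on the model and the task.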

python nlp keras tensorflow huggingface-transformers

kon*_*cov

lucky-day

3
votes

1
answer

3169
views

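Using Huggingface Transformers without IPyWidgets

I am trying to use the Huggingface Transformers library in a hosted Jupyter notebook platform called Deepnote. I want to download a model through the pipeline class, but unfortunately Deepnote does not support IPyWidgets. Is there a way to disable IPyWidgets when using Transformers? Specifically, for the following command.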


classifier = pipeline("zero-shot-classification")

And the error I receive.

ImportError: IProgress not found. Please update jupyter and ipywidgets. See https://ipywidgets.readthedocs.io/en/stable/user_install.html

Note: installing IPyWidgets is not an option

python jupyter-notebook ipywidgets huggingface-transformers deepnote

Jos*_*bel

lucky-day

3
votes

1
answer

355
views

Tag statistics

huggingface-transformers ×10

python ×9

pytorch ×5

huggingface ×3

nlp ×3

huggingface-tokenizers ×2

tensorflow ×2

batch-processing ×1

bert-language-model ×1

deepnote ×1

huggingface-trainer ×1

ipywidgets ×1

jupyter-notebook ×1

keras ×1

llama ×1

pre-trained-model ×1

pytorch-lightning ×1

tokenize ×1
