Tags: distilbert

What does "Using bos_token, but it is not set yet" mean?

When I run demo.py:

from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("distilbert-base-multilingual-cased")
model = AutoModel.from_pretrained("distilbert-base-multilingual-cased", return_dict=True)
# print(model)

def count_parameters(model):
    return sum(p.numel() for p in model.parameters() if p.requires_grad)

print(count_parameters(model))

inputs = tokenizer("史密斯先生不在,他去看电影了。Mr Smith is not in. He ________ ________to the cinema", return_tensors="pt")
print(inputs)
outputs = model(**inputs)
print(outputs)

The code prints:

{'input_ids': tensor([[  101,  2759,  3417,  4332,  2431,  5600,  2080,  3031, 10064,  2196,
          2724,  5765,  5614,  3756,  2146,  1882, 12916, 11673, 10124, 10472,
         10106,   119, 10357,   168,   168,   168,   168,   168,   168,   168,
           168,   168,   168,   168,   168,   168, …
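The message itself is harmless: something in the pipeline asked the tokenizer for its `bos_token`, but BERT/DistilBERT vocabularies have no BOS token at all (they use `[CLS]`/`[SEP]` instead, which are the ids 101/119-style markers visible in the printed `input_ids`), so the library logs the message and falls back to `None`. A minimal sketch of that behavior, using a toy class that is *not* the actual transformers implementation:

```python
import logging

logger = logging.getLogger("tokenizer_sketch")

class TokenizerSketch:
    """Toy stand-in for a tokenizer whose vocabulary defines no BOS token.

    Loosely mimics how the transformers tokenizer base class handles
    special-token attributes: accessing an unset one logs a message and
    yields None instead of raising an error.
    """
    def __init__(self, bos_token=None, cls_token="[CLS]"):
        self._bos_token = bos_token
        self._cls_token = cls_token

    @property
    def bos_token(self):
        if self._bos_token is None:
            # This is the situation the log message describes: something
            # asked for bos_token, but one was never set for this model.
            logger.warning("Using bos_token, but it is not set yet.")
            return None
        return self._bos_token

tok = TokenizerSketch()   # DistilBERT-style: no BOS, only [CLS]/[SEP]
print(tok.bos_token)      # logs the message, prints None
```

Since tokenization still brackets the sentence with `[CLS]`/`[SEP]`, the script's output is unaffected and the message can be ignored.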

multilingual distilbert huggingface-transformers huggingface-tokenizers

7 votes · 1 answer · 10k views

Generating text with Huggingface's DistilBERT model

I have been struggling with Huggingface's DistilBERT model for a while, because the documentation seems very unclear and their examples (e.g. https://github.com/huggingface/transformers/blob/master/notebooks/Comparing-TF-and-PT-models-MLM-NSP.ipynb and https://github.com/huggingface/transformers/tree/master/examples/distillation) are very dense, and what they show does not seem well documented.

I was wondering whether anyone here has experience with this and knows of some good code examples for basic in-Python use of their models, namely:

  • How to properly decode the model's output into actual text (no matter how I reshape it, the tokenizer seems willing to decode it and always produces some sequence of [UNK] tokens)

  • How to actually use their scheduler + optimizer to train the model on a simple text-to-text task.
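One likely cause of the endless `[UNK]` streams: `AutoModel` returns hidden states (768-dimensional embeddings), not vocabulary scores, so decoding them as token ids yields garbage. Decoding only makes sense on the logits of a model with a language-modeling head (e.g. `AutoModelForMaskedLM`), taking the argmax over the vocabulary axis per position. A toy sketch of that argmax-then-lookup step, with a hypothetical six-word vocabulary standing in for the real tokenizer:

```python
# Hypothetical 6-token vocabulary, for illustration only.
vocab = ["[UNK]", "[CLS]", "[SEP]", "he", "went", "cinema"]

def greedy_decode(logits, vocab):
    """Pick the highest-scoring vocab id per position (argmax),
    mapping any out-of-range id to [UNK]."""
    ids = [max(range(len(row)), key=row.__getitem__) for row in logits]
    return [vocab[i] if 0 <= i < len(vocab) else "[UNK]" for i in ids]

# Fake per-position scores over the 6-word vocab. In transformers these
# would be outputs.logits of shape [batch, seq_len, vocab_size] from a
# *ForMaskedLM model, not the hidden states that AutoModel returns.
logits = [
    [0.1, 0.9, 0.0, 0.2, 0.1, 0.0],   # -> [CLS]
    [0.0, 0.1, 0.0, 0.8, 0.1, 0.2],   # -> he
    [0.0, 0.0, 0.0, 0.1, 0.9, 0.3],   # -> went
]
print(greedy_decode(logits, vocab))   # ['[CLS]', 'he', 'went']
```

With the real library the equivalent would be roughly `tokenizer.decode(outputs.logits.argmax(-1)[0])`, though note that DistilBERT is a masked LM, not an autoregressive text generator.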

nlp machine-learning pytorch distilbert huggingface-transformers

6 votes · 1 answer · 300 views

ValueError: TextEncodeInput must be Union[TextInputSequence, Tuple[InputSequence, InputSequence]] - Tokenizing BERT / Distilbert Error

import pandas as pd
from sklearn.model_selection import train_test_split
from transformers import DistilBertTokenizerFast

def split_data(path):
    df = pd.read_csv(path)
    return train_test_split(df, test_size=0.1, random_state=100)

train, test = split_data(DATA_DIR)
train_texts, train_labels = train['text'].to_list(), train['sentiment'].to_list()
test_texts, test_labels = test['text'].to_list(), test['sentiment'].to_list()

train_texts, val_texts, train_labels, val_labels = train_test_split(train_texts, train_labels, test_size=0.1, random_state=100)

tokenizer = DistilBertTokenizerFast.from_pretrained('distilbert-base-uncased')

train_encodings = tokenizer(train_texts, truncation=True, padding=True)
val_encodings = tokenizer(val_texts, truncation=True, padding=True)
test_encodings = tokenizer(test_texts, truncation=True, padding=True)

When I tried to tokenize the text lists taken from the dataframe with the BERT/DistilBERT tokenizer, I got the above error.
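A common cause of this `ValueError` with the fast (Rust-backed) tokenizers is that one of the list entries is not a string, typically a `float('nan')` that pandas substitutes for an empty CSV cell, so the tokenizer receives something that is neither a `TextInputSequence` nor a pair of them. A hedged sketch of a pre-filter that drops such rows before tokenizing (function name and sample data are made up for illustration):

```python
def clean_texts_and_labels(texts, labels):
    """Drop any (text, label) pair whose text is not a plain string,
    e.g. the float('nan') pandas produces for empty CSV cells.

    The fast tokenizer raises
    'TextEncodeInput must be Union[TextInputSequence, ...]'
    when it meets a non-string entry in the input list.
    """
    cleaned = [(t, l) for t, l in zip(texts, labels) if isinstance(t, str)]
    texts_out = [t for t, _ in cleaned]
    labels_out = [l for _, l in cleaned]
    return texts_out, labels_out

texts = ["great movie", float("nan"), "terrible plot", None]
labels = [1, 0, 0, 1]
texts, labels = clean_texts_and_labels(texts, labels)
print(texts)   # ['great movie', 'terrible plot']
print(labels)  # [1, 0]
```

Alternatively, `df.dropna(subset=['text'])` before splitting removes the offending rows at the source.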

tokenize bert-language-model distilbert huggingface-transformers huggingface-tokenizers

5 votes · 4 answers · 6751 views