Tags: python, nlp, deep-learning, keras, tensorflow
I am working on an NLP classification problem, trying to classify training courses into 99 categories. I managed to build a few models, including a Bayes classifier, but it only reaches 55% accuracy (very bad).
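For context, that kind of baseline is typically just a TF-IDF bag-of-words representation fed into a multinomial Naive Bayes classifier. A simplified sketch, where `texts` and `y` are placeholders for the raw course titles and their categories (not the actual code used here):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline
from sklearn.model_selection import cross_val_score

# Sketch of a TF-IDF + Naive Bayes baseline; `texts` and `y` stand in for
# the raw course titles and their 99 categories.
baseline = make_pipeline(TfidfVectorizer(), MultinomialNB())
print(cross_val_score(baseline, texts, y, cv=5, scoring="accuracy").mean())
```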
Given these results, I tried to fine-tune a CamemBERT model (my data is in French) to improve on them, but I had never used these methods before, so I tried to follow this example and adapt it to my code.
In that example there are 2 labels, while I have 99.
I kept some parts unchanged:
```python
import torch
from transformers import CamembertTokenizer

epochs = 5
MAX_LEN = 128
batch_size = 16
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
tokenizer = CamembertTokenizer.from_pretrained('camembert-base', do_lower_case=True)
```

I kept the same variable names: `text` holds the feature column and `labels` holds the labels.
```python
text = training['Intitulé (Ce champ doit respecter la nomenclature suivante : Code action – Libellé)_x']
labels = training['Domaine sou domaine ']
```

I tokenized and padded the sequences with the same values as in the example, because I do not know which values suit my data:
```python
from tensorflow.keras.preprocessing.sequence import pad_sequences  # tf.keras.utils.pad_sequences in newer TF

# Use the tokenizer to convert the sentences into token ids
input_ids = [tokenizer.encode(sent, add_special_tokens=True, max_length=MAX_LEN) for sent in text]

# Pad our input tokens (pad_sequences pads with the value 0)
input_ids = pad_sequences(input_ids, maxlen=MAX_LEN, dtype="long", truncating="post", padding="post")

# Create attention masks: 1 for each real token, 0 for padding
attention_masks = []
for seq in input_ids:
    seq_mask = [float(i > 0) for i in seq]
    attention_masks.append(seq_mask)
```
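A quick way to check whether `MAX_LEN = 128` actually fits this data is to look at the distribution of tokenized lengths. A small sketch, reusing the `text` and `tokenizer` defined above:

```python
import numpy as np

# Sketch: how long are the tokenized titles, and how many get truncated at MAX_LEN?
lengths = [len(tokenizer.encode(sent, add_special_tokens=True)) for sent in text]
print("max length:", max(lengths))
print("95th percentile:", np.percentile(lengths, 95))
print("examples longer than MAX_LEN:", sum(l > MAX_LEN for l in lengths))
```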
I noticed that the labels in the example are numeric, so I used this code to turn my labels into numbers:

```python
label_map = {label: i for i, label in enumerate(set(labels))}
numeric_labels = [label_map[label] for label in labels]
labels = numeric_labels
```
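One caveat with this mapping: iterating over `set(labels)` is not deterministic across runs, so the label ids can change from one execution to the next. A more stable variant, sketched on the original string labels:

```python
# Sketch: build a reproducible label -> id mapping by fixing the order
label_map = {label: i for i, label in enumerate(sorted(set(labels)))}
numeric_labels = [label_map[label] for label in labels]

# Equivalent alternative with scikit-learn
from sklearn.preprocessing import LabelEncoder
encoder = LabelEncoder()
numeric_labels = encoder.fit_transform(labels).tolist()
```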
I then started building the model, beginning with the tensors:

```python
from sklearn.model_selection import train_test_split
from torch.utils.data import TensorDataset, DataLoader, RandomSampler, SequentialSampler
from transformers import CamembertForSequenceClassification

# Use train_test_split to split our data into train and validation sets for training
train_inputs, validation_inputs, train_labels, validation_labels = train_test_split(
    input_ids, labels, random_state=42, test_size=0.1
)

train_masks, validation_masks = train_test_split(
    attention_masks, random_state=42, test_size=0.1
)

# Convert the data to torch tensors
train_inputs = torch.tensor(train_inputs)
validation_inputs = torch.tensor(validation_inputs)
train_labels = torch.tensor(train_labels)
validation_labels = torch.tensor(validation_labels)
train_masks = torch.tensor(train_masks)
validation_masks = torch.tensor(validation_masks)

# Create data loaders
train_data = TensorDataset(train_inputs, train_masks, train_labels)
train_sampler = RandomSampler(train_data)
train_dataloader = DataLoader(train_data, sampler=train_sampler, batch_size=batch_size)

validation_data = TensorDataset(validation_inputs, validation_masks, validation_labels)
validation_sampler = SequentialSampler(validation_data)
validation_dataloader = DataLoader(validation_data, sampler=validation_sampler, batch_size=batch_size)

# Define the model architecture
model = CamembertForSequenceClassification.from_pretrained('camembert-base', num_labels=99)

# Move the model to the appropriate device
model.to(device)
```
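Since `input_ids`, `labels`, and `attention_masks` have to stay row-aligned, the split can also be done in a single call instead of relying on the two separate calls shuffling identically. A sketch with the same variable names:

```python
# Sketch: split ids, labels and masks together so the rows stay aligned
(train_inputs, validation_inputs,
 train_labels, validation_labels,
 train_masks, validation_masks) = train_test_split(
    input_ids, labels, attention_masks, random_state=42, test_size=0.1
)
```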
The output is:

```
CamembertForSequenceClassification(
  (roberta): RobertaModel(
    (embeddings): RobertaEmbeddings(
      (word_embeddings): Embedding(32005, 768, padding_idx=1)
      (position_embeddings): Embedding(514, 768, padding_idx=1)
      (token_type_embeddings): Embedding(1, 768)
      (LayerNorm): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
      (dropout): Dropout(p=0.1, inplace=False)
    )
    (encoder): RobertaEncoder(
      (layer): ModuleList(
        (0-11): 12 x RobertaLayer(
          (attention): RobertaAttention(
            (self): RobertaSelfAttention(
              (query): Linear(in_features=768, out_features=768, bias=True)
              (key): Linear(in_features=768, out_features=768, bias=True)
              (value): Linear(in_features=768, out_features=768, bias=True)
              (dropout): Dropout(p=0.1, inplace=False)
            )
            (output): RobertaSelfOutput(
              (dense): Linear(in_features=768, out_features=768, bias=True)
              (LayerNorm): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
              (dropout): Dropout(p=0.1, inplace=False)
            )
          )
          (intermediate): RobertaIntermediate(
            (dense): Linear(in_features=768, out_features=3072, bias=True)
            (intermediate_act_fn): GELUActivation()
          )
          (output): RobertaOutput(
            (dense): Linear(in_features=3072, out_features=768, bias=True)
            (LayerNorm): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
            (dropout): Dropout(p=0.1, inplace=False)
          )
        )
      )
    )
  )
  (classifier): RobertaClassificationHead(
    (dense): Linear(in_features=768, out_features=768, bias=True)
    (dropout): Dropout(p=0.1, inplace=False)
    (out_proj): Linear(in_features=768, out_features=99, bias=True)
  )
)
```

Then I went on to set up the optimizer and the training loop:
```python
import numpy as np
from tqdm import trange
from transformers import AdamW  # torch.optim.AdamW also works in newer versions

param_optimizer = list(model.named_parameters())
# note: AdamW expects the key 'weight_decay'; 'weight_decay_rate' is silently ignored
optimizer_grouped_parameters = [{'params': [p for n, p in param_optimizer], 'weight_decay_rate': 0.01}]
optimizer = AdamW(optimizer_grouped_parameters, lr=2e-5, eps=10e-8)  # note: 10e-8 is 1e-7, not 1e-8

# Function to calculate the accuracy of our predictions vs labels
def flat_accuracy(preds, labels):
    pred_flat = np.argmax(preds, axis=1).flatten()
    labels_flat = labels.flatten()
    return np.sum(pred_flat == labels_flat) / len(labels_flat)

train_loss_set = []

# trange is a tqdm wrapper around the normal python range
for _ in trange(epochs, desc="Epoch"):
    # Tracking variables for training
    tr_loss = 0
    nb_tr_examples, nb_tr_steps = 0, 0

    # Train the model
    model.train()
    for step, batch in enumerate(train_dataloader):
        # Add batch to device CPU or GPU
        batch = tuple(t.to(device) for t in batch)
        # Unpack the inputs from our dataloader
        b_input_ids, b_input_mask, b_labels = batch
        # Clear out the gradients (by default they accumulate)
        optimizer.zero_grad()
        # Forward pass
        outputs = model(b_input_ids, token_type_ids=None, attention_mask=b_input_mask, labels=b_labels)
        # Get loss value
        loss = outputs[0]
        # Add it to the train loss list
        train_loss_set.append(loss.item())
        # Backward pass
        loss.backward()
        # Update parameters and take a step using the computed gradient
        optimizer.step()

        # Update tracking variables
        tr_loss += loss.item()
        nb_tr_examples += b_input_ids.size(0)
        nb_tr_steps += 1

    print("Train loss: {}".format(tr_loss / nb_tr_steps))

    # Tracking variables for validation
    eval_loss, eval_accuracy = 0, 0
    nb_eval_steps, nb_eval_examples = 0, 0
    # Validation of the model
    model.eval()
    # Evaluate data for one epoch
    for batch in validation_dataloader:
        # Add batch to device CPU or GPU
        batch = tuple(t.to(device) for t in batch)
        # Unpack the inputs from our dataloader
        b_input_ids, b_input_mask, b_labels = batch
        # Telling the model not to compute or store gradients, saving memory and speeding up validation
        with torch.no_grad():
            # Forward pass, calculate logit predictions
            outputs = model(b_input_ids, token_type_ids=None, attention_mask=b_input_mask, labels=b_labels)
            loss, logits = outputs[:2]

        # Move logits and labels to CPU if a GPU is used
        logits = logits.detach().cpu().numpy()
        label_ids = b_labels.to('cpu').numpy()

        tmp_eval_accuracy = flat_accuracy(logits, label_ids)

        eval_accuracy += tmp_eval_accuracy
        nb_eval_steps += 1

    print("Validation Accuracy: {}".format(eval_accuracy / nb_eval_steps))
```

The code runs, but the accuracy is only 30%, which is much worse than the Bayes classifier built with a very simple algorithm and direct computation. This makes me think I must be fine-tuning the model incorrectly, but I do not understand fine-tuning well enough to know where I went wrong.
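For reference, fine-tuning loops of this kind commonly also use a linear learning-rate schedule with warmup and gradient clipping; neither appears in the loop above. A minimal sketch of those two additions, assuming the `optimizer`, `epochs`, and `train_dataloader` defined above:

```python
from transformers import get_linear_schedule_with_warmup

# Sketch: one scheduler step per optimizer step over the whole run
total_steps = len(train_dataloader) * epochs
scheduler = get_linear_schedule_with_warmup(
    optimizer, num_warmup_steps=0, num_training_steps=total_steps
)

# Inside the training loop, after loss.backward():
torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
optimizer.step()
scheduler.step()
```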
Answer from 小智:
I am currently working on some sequence classification tasks myself, and a few things I noticed during training might help in your case.
Truncation: if a sentence is longer than 128 tokens (MAX_LEN) and you truncate it, the model essentially only sees part of the data point (part of the string, because any string longer than 128 tokens gets cut off).
Although this is a hack I use and it seems realistic to me, what you can actually do is the following -