Tag: bert-language-model

Getting the value behind "[UNK]" in BERT

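I designed a BERT-based model to solve an NER task. I am using the transformers library with the pretrained model "dccuchile/bert-base-spanish-wwm-cased". The problem arises when my model detects an entity but the token is '[UNK]'. How can I find out which string is behind that token?

I know an unknown token cannot be converted back to the original one, but I would like to at least capture the value before passing the input to the model.

The code is quite simple: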

    sentenceIds = tokenizer.encode(sentence,add_special_tokens = True)

    inputs = pad_sequences([sentenceIds], maxlen=256, dtype="long", 
                              value=0, truncating="post", padding="post")

    att_mask = torch.tensor([[int(token_id > 0) for token_id in inputs[0]]]).to(device)
    inputs = torch.tensor(inputs).to(device)

    with torch.no_grad():        
        outputs = model(inputs, 
                          token_type_ids=None, 
                          attention_mask=att_mask)

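As you can see, it is straightforward: tokenize, pad or truncate, create the attention mask, and call the model.

I have tried using regex to find the two tokens around the unknown one, and similar tricks, but I could not solve it properly.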
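One way to capture the text behind each [UNK] before the ids reach the model is to keep character offsets alongside the tokens. The sketch below uses a toy word/punctuation tokenizer and a made-up vocabulary (`VOCAB`, `tokenize_with_offsets`, and `unknown_spans` are illustrative names, not part of transformers) purely to show the idea. With a fast Hugging Face tokenizer, calling it with `return_offsets_mapping=True` should provide comparable character spans.

```python
import re

# Toy vocabulary standing in for the real WordPiece vocab (hypothetical).
VOCAB = {"[CLS]", "[SEP]", "[UNK]", "el", "gato", "come", "."}


def tokenize_with_offsets(sentence):
    """Split into words/punctuation and record each token's character span."""
    tokens, offsets = [], []
    for match in re.finditer(r"\w+|[^\w\s]", sentence):
        piece = match.group().lower()
        # Anything outside the vocabulary collapses to [UNK], as in BERT.
        tokens.append(piece if piece in VOCAB else "[UNK]")
        offsets.append((match.start(), match.end()))
    return tokens, offsets


def unknown_spans(sentence):
    """Return the original substrings that were mapped to [UNK]."""
    tokens, offsets = tokenize_with_offsets(sentence)
    return [sentence[start:end]
            for token, (start, end) in zip(tokens, offsets)
            if token == "[UNK]"]


print(unknown_spans("El gato come sushi."))
```

Because the offsets index into the untouched input string, the original casing and characters are recovered even though the token itself carries no information.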

python-3.x pytorch bert-language-model huggingface-transformers

5 votes, 1 answer, 8464 views

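Converting a BERT model to TFLite

I have code for a semantic search engine built with a pretrained BERT model. I want to convert this model to TFLite so that I can deploy it to Google ML Kit. Is this conversion possible, and if so, how do I do it? It appears to be possible, since the official TensorFlow site mentions it: https://www.tensorflow.org/lite/convert. But I don't know where to start.

Code: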


    from sentence_transformers import SentenceTransformer

    # Load the BERT model. Various models trained on Natural Language Inference (NLI) https://github.com/UKPLab/sentence-transformers/blob/master/docs/pretrained-models/nli-models.md and 
    # Semantic Textual Similarity are available https://github.com/UKPLab/sentence-transformers/blob/master/docs/pretrained-models/sts-models.md

    model = SentenceTransformer('bert-base-nli-mean-tokens')

    # A corpus is a list with documents split by sentences.

    sentences = ['Absence of sanity', 
                 'Lack of saneness',
                 'A man is eating food.',
                 'A man is eating a piece of bread.',
                 'The girl is carrying a baby.',
                 'A man …

python tensorflow tensorflow-lite bert-language-model

5 votes, 1 answer, 4493 views

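Getting the embedding lookup result from BERT

Before passing my tokens through BERT, I want to do some processing on their embeddings (the result of the embedding lookup layer). The HuggingFace BERT TensorFlow implementation lets us access the output of the embedding lookup with: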

    import tensorflow as tf
    from transformers import BertConfig, BertTokenizer, TFBertModel

    bert_tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')

    input_ids = tf.constant(bert_tokenizer.encode("Hello, my dog is cute", add_special_tokens=True))[None, :]
    attention_mask = tf.stack([tf.ones(shape=(len(sent),)) for sent in input_ids])
    token_type_ids = tf.stack([tf.ones(shape=(len(sent),)) for sent in input_ids])

    config = BertConfig.from_pretrained('bert-base-uncased', output_hidden_states=True)
    bert_model = TFBertModel.from_pretrained('bert-base-uncased', config=config)

    result = bert_model(inputs={'input_ids': input_ids, 
                                'attention_mask': attention_mask, 
                                'token_type_ids': token_type_ids})
    inputs_embeds = result[-1][0]  # output of embedding lookup

Afterwards, inputs_embeds can be processed and sent as input to the same model with:

    inputs_embeds = process(inputs_embeds)  # some processing on inputs_embeds done here …

python nlp tensorflow bert-language-model huggingface-transformers

5 votes, 1 answer, 2503 views

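What does the merge.txt file mean in BERT-based models in the HuggingFace library?

I am trying to understand what the merge.txt file represents in the tokenizer of the RoBERTa model in the HuggingFace library. However, their website says nothing about it. Any help is appreciated.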
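For context, RoBERTa uses a byte-pair-encoding (BPE) tokenizer, and its merges file (named `merges.txt` in the published checkpoints) lists the learned merge rules, one pair of symbols per line, in priority order. The sketch below is a simplified illustration of how such rules are applied to one word. The three rules and the `apply_merges` helper are made up for the example, and it works on plain characters rather than the byte-level symbols the real tokenizer uses.

```python
def apply_merges(word, merges):
    """Greedily apply BPE merge rules; earlier rules have higher priority."""
    ranks = {pair: rank for rank, pair in enumerate(merges)}
    symbols = list(word)  # start from single characters
    while len(symbols) > 1:
        pairs = [(symbols[i], symbols[i + 1]) for i in range(len(symbols) - 1)]
        # Pick the adjacent pair whose rule appears earliest in the file.
        best = min(pairs, key=lambda pair: ranks.get(pair, float("inf")))
        if best not in ranks:
            break  # no applicable rule is left
        merged, i = [], 0
        while i < len(symbols):
            if i < len(symbols) - 1 and (symbols[i], symbols[i + 1]) == best:
                merged.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                merged.append(symbols[i])
                i += 1
        symbols = merged
    return symbols


# Hypothetical rules in the "left right" one-pair-per-line layout.
merges_text = """l o
lo w
e r
"""
merges = [tuple(line.split()) for line in merges_text.splitlines() if line]

print(apply_merges("lower", merges))
```

The resulting sub-word strings are what the tokenizer then looks up in its vocabulary file to obtain token ids.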

nlp tokenize bert-language-model huggingface-transformers

5 votes, 1 answer, 2043 views

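How does BertForSequenceClassification classify on the CLS vector?

Background:

Following on from this question: when using BERT to classify sequences, the model uses the "[CLS]" token to represent the classification task. According to the paper:

The first token of every sequence is always a special classification token ([CLS]). The final hidden state corresponding to this token is used as the aggregate sequence representation for classification tasks.

Looking at the Hugging Face repository, their BertForSequenceClassification uses the BERT pooler method: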

    class BertPooler(nn.Module):
        def __init__(self, config):
            super().__init__()
            self.dense = nn.Linear(config.hidden_size, config.hidden_size)
            self.activation = nn.Tanh()

        def forward(self, hidden_states):
            # We "pool" the model by simply taking the hidden state corresponding
            # to the first token.
            first_token_tensor = hidden_states[:, 0]
            pooled_output = self.dense(first_token_tensor)
            pooled_output = self.activation(pooled_output)
            return pooled_output

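We can see that they take the first token (CLS) and use it as the representation of the whole sentence. Specifically, they compute hidden_states[:, 0], which looks a lot like taking the first element of each state rather than taking the first token's hidden state?

My question:

What I don't understand is how they encode the information of the whole sentence into this token. Is the CLS token a regular token with its own embedding vector that "learns" the sentence-level representation? Why can't we just use the mean of the hidden states (the encoder output) and classify with that?

Edit: after some thought: since we predict from the CLS token's hidden state, is the CLS token embedding being trained on the classification task, because it is the token used for classification (and therefore the main contributor to the error that is propagated back to its weights)?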
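On the indexing point: for a tensor of shape (batch, sequence length, hidden size), `[:, 0]` keeps every batch entry and selects position 0 along the sequence axis, so the result is the full hidden vector of the first token, with shape (batch, hidden size). The sketch below mimics this with nested Python lists and made-up numbers, and contrasts it with the mean pooling the question proposes. `first_token` and `mean_pool` are illustrative helpers, not library functions.

```python
# hidden_states[b][t][h]: 2 sequences, 3 tokens each, hidden size 2.
# The values are invented for illustration.
hidden_states = [
    [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]],
    [[7.0, 8.0], [9.0, 10.0], [11.0, 12.0]],
]


def first_token(batch):
    """Same selection as tensor[:, 0]: the whole vector of token 0 per sequence."""
    return [sequence[0] for sequence in batch]


def mean_pool(batch):
    """Average the token vectors of each sequence, one hidden dimension at a time."""
    return [[sum(dim) / len(sequence) for dim in zip(*sequence)]
            for sequence in batch]


print(first_token(hidden_states))  # one vector of length 2 per sequence
print(mean_pool(hidden_states))    # also one vector of length 2 per sequence
```

Both poolings produce one vector per sequence, so either can feed a classifier; they differ only in which token positions contribute.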

python transformer-model bert-language-model huggingface-transformers

5 votes, 1 answer, 2875 views

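Can I use BERT as a feature extractor without any fine-tuning on my specific dataset?

I am trying to solve a multi-label classification task with 10 classes, where the relatively balanced training set consists of ~25K samples and the evaluation set of ~5K samples.

I am using Hugging Face: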

    model = transformers.BertForSequenceClassification.from_pretrained(...

and I get quite good results (ROC AUC = 0.98).

However, I am seeing some strange behavior that I cannot make sense of.

I add the following lines of code:

    for param in model.bert.parameters():
        param.requires_grad = False

while making sure that the other layers of the model are still being learned, i.e.:

    [param[0] for param in model.named_parameters() if param[1].requires_grad == True]

gives

    ['classifier.weight', 'classifier.bias']

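Training the model in this configuration yields embarrassingly poor results (ROC AUC = 0.59).

My assumption was that an out-of-the-box pretrained BERT model (without any fine-tuning) should serve as a relatively good feature extractor for the classification layer. So where am I going wrong?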

pytorch bert-language-model huggingface-transformers

5 votes, 1 answer, 2839 views

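BERT-based NER model gives inconsistent predictions when deserialized

I am trying to train an NER model with the HuggingFace transformers library on a Colab cloud GPU, pickle it, and load the model on my own CPU to make predictions.

Code

The model is as follows: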

    from transformers import BertForTokenClassification

    model = BertForTokenClassification.from_pretrained(
        "bert-base-cased",
        num_labels=NUM_LABELS,
        output_attentions = False,
        output_hidden_states = False
    )

I save the model on Colab with this snippet:

    import torch

    torch.save(model.state_dict(), FILENAME)

and then load it on my local CPU with:

    # Initiating an instance of the model type

    model_reload = BertForTokenClassification.from_pretrained(
        "bert-base-cased",
        num_labels=len(tag2idx),
        output_attentions = False,
        output_hidden_states = False
    )

    # Loading the model
    model_reload.load_state_dict(torch.load(FILENAME, map_location='cpu'))
    model_reload.eval()

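The snippet used to tokenize the text and make the actual predictions is identical on the Colab GPU notebook instance and on my CPU notebook instance.

Expected behavior

The GPU-trained model behaves correctly and classifies the following tokens perfectly: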

    O       [CLS]
    O       Good …

python pytorch bert-language-model huggingface-transformers

5 votes, 1 answer, 1836 views

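AttributeError: 'str' object has no attribute 'dim' in pytorch

When sending predictions through the model, I get the following error output in PyTorch. Does anyone know what is going on?

Below is the model architecture I created. The error output shows that the problem is in the line x = self.fc1(cls_hs).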

    class BERT_Arch(nn.Module):

        def __init__(self, bert):

          super(BERT_Arch, self).__init__()

          self.bert = bert 

          # dropout layer
          self.dropout = nn.Dropout(0.1)

          # relu activation function
          self.relu =  nn.ReLU()

          # dense layer 1
          self.fc1 = nn.Linear(768,512)

          # dense layer 2 (Output layer)
          self.fc2 = nn.Linear(512,2)

          #softmax activation function
          self.softmax = nn.LogSoftmax(dim=1)

        #define the forward pass
        def forward(self, sent_id, mask):

          #pass the inputs to the model  
          _, cls_hs = self.bert(sent_id, attention_mask=mask)
          print(mask)
          print(type(mask))

          x = self.fc1(cls_hs)

          x = self.relu(x) …
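One likely cause, assuming a transformers release (4.x) in which models return a dict-like output object by default: tuple-unpacking such an object iterates over its keys, so `cls_hs` ends up as a string rather than a tensor, and `nn.Linear` then fails when it asks for `.dim()`. The sketch below reproduces the effect with a plain `OrderedDict` standing in for the model output; the placeholder values are not real tensors.

```python
from collections import OrderedDict

# Stand-in for a dict-like model output. The keys mirror the field names of
# transformers' BERT outputs; the values are placeholder strings.
output = OrderedDict(
    last_hidden_state="<tensor of shape (batch, seq_len, 768)>",
    pooler_output="<tensor of shape (batch, 768)>",
)

# Tuple-unpacking a mapping iterates over its KEYS, not its values.
_, unpacked = output
print(type(unpacked), unpacked)

# Reading the field by name returns the value that was intended.
pooled = output["pooler_output"]
print(pooled)
```

Under that assumption, either passing `return_dict=False` in the `self.bert(...)` call or reading the `pooler_output` field of the result should restore the tensor that `fc1` expects.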

python machine-learning python-3.x tensorflow bert-language-model

5 votes, 2 answers, 6369 views

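How to interpret the BERT output of Huggingface Transformers for sequence classification with TensorFlow?

I am using BERT for a sequence classification task with 3 labels. For this I use Hugging Face transformers with TensorFlow, more specifically the TFBertForSequenceClassification class with the bert-base-german-cased model (yes, with German sentences).

I am by no means an NLP expert, which is why I am pretty much following this approach: https://towardsdatascience.com/fine-tuning-hugging-face-model-with-custom-dataset-82b8092f5333 (with some adjustments, of course).

Everything seems to work fine, but the output I get from the model throws me off. Below is some of the output along the way, for context.

The main difference between my setup and the example in the article is the number of labels. I have 3, while the article only has 2.

I use LabelEncoder from sklearn.preprocessing to process my labels: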

    label_encoder = LabelEncoder()
    Y_integer_encoded = label_encoder.fit_transform(Y)

*Y here is a list of string labels, something like this:

    ['e_3', 'e_1', 'e_2',]

which then turns into this:

    array([0, 1, 2], dtype=int64)

Then I use the BertTokenizer to process the text and create the input datasets (training and test). These are their shapes:

    <TensorSliceDataset shapes: ({input_ids: (99,), token_type_ids: (99,), attention_mask: (99,)}, ()), types: ({input_ids: tf.int32, token_type_ids: tf.int32, attention_mask: tf.int32}, tf.int32)>

Then I train the model following the Hugging Face documentation.

The last epoch of training looks like this:

    Epoch 3/3
    108/108 [==============================] - 24s 223ms/step - loss: 25.8196 - accuracy: 0.7963 - val_loss: 24.5137 - val_accuracy: 0.7243 …
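The excerpt is cut off before the model output itself, so the following is only a general sketch of how such output is usually read. It assumes the model returns raw logits, one score per label, which is what TFBertForSequenceClassification produces. A softmax turns them into probabilities, and the index of the largest one maps back to a label through the encoder's classes. The logit values are invented, and the `classes` list stands in for `label_encoder.classes_`, which LabelEncoder stores in sorted order.

```python
import math


def softmax(logits):
    """Convert raw scores into probabilities that sum to 1."""
    peak = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


# LabelEncoder assigns integer ids following the sorted class names.
classes = sorted(["e_3", "e_1", "e_2"])

logits = [-1.2, 0.3, 2.5]  # made-up model output for one sentence
probs = softmax(logits)
predicted = classes[probs.index(max(probs))]

print(probs)
print(predicted)
```

If the loss function is not told that it receives logits, the reported loss can look implausibly large even when accuracy is reasonable, which is worth checking against numbers like those above.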

python tensorflow bert-language-model huggingface-transformers

5 votes, 1 answer, 5348 views

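Identical sentences produce different vectors in XLNet

I computed the vectors of two identical sentences using XLNet through embedding-as-service. But the model produces different vector embeddings for the two identical sentences, so the cosine similarity is not 1 and the Euclidean distance is not 0. With BERT it works fine. For example, if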

    vec1 = en.encode(texts=['he is anger'],pooling='reduce_mean')
    vec2 = en.encode(texts=['he is anger'],pooling='reduce_mean')

the model (XLNet) indicates that the two sentences are not similar.

python nlp bert-language-model huggingface-transformers sentence-transformers

5 votes, 1 answer, 254 views

Tag statistics

bert-language-model ×10

huggingface-transformers ×8

nlp ×3

machine-learning ×1

sentence-transformers ×1

tensorflow-lite ×1

transformer-model ×1
