How can I use BertForMaskedLM or BertModel to calculate the perplexity of a sentence?

Kai*_*ong 9 nlp transformer-model pytorch bert-language-model huggingface-transformers

I want to use BertForMaskedLM or BertModel to calculate the perplexity of a sentence, so I wrote code like this:

import numpy as np
import torch
import torch.nn as nn
from transformers import BertTokenizer, BertForMaskedLM

# Load pre-trained model (weights)
with torch.no_grad():
    model = BertForMaskedLM.from_pretrained('hfl/chinese-bert-wwm-ext')
    model.eval()
    # Load pre-trained model tokenizer (vocabulary)
    tokenizer = BertTokenizer.from_pretrained('hfl/chinese-bert-wwm-ext')
    sentence = "我不会忘记和你一起奋斗的时光。"
    tokenize_input = tokenizer.tokenize(sentence)
    tensor_input = torch.tensor([tokenizer.convert_tokens_to_ids(tokenize_input)])
    sen_len = len(tokenize_input)
    sentence_loss = 0.

    for i, word in enumerate(tokenize_input):
        # add mask to i-th character of the sentence
        tokenize_input[i] = '[MASK]'
        mask_input = torch.tensor([tokenizer.convert_tokens_to_ids(tokenize_input)])

        output = model(mask_input)

        prediction_scores = output[0]
        softmax = nn.Softmax(dim=0)
        ps = softmax(prediction_scores[0, i]).log()
        word_loss = ps[tensor_input[0, i]]
        sentence_loss += word_loss.item()

        tokenize_input[i] = word

    ppl = np.exp(-sentence_loss / sen_len)
    print(ppl)
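In other words, the loop estimates the masked-LM pseudo-perplexity of the sentence:

    PPL(S) = exp( -(1/N) * sum_{i=1..N} log P(w_i | w_1, ..., [MASK]_i, ..., w_N) )

where N is the number of tokens and each term is the model's probability of the original token at the position that has been replaced by [MASK].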

I think this code is correct, but I also noticed that BertForMaskedLM takes a masked_lm_labels argument. Can I use that argument to compute the sentence PPL more easily? I know that input_ids is the masked input and masked_lm_labels is the desired output, but I can't understand what the loss it returns actually means. The relevant code is:

if masked_lm_labels is not None:
    loss_fct = CrossEntropyLoss()  # -100 index = padding token
    masked_lm_loss = loss_fct(prediction_scores.view(-1, self.config.vocab_size),
                              masked_lm_labels.view(-1))
    outputs = (masked_lm_loss,) + outputs

小智 8

Yes, you can use the labels parameter (or masked_lm_labels; I believe the argument name differs between versions of Hugging Face transformers) to specify the masked token positions, and use -100 to ignore the tokens that you don't want included in the loss computation. For example,

sentence = '我爱你'
from transformers import BertTokenizer, BertForMaskedLM
import torch
import numpy as np

tokenizer = BertTokenizer(vocab_file='vocab.txt')
model = BertForMaskedLM.from_pretrained('bert-base-chinese')

tensor_input = tokenizer.encode(sentence, return_tensors='pt')
# tensor([[ 101, 2769, 4263,  872,  102]])

repeat_input = tensor_input.repeat(tensor_input.size(-1) - 2, 1)
# tensor([[ 101, 2769, 4263,  872,  102],
#         [ 101, 2769, 4263,  872,  102],
#         [ 101, 2769, 4263,  872,  102]])

mask = torch.ones(tensor_input.size(-1) - 1).diag(1)[:-2]
# tensor([[0., 1., 0., 0., 0.],
#         [0., 0., 1., 0., 0.],
#         [0., 0., 0., 1., 0.]])

masked_input = repeat_input.masked_fill(mask == 1, 103)
# tensor([[ 101,  103, 4263,  872,  102],
#         [ 101, 2769,  103,  872,  102],
#         [ 101, 2769, 4263,  103,  102]])

labels = repeat_input.masked_fill(masked_input != 103, -100)
# tensor([[-100, 2769, -100, -100, -100],
#         [-100, -100, 4263, -100, -100],
#         [-100, -100, -100,  872, -100]])

loss, _ = model(masked_input, masked_lm_labels=labels)

score = np.exp(loss.item())
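The loss that comes back here is the mean cross-entropy over the masked positions only: nn.CrossEntropyLoss uses ignore_index=-100 by default, so every position whose label is -100 is skipped and the remaining ones are averaged. That is why np.exp(loss.item()) gives the per-token (pseudo-)perplexity. A minimal, BERT-free sketch of that behaviour (shapes and values are arbitrary, chosen only for illustration):

import torch
import torch.nn as nn

loss_fct = nn.CrossEntropyLoss()        # ignore_index defaults to -100
logits = torch.randn(3, 5)              # 3 token positions, vocabulary of size 5
targets = torch.tensor([2, -100, 4])    # the middle position is excluded from the loss

loss = loss_fct(logits, targets)

# Same value computed by hand: average the negative log-probabilities
# of the two labelled positions only.
log_probs = logits.log_softmax(dim=-1)
manual = -(log_probs[0, 2] + log_probs[2, 4]) / 2
print(loss.item(), manual.item())       # the two values match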

As a function:

def score(model, tokenizer, sentence, mask_token_id=103):
    tensor_input = tokenizer.encode(sentence, return_tensors='pt')
    repeat_input = tensor_input.repeat(tensor_input.size(-1) - 2, 1)
    mask = torch.ones(tensor_input.size(-1) - 1).diag(1)[:-2]
    masked_input = repeat_input.masked_fill(mask == 1, mask_token_id)
    labels = repeat_input.masked_fill(masked_input != mask_token_id, -100)
    loss, _ = model(masked_input, masked_lm_labels=labels)
    result = np.exp(loss.item())
    return result

score(model, tokenizer, '我爱你')  # returns 45.63794545581973
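Note that this was written against an older transformers release, where the argument is called masked_lm_labels and the model call returns a plain tuple. On recent (4.x) releases the argument is labels and the forward pass returns a model output object, so a rough equivalent, sketched under that assumption and with the hard-coded 103 replaced by tokenizer.mask_token_id, might look like this:

def score(model, tokenizer, sentence):
    # Same masking scheme as above, written against the transformers 4.x API.
    tensor_input = tokenizer.encode(sentence, return_tensors='pt')
    repeat_input = tensor_input.repeat(tensor_input.size(-1) - 2, 1)
    mask = torch.ones(tensor_input.size(-1) - 1).diag(1)[:-2]
    masked_input = repeat_input.masked_fill(mask == 1, tokenizer.mask_token_id)
    labels = repeat_input.masked_fill(masked_input != tokenizer.mask_token_id, -100)
    with torch.no_grad():
        output = model(masked_input, labels=labels)
    # output.loss is the mean cross-entropy over the masked positions.
    return np.exp(output.loss.item())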

  • Hi @AshwinGeetD'Sa, we obtain the perplexity of the sentence by masking one token at a time and averaging the loss over all of those steps. The OP does this with a for loop; I simply stack the inputs for every step into a single batch and feed that to the model. (3 upvotes)