How do I get RoBERTa word embeddings?

Ile*_*Ile 6 encoding nlp word-embedding

Given a sentence such as "Roberta is a heavily optimized version of BERT", I need to get the embedding of each word in that sentence using RoBERTa. I tried looking for sample code online, but couldn't find a clear answer.

My attempt is as follows:

import torch

# load a pretrained RoBERTa model from the fairseq hub
roberta = torch.hub.load('pytorch/fairseq', 'roberta.base')
roberta.eval()
tokens = roberta.encode(headline)
all_layers = roberta.extract_features(tokens, return_all_hiddens=True)
embedding = all_layers[0]          # output of the embedding layer, shape (1, seq_len, 768)
n = embedding.size()[1] - 1
embedding = embedding[:, 1:n, :]   # keep only the word positions, dropping <s> and </s>

where embedding[:,1:n,:] is used to extract only the embeddings of the words in the sentence, excluding the start and end tokens.

Is this correct?

All*_*hvk 0

from transformers import AutoTokenizer

TOKENIZER_PATH = "../input/roberta-transformers-pytorch/roberta-base"
ROBERTA_PATH = "../input/roberta-transformers-pytorch/roberta-base"

text = "How are you? I am good."
tokenizer = AutoTokenizer.from_pretrained(TOKENIZER_PATH)

## how the words are broken into tokens
print(tokenizer.tokenize(text))

## the format of an encoding
print(tokenizer.batch_encode_plus([text]))

## OP wants the input ids
print(tokenizer.batch_encode_plus([text])['input_ids'])

## OP wants the input ids without the first and last token
print(tokenizer.batch_encode_plus([text])['input_ids'][0][1:-1])

Output:

['How', 'Ġare', 'Ġyou', '?', 'ĠI', 'Ġam', 'Ġgood', '.']

{'input_ids': [[0, 6179, 32, 47, 116, 38, 524, 205, 4, 2]], 'attention_mask': [[1, 1, 1, 1, 1, 1, 1, 1, 1, 1]]}

[[0, 6179, 32, 47, 116, 38, 524, 205, 4, 2]]

[6179, 32, 47, 116, 38, 524, 205, 4]

And don't worry about the 'Ġ' character. It just indicates that the token is preceded by a space.
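The code above only prints token ids, while the question asks for the embeddings themselves. A minimal sketch of how to get per-token vectors with the transformers AutoModel API, reusing TOKENIZER_PATH, ROBERTA_PATH and text from the code above (the layer and pooling you want is still your choice):

import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained(TOKENIZER_PATH)
model = AutoModel.from_pretrained(ROBERTA_PATH)
model.eval()

enc = tokenizer(text, return_tensors="pt")
with torch.no_grad():
    out = model(**enc)

# last_hidden_state has shape (batch, seq_len, hidden_size);
# dropping the first and last positions removes <s> and </s>
token_embeddings = out.last_hidden_state[0, 1:-1, :]
print(token_embeddings.shape)   # (number of word tokens, 768) for roberta-base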

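Note that RoBERTa's BPE tokenizer can split a single word into several tokens, so the vectors above are per-token rather than per-word. One possible way to pool them back into per-word vectors, continuing from the sketch above and assuming a fast tokenizer (word_ids() is only available on fast tokenizers):

hidden = out.last_hidden_state[0]               # (seq_len, hidden_size)
word_vectors = {}
for idx, word_id in enumerate(enc.word_ids()):  # word_id is None for <s> and </s>
    if word_id is None:
        continue
    word_vectors.setdefault(word_id, []).append(hidden[idx])

# average the subword vectors belonging to each word
word_embeddings = [torch.stack(vecs).mean(dim=0) for vecs in word_vectors.values()]
print(len(word_embeddings))                     # one vector per word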