How to extract document embeddings from a HuggingFace Longformer

Max*_*sin 5 huggingface-transformers

I want to do something like this

import torch
from transformers import BertTokenizer, BertModel

tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
model = BertModel.from_pretrained('bert-base-uncased')
input_ids = torch.tensor(tokenizer.encode("Hello, my dog is cute")).unsqueeze(0)  # Batch size 1
outputs = model(input_ids)
last_hidden_states = outputs[0]  # The last hidden-state is the first element of the output tuple

(from this thread) but with a Longformer instead.

The example in the documentation seems to do something similar, but it is confusing (especially regarding how to set the attention mask: I assume I want global attention on the [CLS] token, whereas the example seems to set global attention on what look like arbitrary positions):

>>> import torch
>>> from transformers import LongformerModel, LongformerTokenizer

>>> model = LongformerModel.from_pretrained('allenai/longformer-base-4096', return_dict=True)
>>> tokenizer = LongformerTokenizer.from_pretrained('allenai/longformer-base-4096')

>>> SAMPLE_TEXT = ' '.join(['Hello world! '] * 1000)  # long input document
>>> input_ids = torch.tensor(tokenizer.encode(SAMPLE_TEXT)).unsqueeze(0)  # batch of size 1

>>> # Attention mask values -- 0: no attention, 1: local attention, 2: global attention
>>> attention_mask = torch.ones(input_ids.shape, dtype=torch.long, device=input_ids.device) # initialize to local attention
>>> attention_mask[:, [1, 4, 21,]] = 2  # Set global attention based on the task. For example,
...                                     # classification: the <s> token
...                                     # QA: question tokens
...                                     # LM: potentially on the beginning of sentences and paragraphs
>>> outputs = model(input_ids, attention_mask=attention_mask)
>>> sequence_output = outputs.last_hidden_state
>>> pooled_output = outputs.pooler_output
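For reference, the questioner's guess corresponds to keeping local attention (value 1, in the convention of the quoted example) everywhere and setting global attention (value 2) only on position 0, which holds the `<s>`/[CLS] token after encoding. A minimal sketch of that mask construction, using a placeholder sequence length instead of a real tokenized document:

```python
import torch

# Placeholder input; in practice use tokenizer.encode(...) as in the example above.
seq_len = 4096
input_ids = torch.zeros(1, seq_len, dtype=torch.long)

# Local attention (1) everywhere, global attention (2) only on the
# first position, i.e. the <s>/[CLS] token.
attention_mask = torch.ones(input_ids.shape, dtype=torch.long)
attention_mask[:, 0] = 2
```

Note that newer versions of `transformers` expect global attention to be passed as a separate `global_attention_mask` argument (1 = global, 0 = local) rather than as the value 2 inside `attention_mask`; check the version of the library you are running against.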

(from here)

Ram*_*ind 1

You don't need to mess with those values (unless you want to tune how the Longformer attends to different tokens). In the example above, global attention is forced on the 1st, 4th, and 21st tokens. Those indices were chosen arbitrarily for illustration, but sometimes you will want global attention on a particular type of token, e.g. the question tokens in a sequence of the form <question tokens> + <answer tokens>, with global attention on the first part only.

If you are just looking for embeddings, you can follow what is discussed here: last layer of a Longformer for document embeddings.
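One common way to turn the per-token `last_hidden_state` from the example above into a single document vector is mask-aware mean pooling. A sketch, using random tensors in place of real model outputs and assuming Longformer-base's hidden size of 768:

```python
import torch

# Stand-ins for outputs.last_hidden_state and the attention mask from the
# example above (batch 1, 10 tokens, hidden size 768).
last_hidden_state = torch.randn(1, 10, 768)
attention_mask = torch.ones(1, 10, dtype=torch.long)

# Zero out padding positions, then average over the remaining tokens.
mask = (attention_mask > 0).unsqueeze(-1).float()          # (1, 10, 1)
doc_embedding = (last_hidden_state * mask).sum(1) / mask.sum(1)  # (1, 768)
```

Using `> 0` keeps both local (1) and global (2) positions in the average under the mask convention of the quoted example. Alternatively, `outputs.pooler_output` or the hidden state at the `<s>` position can serve as the document embedding, depending on the downstream task.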