Getting sentence embedding from huggingface Feature Extraction Pipeline

Question

use*_*360 3 nlp machine-learning spacy-transformers huggingface-transformers

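How do I get an embedding for the whole sentence from Huggingface's feature extraction pipeline?

I understand how to get the features for each token (below), but how do I get the overall features for the sentence as a whole?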

from transformers import pipeline

feature_extraction = pipeline('feature-extraction', model="distilroberta-base", tokenizer="distilroberta-base")
features = feature_extraction("i am sentence")

Answer 1

den*_*ger 6

To expand on the comment I left under stackoverflowuser2010's answer, I will use "barebone" models, but the behavior is the same as with the pipeline component.

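BERT and derived models (including DistilRoberta, which is the model you are using in the pipeline) generally mark the start and end of a sentence with special tokens. The first token (written [CLS] in BERT and <s> in RoBERTa-style models such as DistilRoberta) is usually the easiest way to make predictions or generate embeddings over the entire sequence. There is a discussion within the community about which method is superior (see also the more detailed answer by stackoverflowuser2010). However, if you simply want a "quick" solution, then taking the [CLS] token is certainly a valid strategy.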

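Now, while the documentation of the FeatureExtractionPipeline isn't very clear, in your example we can easily compare the outputs, specifically their lengths, with a direct model call: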

from transformers import pipeline, AutoTokenizer

# direct encoding of the sample sentence
tokenizer = AutoTokenizer.from_pretrained('distilroberta-base')
encoded_seq = tokenizer.encode("i am sentence")

# your approach
feature_extraction = pipeline('feature-extraction', model="distilroberta-base", tokenizer="distilroberta-base")
features = feature_extraction("i am sentence")

# Compare lengths of outputs
print(len(encoded_seq)) # 5
# Note that the pipeline returns a nested list ([batch][token][hidden]), so index the batch with 0.
print(len(features[0])) # 5

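When inspecting the content of encoded_seq, you will notice that the first token has the ID 0, which denotes the beginning-of-sequence token (<s> for this tokenizer; in our case, this is the token whose output serves as the embedding). Since the output lengths are the same, you can then access a preliminary sentence embedding by doing something like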

sentence_embedding = features[0][0]

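Note that the embedding of the "[CLS]" token will be gibberish unless you fine-tune the model on a downstream task. I am assuming that if you pool the token embeddings as I suggested in my answer, the resulting sentence embedding will be meaningful without additional fine-tuning. The reason I bring this up is that the out-of-the-box "pipeline" class does not offer an API for fine-tuning the underlying model. (2 upvotes)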
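The pooling idea mentioned in the comment above could be sketched roughly as follows. This is only a minimal illustration, assuming the pipeline output has the nested-list layout [batch][token][hidden] observed in the answer. The helper name `mean_pool` and the toy input are hypothetical and not part of the transformers library; the toy list merely stands in for `features` so the snippet runs without downloading a model.

```python
from typing import List


def mean_pool(features: List[List[List[float]]]) -> List[float]:
    """Average the token vectors of the first (and only) sequence.

    `features` is assumed to be shaped [batch][token][hidden], like the
    nested list returned by the feature-extraction pipeline above.
    """
    token_vectors = features[0]  # [num_tokens][hidden_size]
    if not token_vectors:
        raise ValueError("no token vectors to pool")
    num_tokens = len(token_vectors)
    hidden_size = len(token_vectors[0])
    # Average each hidden dimension over all tokens.
    return [
        sum(vec[i] for vec in token_vectors) / num_tokens
        for i in range(hidden_size)
    ]


# Toy stand-in for the pipeline output: 1 sequence, 3 tokens, hidden size 2.
toy_features = [[[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]]
sentence_embedding = mean_pool(toy_features)
print(sentence_embedding)  # [3.0, 4.0]
```

With the real pipeline, `mean_pool(features)` would return a single vector of the model's hidden size. Note that this simple average also includes the vectors of the special start and end tokens; whether to exclude them is a design choice this sketch does not settle.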
