Bla*_*awk 6 nlp bert-language-model huggingface-transformers huggingface-tokenizers sentence-transformers
I am using Huggingface's Sentence-BERT as follows:
from sentence_transformers import SentenceTransformer
model = SentenceTransformer('all-MiniLM-L6-v2')
model.max_seq_length = 512
model.encode(text)  # text is a long string that may exceed 512 tokens
When text is long and contains more than 512 tokens, no exception is thrown. I assume the input is silently truncated to 512 tokens.
How can I make it throw an exception when the input is longer than max_seq_length?
Also, what is the maximum possible max_seq_length for all-MiniLM-L6-v2?
cro*_*oik 10
First, it should be noted that the sentence transformer supports a different sequence length than its underlying transformer. You can check both values as follows:
# that's the sentence transformer
print(model.max_seq_length)
# that's the underlying transformer
print(model[0].auto_model.config.max_position_embeddings)
Output:
256
512
That means the position embedding layer of the underlying transformer has 512 weights, but the sentence transformer only uses, and was only trained with, the first 256 of them. Therefore, you should be careful about increasing the value above 256. It will work from a technical perspective, but the position embedding weights beyond 256 are not properly trained and can therefore degrade your results. Please also check this StackOverflow post.
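If you still want to feed longer inputs despite that caveat, a minimal sketch (my own illustration, not part of the original answer) is to raise the sentence transformer's limit up to the underlying transformer's position-embedding size:

# Sketch: raise the sentence-transformer limit to the underlying
# transformer's position-embedding size (512 for this model).
# Positions beyond 256 were not trained, so quality may suffer.
max_pos = model[0].auto_model.config.max_position_embeddings
model.max_seq_length = max_pos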
Regarding throwing an exception, I think that is not offered by the library, so you have to write a workaround yourself:
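For example, here is a minimal sketch of such a check (the helper name encode_strict is my own, not part of the library; it simply counts tokens with the model's tokenizer before calling encode):

from sentence_transformers import SentenceTransformer

model = SentenceTransformer('all-MiniLM-L6-v2')

def encode_strict(model, text):
    # Count the tokens the model's own tokenizer produces
    # (this includes the special [CLS] and [SEP] tokens).
    n_tokens = len(model.tokenizer(text)['input_ids'])
    if n_tokens > model.max_seq_length:
        raise ValueError(
            f"Input has {n_tokens} tokens, which exceeds "
            f"max_seq_length={model.max_seq_length}"
        )
    return model.encode(text)

embedding = encode_strict(model, "some possibly very long text ...")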