How to do Tokenizer batch processing? - HuggingFace

Luc*_*edo 4 tokenize batch-processing pytorch huggingface-transformers huggingface-tokenizers

In Hugging Face's Tokenizer documentation, the call method accepts List[List[str]] and says:


text (str, List[str], List[List[str]], optional) - The sequence or batch of sequences to be encoded. Each sequence can be a string or a list of strings (pretokenized string). If the sequences are provided as list of strings (pretokenized), you must set is_split_into_words=True (to lift the ambiguity with a batch of sequences).


If I run the following, everything works fine:

 test = ["hello this is a test", "that transforms a list of sentences", "into a list of list of sentences", "in order to emulate, in this case, two batches of the same lenght", "to be tokenized by the hf tokenizer for the defined model"]\n tokenizer = AutoTokenizer.from_pretrained(\'distilbert-base-uncased-finetuned-sst-2-english\')\n tokenized_test = tokenizer(text=test, padding="max_length", is_split_into_words=False, truncation=True, return_tensors="pt")\n
Run Code Online (Sandbox Code Playgroud)\n

But if I try to emulate batches of sentences:

 test = ["hello this is a test", "that transforms a list of sentences", "into a list of list of sentences", "in order to emulate, in this case, two batches of the same lenght", "to be tokenized by the hf tokenizer for the defined model"]\n test = [test, test]\n tokenizer = AutoTokenizer.from_pretrained(\'distilbert-base-uncased-finetuned-sst-2-english\')\n tokenized_test = tokenizer(text=test, padding="max_length", is_split_into_words=False, truncation=True, return_tensors="pt")\n
Run Code Online (Sandbox Code Playgroud)\n

I get:

```
Traceback (most recent call last):
  File "/Users/lucazeve/Coding/WxCC_Sentiment_Analysis/modify_scores.py", line 53, in <module>
    tokenized_test = tokenizer(text=test, padding="max_length", is_split_into_words=False, truncation=True, return_tensors="pt")
  File "/Users/lucazeve/Coding/WxCC_Sentiment_Analysis/venv/lib/python3.10/site-packages/transformers/tokenization_utils_base.py", line 2548, in __call__
    encodings = self._call_one(text=text, text_pair=text_pair, **all_kwargs)
  File "/Users/lucazeve/Coding/WxCC_Sentiment_Analysis/venv/lib/python3.10/site-packages/transformers/tokenization_utils_base.py", line 2634, in _call_one
    return self.batch_encode_plus(
  File "/Users/lucazeve/Coding/WxCC_Sentiment_Analysis/venv/lib/python3.10/site-packages/transformers/tokenization_utils_base.py", line 2825, in batch_encode_plus
    return self._batch_encode_plus(
  File "/Users/lucazeve/Coding/WxCC_Sentiment_Analysis/venv/lib/python3.10/site-packages/transformers/tokenization_utils_fast.py", line 428, in _batch_encode_plus
    encodings = self._tokenizer.encode_batch(
TypeError: TextEncodeInput must be Union[TextInputSequence, Tuple[InputSequence, InputSequence]]
```

Is the documentation wrong? I just need a way to tokenize and predict in batches; it shouldn't be that hard.


Does this have anything to do with the is_split_into_words argument?


Context


I will feed this into a sentiment scoring model (the model defined in the snippet). I'm running into OOM issues at prediction time, so I need to feed the data to the model in batches.


The documentation (quoted above) states that I can provide a List[List[str]] to the tokenizer, but that doesn't seem to be the case. The question stays the same: how do I tokenize batches of sentences?


Note: I don't strictly need the tokenization process itself to run in batches, although producing a batch of tokens/attention masks would solve my actual problem: making batched predictions with the model like so:

```python
with torch.no_grad():
    logits = model(**tokenized_test).logits
```

alv*_*vas 7

How to tokenize a list of sentences?


If it's just about tokenizing a list of sentences, do this:

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained('distilbert-base-uncased-finetuned-sst-2-english')

test = ["hello this is a test", "that transforms a list of sentences", "into a list of list of sentences", "in order to emulate, in this case, two batches of the same lenght", "to be tokenized by the hf tokenizer for the defined model"]

tokenizer(test)
```

It handles the batching automatically:

```
{'input_ids': [
 [101, 7592, 2023, 2003, 1037, 3231, 102], [101, 2008, 21743, 1037, 2862, 1997, 11746, 102],
 [101, 2046, 1037, 2862, 1997, 2862, 1997, 11746, 102],
 [101, 1999, 2344, 2000, 7861, 9869, 1010, 1999, 2023, 2553, 1010, 2048, 14108, 2229, 1997, 1996, 2168, 18798, 13900, 102],
 [101, 2000, 2022, 19204, 3550, 2011, 1996, 1044, 2546, 19204, 17629, 2005, 1996, 4225, 2944, 102]],

'attention_mask': [[1, 1, 1, 1, 1, 1, 1], [1, 1, 1, 1, 1, 1, 1, 1], [1, 1, 1, 1, 1, 1, 1, 1, 1], [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1], [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]]}
```

How to use it with AutoModelForSequenceClassification?


To use it with AutoModelForSequenceClassification.from_pretrained('distilbert-base-uncased-finetuned-sst-2-english'), it goes like this:

```python
from transformers import AutoTokenizer, AutoModelForSequenceClassification

tokenizer = AutoTokenizer.from_pretrained('distilbert-base-uncased-finetuned-sst-2-english')
model = AutoModelForSequenceClassification.from_pretrained('distilbert-base-uncased-finetuned-sst-2-english')

test = ["hello this is a test", "that transforms a list of sentences", "into a list of list of sentences", "in order to emulate, in this case, two batches of the same lenght", "to be tokenized by the hf tokenizer for the defined model"]

model(**tokenizer(test, return_tensors='pt', padding=True, truncation=True))
```

[out]:

```
SequenceClassifierOutput(loss=None, logits=tensor([[ 1.5094, -1.2056],
        [-3.4114,  3.5229],
        [ 1.8835, -1.6886],
        [ 3.0780, -2.5745],
        [ 2.5383, -2.1984]], grad_fn=<AddmmBackward0>), hidden_states=None, attentions=None)
```
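If you want class labels and probabilities instead of the raw logits, a minimal follow-up sketch (assuming the model, tokenizer and test objects from the snippet above; the label names come from the model's id2label config):

```python
import torch

with torch.no_grad():
    out = model(**tokenizer(test, return_tensors='pt', padding=True, truncation=True))

probs = torch.softmax(out.logits, dim=-1)   # per-class probabilities
pred_ids = probs.argmax(dim=-1)             # predicted class index per sentence
labels = [model.config.id2label[i] for i in pred_ids.tolist()]  # e.g. 'NEGATIVE' / 'POSITIVE'

for sent, label, p in zip(test, labels, probs.max(dim=-1).values.tolist()):
    print(label, round(p, 4), sent)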

How to use the distilbert-base-uncased-finetuned-sst-2-english model for sentiment classification?

```python
from transformers import AutoTokenizer, AutoModelForSequenceClassification, pipeline

tokenizer = AutoTokenizer.from_pretrained('distilbert-base-uncased-finetuned-sst-2-english')
model = AutoModelForSequenceClassification.from_pretrained('distilbert-base-uncased-finetuned-sst-2-english')

classifier = pipeline('sentiment-analysis', model=model, tokenizer=tokenizer)

text = ['hello this is a test',
 'that transforms a list of sentences',
 'into a list of list of sentences',
 'in order to emulate, in this case, two batches of the same lenght',
 'to be tokenized by the hf tokenizer for the defined model']

classifier(text)
```

[out]:

```
[{'label': 'NEGATIVE', 'score': 0.9379092454910278},
 {'label': 'POSITIVE', 'score': 0.9990271329879761},
 {'label': 'NEGATIVE', 'score': 0.9726701378822327},
 {'label': 'NEGATIVE', 'score': 0.9965035915374756},
 {'label': 'NEGATIVE', 'score': 0.9913086891174316}]
```

What about OOM issues on the GPU?


If it's distilbert-base-uncased-finetuned-sst-2-english, you should be able to just use the CPU; you shouldn't run into many OOM issues that way.
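For example, a minimal sketch that keeps everything on the CPU (assuming the model, tokenizer and test objects from the snippets above; nothing is ever moved to a GPU):

```python
import torch

model = model.to("cpu")   # make sure the model weights live on the CPU
model.eval()

inputs = tokenizer(test, padding=True, truncation=True, return_tensors="pt")

with torch.no_grad():
    logits = model(**inputs).logits
```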


If you need to use a GPU, consider running inference with pipeline(...), which comes with a batch_size option, e.g.

```python
from transformers import AutoTokenizer, AutoModelForSequenceClassification, pipeline

tokenizer = AutoTokenizer.from_pretrained('distilbert-base-uncased-finetuned-sst-2-english')
model = AutoModelForSequenceClassification.from_pretrained('distilbert-base-uncased-finetuned-sst-2-english')

classifier = pipeline('sentiment-analysis', model=model, tokenizer=tokenizer)

text = ['hello this is a test',
 'that transforms a list of sentences',
 'into a list of list of sentences',
 'in order to emulate, in this case, two batches of the same lenght',
 'to be tokenized by the hf tokenizer for the defined model']

classifier(text, batch_size=2, truncation="only_first")
```

When you face OOM issues, it's usually not the tokenizer that's causing the problem, unless you load the full, large dataset onto the device at once.


If it's just that the model can't predict when you feed it a large dataset, consider using pipeline instead of model(**tokenizer(text)).
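If you prefer to keep the explicit model(**tokenizer(...)) call, a minimal chunking sketch along the lines of what the question asks for (the batch size of 8 is just an illustrative value; reuse the model, tokenizer and text objects from the snippet above):

```python
import torch

batch_size = 8          # illustrative; tune to whatever fits in memory
all_logits = []

for i in range(0, len(text), batch_size):
    chunk = text[i:i + batch_size]   # one batch of raw sentences
    enc = tokenizer(chunk, padding=True, truncation=True, return_tensors="pt")
    with torch.no_grad():
        all_logits.append(model(**enc).logits)

logits = torch.cat(all_logits, dim=0)   # (num_sentences, num_labels)
```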


Take a look at https://huggingface.co/docs/transformers/main_classes/pipelines#pipeline-batching
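That page streams inputs through the pipeline with an explicit batch_size; the pattern looks roughly like this (a sketch based on the linked docs, assuming the datasets library is installed and a GPU is available as device 0):

```python
from transformers import pipeline
from transformers.pipelines.pt_utils import KeyDataset
import datasets

# load a text dataset and stream it through the pipeline batch by batch
dataset = datasets.load_dataset("imdb", name="plain_text", split="unsupervised")
pipe = pipeline("text-classification", device=0)

# only batch_size examples are tokenized and moved to the GPU at a time
for out in pipe(KeyDataset(dataset, "text"), batch_size=8, truncation="only_first"):
    print(out)
```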


If the question is about the is_split_into_words argument, then from the documentation:


text (str, List[str], List[List[str]], optional) - The sequence or batch of sequences to be encoded. Each sequence can be a string or a list of strings (pretokenized string). If the sequences are provided as list of strings (pretokenized), you must set is_split_into_words=True (to lift the ambiguity with a batch of sequences).


and from the code:

```python
if is_split_into_words:
    is_batched = isinstance(text, (list, tuple)) and text and isinstance(text[0], (list, tuple))
else:
    is_batched = isinstance(text, (list, tuple))
```

If we check whether your input counts as is_batched:

text = ["hello", "this", "is a test"]\nisinstance(text, (list, tuple)) and text and isinstance(text[0], (list, tuple))\n
Run Code Online (Sandbox Code Playgroud)\n

[out]:

```
False
```

But when you wrap the tokens in another list,

text = [["hello", "this", "is a test"]]\nisinstance(text, (list, tuple)) and text and isinstance(text[0], (list, tuple))\n
Run Code Online (Sandbox Code Playgroud)\n

[out]:

```
True
```

So using the tokenizer with is_split_into_words=True, such that batching works properly, would look like this:

```python
from transformers import AutoTokenizer
from sacremoses import MosesTokenizer

moses = MosesTokenizer()
sentences = ["this is a test", "hello world"]
pretokenized_sents = [moses.tokenize(s) for s in sentences]

tokenizer = AutoTokenizer.from_pretrained('distilbert-base-uncased-finetuned-sst-2-english')

tokenizer(
  text=pretokenized_sents,
  padding="max_length",
  is_split_into_words=True,
  truncation=True,
  return_tensors="pt"
)
```

[out]:

```
{'input_ids': tensor([[ 101, 2023, 2003,  ...,    0,    0,    0],
        [ 101, 7592, 2088,  ...,    0,    0,    0]]), 'attention_mask': tensor([[1, 1, 1,  ..., 0, 0, 0],
        [1, 1, 1,  ..., 0, 0, 0]])}
```

Note: The is_split_into_words argument is not for handling batches of sentences; it is there to indicate that the input to the tokenizer has already been pre-tokenized.
