How to do Tokenizer batch processing? - HuggingFace

Luc*_*edo 4 tokenize batch-processing pytorch huggingface-transformers huggingface-tokenizers

In Hugging Face's Tokenizer documentation, the call method accepts List[List[str]] and says:


text (str, List[str], List[List[str]], optional) - The sequence or batch of sequences to be encoded. Each sequence can be a string or a list of strings (pretokenized string). If the sequences are provided as list of strings (pretokenized), you must set is_split_into_words=True (to lift the ambiguity with a batch of sequences).


If I run the following, everything works fine:

 test = ["hello this is a test", "that transforms a list of sentences", "into a list of list of sentences", "in order to emulate, in this case, two batches of the same lenght", "to be tokenized by the hf tokenizer for the defined model"]\n tokenizer = AutoTokenizer.from_pretrained(\'distilbert-base-uncased-finetuned-sst-2-english\')\n tokenized_test = tokenizer(text=test, padding="max_length", is_split_into_words=False, truncation=True, return_tensors="pt")\n
Run Code Online (Sandbox Code Playgroud)\n

But if I try to emulate batches of sentences:

 test = ["hello this is a test", "that transforms a list of sentences", "into a list of list of sentences", "in order to emulate, in this case, two batches of the same lenght", "to be tokenized by the hf tokenizer for the defined model"]\n test = [test, test]\n tokenizer = AutoTokenizer.from_pretrained(\'distilbert-base-uncased-finetuned-sst-2-english\')\n tokenized_test = tokenizer(text=test, padding="max_length", is_split_into_words=False, truncation=True, return_tensors="pt")\n
Run Code Online (Sandbox Code Playgroud)\n

I get:

```
Traceback (most recent call last):
  File "/Users/lucazeve/Coding/WxCC_Sentiment_Analysis/modify_scores.py", line 53, in <module>
    tokenized_test = tokenizer(text=test, padding="max_length", is_split_into_words=False, truncation=True, return_tensors="pt")
  File "/Users/lucazeve/Coding/WxCC_Sentiment_Analysis/venv/lib/python3.10/site-packages/transformers/tokenization_utils_base.py", line 2548, in __call__
    encodings = self._call_one(text=text, text_pair=text_pair, **all_kwargs)
  File "/Users/lucazeve/Coding/WxCC_Sentiment_Analysis/venv/lib/python3.10/site-packages/transformers/tokenization_utils_base.py", line 2634, in _call_one
    return self.batch_encode_plus(
  File "/Users/lucazeve/Coding/WxCC_Sentiment_Analysis/venv/lib/python3.10/site-packages/transformers/tokenization_utils_base.py", line 2825, in batch_encode_plus
    return self._batch_encode_plus(
  File "/Users/lucazeve/Coding/WxCC_Sentiment_Analysis/venv/lib/python3.10/site-packages/transformers/tokenization_utils_fast.py", line 428, in _batch_encode_plus
    encodings = self._tokenizer.encode_batch(
TypeError: TextEncodeInput must be Union[TextInputSequence, Tuple[InputSequence, InputSequence]]
```

Is the documentation wrong? I just need a way to tokenize and predict in batches; it shouldn't be that hard.


Does this have anything to do with the is_split_into_words argument?


Context


I will feed this into a sentiment scoring model (the model defined in the snippet). I'm running into OOM issues at prediction time, so I need to feed the data to the model in batches.


The documentation (quoted above) states that I can provide a List[List[str]] to the tokenizer, but that doesn't seem to be the case. The question stays the same: how do I tokenize batches of sentences?


Note: I don't strictly need the tokenization process itself to run in batches, although producing a batch of tokens/attention masks would solve my actual problem: making batched predictions with the model like so:

```python
with torch.no_grad():
    logits = model(**tokenized_test).logits
```

alv*_*vas 7

How to tokenize a list of sentences?


If it's just about tokenizing a list of sentences, do this:

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained('distilbert-base-uncased-finetuned-sst-2-english')

test = ["hello this is a test", "that transforms a list of sentences", "into a list of list of sentences", "in order to emulate, in this case, two batches of the same lenght", "to be tokenized by the hf tokenizer for the defined model"]

tokenizer(test)
```

It handles the batching automatically:

```
{'input_ids': [
 [101, 7592, 2023, 2003, 1037, 3231, 102], [101, 2008, 21743, 1037, 2862, 1997, 11746, 102],
 [101, 2046, 1037, 2862, 1997, 2862, 1997, 11746, 102],
 [101, 1999, 2344, 2000, 7861, 9869, 1010, 1999, 2023, 2553, 1010, 2048, 14108, 2229, 1997, 1996, 2168, 18798, 13900, 102],
 [101, 2000, 2022, 19204, 3550, 2011, 1996, 1044, 2546, 19204, 17629, 2005, 1996, 4225, 2944, 102]],

'attention_mask': [[1, 1, 1, 1, 1, 1, 1], [1, 1, 1, 1, 1, 1, 1, 1], [1, 1, 1, 1, 1, 1, 1, 1, 1], [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1], [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]]}
```

How to use it with AutoModelForSequenceClassification?


To use it with AutoModelForSequenceClassification.from_pretrained('distilbert-base-uncased-finetuned-sst-2-english'), it goes like this:

```python
from transformers import AutoTokenizer, AutoModelForSequenceClassification

tokenizer = AutoTokenizer.from_pretrained('distilbert-base-uncased-finetuned-sst-2-english')
model = AutoModelForSequenceClassification.from_pretrained('distilbert-base-uncased-finetuned-sst-2-english')

test = ["hello this is a test", "that transforms a list of sentences", "into a list of list of sentences", "in order to emulate, in this case, two batches of the same lenght", "to be tokenized by the hf tokenizer for the defined model"]

model(**tokenizer(test, return_tensors='pt', padding=True, truncation=True))
```

[out]:

```
SequenceClassifierOutput(loss=None, logits=tensor([[ 1.5094, -1.2056],
        [-3.4114,  3.5229],
        [ 1.8835, -1.6886],
        [ 3.0780, -2.5745],
        [ 2.5383, -2.1984]], grad_fn=<AddmmBackward0>), hidden_states=None, attentions=None)
```
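If you want class labels and probabilities instead of the raw logits, a minimal follow-up sketch (assuming the model, tokenizer and test objects from the snippet above; the label names come from the model's id2label config):

```python
import torch

with torch.no_grad():
    out = model(**tokenizer(test, return_tensors='pt', padding=True, truncation=True))

probs = torch.softmax(out.logits, dim=-1)   # per-class probabilities
pred_ids = probs.argmax(dim=-1)             # predicted class index per sentence
labels = [model.config.id2label[i] for i in pred_ids.tolist()]  # e.g. 'NEGATIVE' / 'POSITIVE'

for sent, label, p in zip(test, labels, probs.max(dim=-1).values.tolist()):
    print(label, round(p, 4), sent)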

How to use the distilbert-base-uncased-finetuned-sst-2-english model for sentiment classification?

```python
from transformers import AutoTokenizer, AutoModelForSequenceClassification, pipeline

tokenizer = AutoTokenizer.from_pretrained('distilbert-base-uncased-finetuned-sst-2-english')
model = AutoModelForSequenceClassification.from_pretrained('distilbert-base-uncased-finetuned-sst-2-english')

classifier = pipeline('sentiment-analysis', model=model, tokenizer=tokenizer)

text = ['hello this is a test',
 'that transforms a list of sentences',
 'into a list of list of sentences',
 'in order to emulate, in this case, two batches of the same lenght',
 'to be tokenized by the hf tokenizer for the defined model']

classifier(text)
```

[out]:

```
[{'label': 'NEGATIVE', 'score': 0.9379092454910278},
 {'label': 'POSITIVE', 'score': 0.9990271329879761},
 {'label': 'NEGATIVE', 'score': 0.9726701378822327},
 {'label': 'NEGATIVE', 'score': 0.9965035915374756},
 {'label': 'NEGATIVE', 'score': 0.9913086891174316}]
```

What about OOM issues on the GPU?


If it's distilbert-base-uncased-finetuned-sst-2-english, you should be able to just use the CPU; you shouldn't run into many OOM issues that way.
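For example, a minimal sketch that keeps everything on the CPU (assuming the model, tokenizer and test objects from the snippets above; nothing is ever moved to a GPU):

```python
import torch

model = model.to("cpu")   # make sure the model weights live on the CPU
model.eval()

inputs = tokenizer(test, padding=True, truncation=True, return_tensors="pt")

with torch.no_grad():
    logits = model(**inputs).logits
```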


If you need to use a GPU, consider running inference with pipeline(...), which comes with a batch_size option, e.g.

```python
from transformers import AutoTokenizer, AutoModelForSequenceClassification, pipeline

tokenizer = AutoTokenizer.from_pretrained('distilbert-base-uncased-finetuned-sst-2-english')
model = AutoModelForSequenceClassification.from_pretrained('distilbert-base-uncased-finetuned-sst-2-english')

classifier = pipeline('sentiment-analysis', model=model, tokenizer=tokenizer)

text = ['hello this is a test',
 'that transforms a list of sentences',
 'into a list of list of sentences',
 'in order to emulate, in this case, two batches of the same lenght',
 'to be tokenized by the hf tokenizer for the defined model']

classifier(text, batch_size=2, truncation="only_first")
```

When you face OOM issues, it's usually not the tokenizer that's causing the problem, unless you load the full, large dataset onto the device at once.


If it's just that the model can't predict when you feed it a large dataset, consider using pipeline instead of model(**tokenizer(text)).
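If you prefer to keep the explicit model(**tokenizer(...)) call, a minimal chunking sketch along the lines of what the question asks for (the batch size of 8 is just an illustrative value; reuse the model, tokenizer and text objects from the snippet above):

```python
import torch

batch_size = 8          # illustrative; tune to whatever fits in memory
all_logits = []

for i in range(0, len(text), batch_size):
    chunk = text[i:i + batch_size]   # one batch of raw sentences
    enc = tokenizer(chunk, padding=True, truncation=True, return_tensors="pt")
    with torch.no_grad():
        all_logits.append(model(**enc).logits)

logits = torch.cat(all_logits, dim=0)   # (num_sentences, num_labels)
```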


Take a look at https://huggingface.co/docs/transformers/main_classes/pipelines#pipeline-batching
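That page streams inputs through the pipeline with an explicit batch_size; the pattern looks roughly like this (a sketch based on the linked docs, assuming the datasets library is installed and a GPU is available as device 0):

```python
from transformers import pipeline
from transformers.pipelines.pt_utils import KeyDataset
import datasets

# load a text dataset and stream it through the pipeline batch by batch
dataset = datasets.load_dataset("imdb", name="plain_text", split="unsupervised")
pipe = pipeline("text-classification", device=0)

# only batch_size examples are tokenized and moved to the GPU at a time
for out in pipe(KeyDataset(dataset, "text"), batch_size=8, truncation="only_first"):
    print(out)
```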


If the question is about the is_split_into_words argument, then from the documentation:


text (str, List[str], List[List[str]], optional) - The sequence or batch of sequences to be encoded. Each sequence can be a string or a list of strings (pretokenized string). If the sequences are provided as list of strings (pretokenized), you must set is_split_into_words=True (to lift the ambiguity with a batch of sequences).


and from the code:

```python
if is_split_into_words:
    is_batched = isinstance(text, (list, tuple)) and text and isinstance(text[0], (list, tuple))
else:
    is_batched = isinstance(text, (list, tuple))
```

If we check whether your input counts as is_batched:

text = ["hello", "this", "is a test"]\nisinstance(text, (list, tuple)) and text and isinstance(text[0], (list, tuple))\n
Run Code Online (Sandbox Code Playgroud)\n

[out]:

```
False
```

But when you wrap the tokens in another list,

text = [["hello", "this", "is a test"]]\nisinstance(text, (list, tuple)) and text and isinstance(text[0], (list, tuple))\n
Run Code Online (Sandbox Code Playgroud)\n

[out]:

```
True
```

So using the tokenizer with is_split_into_words=True, such that batching works properly, would look like this:

```python
from transformers import AutoTokenizer
from sacremoses import MosesTokenizer

moses = MosesTokenizer()
sentences = ["this is a test", "hello world"]
pretokenized_sents = [moses.tokenize(s) for s in sentences]

tokenizer = AutoTokenizer.from_pretrained('distilbert-base-uncased-finetuned-sst-2-english')

tokenizer(
  text=pretokenized_sents,
  padding="max_length",
  is_split_into_words=True,
  truncation=True,
  return_tensors="pt"
)
```

[out]:

```
{'input_ids': tensor([[ 101, 2023, 2003,  ...,    0,    0,    0],
        [ 101, 7592, 2088,  ...,    0,    0,    0]]), 'attention_mask': tensor([[1, 1, 1,  ..., 0, 0, 0],
        [1, 1, 1,  ..., 0, 0, 0]])}
```

Note: The is_split_into_words argument is not for handling batches of sentences; it is there to indicate that the input to the tokenizer has already been pre-tokenized.
