In the Hugging Face Tokenizer documentation, the `__call__` function accepts `List[List[str]]` and states:

> text (str, List[str], List[List[str]], optional): The sequence or batch of sequences to be encoded. Each sequence can be a string or a list of strings (pretokenized string). If the sequences are provided as a list of strings (pretokenized), you must set is_split_into_words=True (to lift the ambiguity with a batch of sequences).
If I run the following, everything works fine:
```python
from transformers import AutoTokenizer

test = ["hello this is a test", "that transforms a list of sentences", "into a list of list of sentences", "in order to emulate, in this case, two batches of the same lenght", "to be tokenized by the hf tokenizer for the defined model"]
tokenizer = AutoTokenizer.from_pretrained('distilbert-base-uncased-finetuned-sst-2-english')
tokenized_test = tokenizer(text=test, padding="max_length", is_split_into_words=False, truncation=True, return_tensors="pt")
```
But if I try to emulate batches of sentences:
```python
from transformers import AutoTokenizer

test = ["hello this is a test", "that transforms a list of sentences", "into a list of list of sentences", "in order to emulate, in this case, two batches of the same lenght", "to be tokenized by the hf tokenizer for the defined model"]
test = [test, test]
tokenizer = AutoTokenizer.from_pretrained('distilbert-base-uncased-finetuned-sst-2-english')
tokenized_test = tokenizer(text=test, padding="max_length", is_split_into_words=False, truncation=True, return_tensors="pt")
```
I get:
```
Traceback (most recent call last):
  File "/Users/lucazeve/Coding/WxCC_Sentiment_Analysis/modify_scores.py", line 53, in <module>
    tokenized_test = tokenizer(text=test, padding="max_length", is_split_into_words=False, truncation=True, return_tensors="pt")
  File "/Users/lucazeve/Coding/WxCC_Sentiment_Analysis/venv/lib/python3.10/site-packages/transformers/tokenization_utils_base.py", line 2548, in __call__
    encodings = self._call_one(text=text, text_pair=text_pair, **all_kwargs)
  File "/Users/lucazeve/Coding/WxCC_Sentiment_Analysis/venv/lib/python3.10/site-packages/transformers/tokenization_utils_base.py", line 2634, in _call_one
    return self.batch_encode_plus(
  File "/Users/lucazeve/Coding/WxCC_Sentiment_Analysis/venv/lib/python3.10/site-packages/transformers/tokenization_utils_base.py", line 2825, in batch_encode_plus
    return self._batch_encode_plus(
  File "/Users/lucazeve/Coding/WxCC_Sentiment_Analysis/venv/lib/python3.10/site-packages/transformers/tokenization_utils_fast.py", line 428, in _batch_encode_plus
    encodings = self._tokenizer.encode_batch(
TypeError: TextEncodeInput must be Union[TextInputSequence, Tuple[InputSequence, InputSequence]]
```
Is the documentation wrong? I just need a way to tokenize and predict in batches; it shouldn't be this hard.

Is this related to the `is_split_into_words` argument?

I will feed this into a sentiment scoring model (the one defined in the snippet). I run into OOM issues at prediction time, so I need to feed the data to the model in batches.

The documentation (quoted above) states that I can pass a `List[List[str]]` to the tokenizer, but that does not seem to be the case. The question remains the same: how do I tokenize batches of sentences?

Note: I don't strictly need the tokenization itself to be done in batches (although it would produce batched input_ids / attention_mask tensors, which would solve my problem); what I need is to run batched predictions with the model, like this:
```python
with torch.no_grad():
    logits = model(**tokenized_test).logits
```
If it is just about tokenizing a list of sentences, do this:
```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained('distilbert-base-uncased-finetuned-sst-2-english')

test = ["hello this is a test", "that transforms a list of sentences", "into a list of list of sentences", "in order to emulate, in this case, two batches of the same lenght", "to be tokenized by the hf tokenizer for the defined model"]

tokenizer(test)
```
It handles the batching automatically:
```python
{'input_ids': [
  [101, 7592, 2023, 2003, 1037, 3231, 102],
  [101, 2008, 21743, 1037, 2862, 1997, 11746, 102],
  [101, 2046, 1037, 2862, 1997, 2862, 1997, 11746, 102],
  [101, 1999, 2344, 2000, 7861, 9869, 1010, 1999, 2023, 2553, 1010, 2048, 14108, 2229, 1997, 1996, 2168, 18798, 13900, 102],
  [101, 2000, 2022, 19204, 3550, 2011, 1996, 1044, 2546, 19204, 17629, 2005, 1996, 4225, 2944, 102]],

 'attention_mask': [[1, 1, 1, 1, 1, 1, 1], [1, 1, 1, 1, 1, 1, 1, 1], [1, 1, 1, 1, 1, 1, 1, 1, 1], [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1], [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]]}
```
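As for the `[test, test]` structure from the question: with plain (not pre-tokenized) strings the tokenizer does not take a `List[List[str]]`, so one simple workaround, just a sketch of my own and not a tokenizer feature, is to call the tokenizer once per inner list:

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained('distilbert-base-uncased-finetuned-sst-2-english')

test = ["hello this is a test", "that transforms a list of sentences"]
batches = [test, test]  # List[List[str]]: one inner list per batch of plain strings

# Tokenize each inner batch separately; is_split_into_words stays False because
# the inner elements are whole sentences, not pre-split words.
encoded_batches = [tokenizer(batch, padding=True, truncation=True, return_tensors="pt")
                   for batch in batches]
```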
To use it with `AutoModelForSequenceClassification.from_pretrained('distilbert-base-uncased-finetuned-sst-2-english')`, it goes like this:
```python
from transformers import AutoTokenizer, AutoModelForSequenceClassification

tokenizer = AutoTokenizer.from_pretrained('distilbert-base-uncased-finetuned-sst-2-english')
model = AutoModelForSequenceClassification.from_pretrained('distilbert-base-uncased-finetuned-sst-2-english')

test = ["hello this is a test", "that transforms a list of sentences", "into a list of list of sentences", "in order to emulate, in this case, two batches of the same lenght", "to be tokenized by the hf tokenizer for the defined model"]

model(**tokenizer(test, return_tensors='pt', padding=True, truncation=True))
```
[out]:
```python
SequenceClassifierOutput(loss=None, logits=tensor([[ 1.5094, -1.2056],
        [-3.4114,  3.5229],
        [ 1.8835, -1.6886],
        [ 3.0780, -2.5745],
        [ 2.5383, -2.1984]], grad_fn=<AddmmBackward0>), hidden_states=None, attentions=None)
```
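If you want label names and scores from those logits yourself (rather than via `pipeline`), here is a minimal sketch, continuing from the snippet above and assuming the checkpoint's built-in `id2label` mapping:

```python
import torch

# Reuse `model`, `tokenizer`, and `test` from the snippet above.
with torch.no_grad():
    logits = model(**tokenizer(test, return_tensors='pt', padding=True, truncation=True)).logits

# Softmax the logits and map each argmax index to its label name.
probs = torch.softmax(logits, dim=-1)
preds = probs.argmax(dim=-1)
results = [{"label": model.config.id2label[p.item()], "score": probs[i, p].item()}
           for i, p in enumerate(preds)]
print(results)
```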
And to do sentiment classification with the `distilbert-base-uncased-finetuned-sst-2-english` model?

```python
from transformers import AutoTokenizer, AutoModelForSequenceClassification, pipeline

tokenizer = AutoTokenizer.from_pretrained('distilbert-base-uncased-finetuned-sst-2-english')
model = AutoModelForSequenceClassification.from_pretrained('distilbert-base-uncased-finetuned-sst-2-english')

classifier = pipeline('sentiment-analysis', model=model, tokenizer=tokenizer)

text = ['hello this is a test',
        'that transforms a list of sentences',
        'into a list of list of sentences',
        'in order to emulate, in this case, two batches of the same lenght',
        'to be tokenized by the hf tokenizer for the defined model']

classifier(text)
```
[out]:
```python
[{'label': 'NEGATIVE', 'score': 0.9379092454910278},
 {'label': 'POSITIVE', 'score': 0.9990271329879761},
 {'label': 'NEGATIVE', 'score': 0.9726701378822327},
 {'label': 'NEGATIVE', 'score': 0.9965035915374756},
 {'label': 'NEGATIVE', 'score': 0.9913086891174316}]
```
If it is just `distilbert-base-uncased-finetuned-sst-2-english`, you should be fine using only the CPU; you won't run into many OOM issues with it.

If you need to use a GPU, consider running inference with `pipeline(...)`, which comes with a `batch_size` option, e.g.

```python
from transformers import AutoTokenizer, AutoModelForSequenceClassification, pipeline

tokenizer = AutoTokenizer.from_pretrained('distilbert-base-uncased-finetuned-sst-2-english')
model = AutoModelForSequenceClassification.from_pretrained('distilbert-base-uncased-finetuned-sst-2-english')

classifier = pipeline('sentiment-analysis', model=model, tokenizer=tokenizer)

text = ['hello this is a test',
        'that transforms a list of sentences',
        'into a list of list of sentences',
        'in order to emulate, in this case, two batches of the same lenght',
        'to be tokenized by the hf tokenizer for the defined model']

classifier(text, batch_size=2, truncation="only_first")
```
When you face OOM issues, it is usually not the tokenizer causing the problem, unless you load the full large dataset onto the device.

If it is just that the model cannot predict when you feed it a large dataset, consider using `pipeline` instead of `model(**tokenizer(text))`; see https://huggingface.co/docs/transformers/main_classes/pipelines#pipeline-batching
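If you still want to stay with `model(**tokenizer(...))` rather than `pipeline`, here is a minimal manual-batching sketch (the `batch_size` value and the chunking loop are my own choices, not something prescribed by the linked docs):

```python
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

tokenizer = AutoTokenizer.from_pretrained('distilbert-base-uncased-finetuned-sst-2-english')
model = AutoModelForSequenceClassification.from_pretrained('distilbert-base-uncased-finetuned-sst-2-english')
model.eval()

text = ['hello this is a test',
        'that transforms a list of sentences',
        'into a list of list of sentences',
        'in order to emulate, in this case, two batches of the same lenght',
        'to be tokenized by the hf tokenizer for the defined model']

batch_size = 2  # tune this to whatever fits in memory
all_logits = []

with torch.no_grad():
    for i in range(0, len(text), batch_size):
        # Tokenize and predict one small chunk at a time to keep memory bounded.
        batch = text[i:i + batch_size]
        encoded = tokenizer(batch, padding=True, truncation=True, return_tensors="pt")
        all_logits.append(model(**encoded).logits)

logits = torch.cat(all_logits)  # shape: (len(text), num_labels)
```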
If the question is about the `is_split_into_words` argument, then from the documentation:

> text (str, List[str], List[List[str]], optional): The sequence or batch of sequences to be encoded. Each sequence can be a string or a list of strings (pretokenized string). If the sequences are provided as a list of strings (pretokenized), you must set is_split_into_words=True (to lift the ambiguity with a batch of sequences).

and from the code:
```python
if is_split_into_words:
    is_batched = isinstance(text, (list, tuple)) and text and isinstance(text[0], (list, tuple))
else:
    is_batched = isinstance(text, (list, tuple))
```
If we check whether your input counts as `is_batched`:
```python
text = ["hello", "this", "is a test"]
isinstance(text, (list, tuple)) and text and isinstance(text[0], (list, tuple))
```
[out]:
```python
False
```
But when you wrap the tokens in a list:
```python
text = [["hello", "this", "is a test"]]
isinstance(text, (list, tuple)) and text and isinstance(text[0], (list, tuple))
```
[out]:
```python
True
```
So using the tokenizer with `is_split_into_words=True` such that batching works properly would look like this:
```python
from transformers import AutoTokenizer
from sacremoses import MosesTokenizer

moses = MosesTokenizer()
sentences = ["this is a test", "hello world"]
pretokenized_sents = [moses.tokenize(s) for s in sentences]

tokenizer = AutoTokenizer.from_pretrained('distilbert-base-uncased-finetuned-sst-2-english')

tokenizer(
    text=pretokenized_sents,
    padding="max_length",
    is_split_into_words=True,
    truncation=True,
    return_tensors="pt"
)
```
[out]:
```python
{'input_ids': tensor([[ 101, 2023, 2003,  ...,    0,    0,    0],
        [ 101, 7592, 2088,  ...,    0,    0,    0]]), 'attention_mask': tensor([[1, 1, 1,  ..., 0, 0, 0],
        [1, 1, 1,  ..., 0, 0, 0]])}
```
Note: the `is_split_into_words` argument is not meant for handling batches of sentences; it is for specifying that the tokenizer's input has already been pre-tokenized.
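As a side illustration (not part of the original answer), pre-tokenized input is useful when you need to map subword tokens back to your own words, which fast tokenizers expose via `word_ids()`:

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained('distilbert-base-uncased-finetuned-sst-2-english')

words = ["this", "is", "a", "test"]
encoded = tokenizer(words, is_split_into_words=True)

# word_ids() returns, for every subword token, the index of the original word
# it came from (None for special tokens like [CLS] and [SEP]).
print(encoded.word_ids())   # e.g. [None, 0, 1, 2, 3, None]
print(encoded.tokens())     # e.g. ['[CLS]', 'this', 'is', 'a', 'test', '[SEP]']
```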