Huggingface Transformers [MarianMT]: strange results when translating larger texts

Tef*_*foD 3 python translation huggingface-transformers huggingface-tokenizers

I need to translate a large amount of text from a database, so I have been working with transformers and models for a few days now. I am by no means a data-science expert and, unfortunately, I haven't gotten much further on my own.


The problems start with longer texts. On top of that comes the usual maximum token limit of the model (512), and simply truncating is not really an option. I did find a workaround for this, but it doesn't work properly: the result is word salad for longer texts (> 300 tokens).


Here is an example (please ignore the warnings, that is a separate issue and not that serious at the moment).


If I use the example sentence 2 times (55 tokens) or 5 times (163 tokens), there is no problem.


But it gets messed up with, for example, 433 tokens (the third green text block in the screenshot).


[Screenshot: console output with the translated text blocks]


For more than 510 tokens I tried splitting the input into chunks, as described in the link above. But the results there are strange as well.


I am fairly sure I have made more than one mistake here and have underestimated the topic. But I don't see any other (free/cheap) way to translate large amounts of text.


Could you help me out? Which (thinking) errors do you see, and how would you suggest fixing them? Many thanks.


[Screenshot: console output]

from transformers import AutoTokenizer, AutoModelForSeq2SeqLM
import torch

if torch.cuda.is_available():
    dev = "cuda"
else:
    dev = "cpu"
device = torch.device(dev)

mname = 'Helsinki-NLP/opus-mt-de-en'
tokenizer = AutoTokenizer.from_pretrained(mname)
model = AutoModelForSeq2SeqLM.from_pretrained(mname)
model.to(device)

chunksize = 512

text_short = "Nach nur sieben Seiten appellierte man an die Wählerinnen und Wähler, sich richtig zu entscheiden, nämlich für Frieden, Freiheit, Sozialismus. "
text_long = text_short
# this loop is just for debugging/testing and simulating long text
for x in range(30):
    text_long = text_long + text_short

tokens = tokenizer.encode_plus(text_long, return_tensors="pt", add_special_tokens=True, padding=False, truncation=False).to(device)
str_len = len(tokens['input_ids'][0])

if str_len > 510:
    # split into chunks of 510 tokens, we also convert to list (default is tuple which is immutable)
    input_id_chunks = list(tokens['input_ids'][0].split(chunksize - 2))
    mask_chunks = list(tokens['attention_mask'][0].split(chunksize - 2))

    cnt = 1
    for tensor in input_id_chunks:
        print('\033[96m' + 'chunk ' + str(cnt) + ': ' + str(len(tensor)) + '\033[93m')
        cnt += 1

    # loop through each chunk
    # https://towardsdatascience.com/how-to-apply-transformers-to-any-length-of-text-a5601410af7f
    for i in range(len(input_id_chunks)):
        # add CLS and SEP tokens to input IDs
        input_id_chunks[i] = torch.cat([
            torch.tensor([101]).to(device), input_id_chunks[i], torch.tensor([102]).to(device)
        ])
        # add attention tokens to attention mask
        mask_chunks[i] = torch.cat([
            torch.tensor([1]).to(device), mask_chunks[i], torch.tensor([1]).to(device)
        ])
        # get required padding length
        pad_len = chunksize - input_id_chunks[i].shape[0]
        # check if tensor length satisfies required chunk size
        if pad_len > 0:
            # if padding length is more than 0, we must add padding
            input_id_chunks[i] = torch.cat([
                input_id_chunks[i], torch.Tensor([0] * pad_len).to(device)
            ])
            mask_chunks[i] = torch.cat([
                mask_chunks[i], torch.Tensor([0] * pad_len).to(device)
            ])

    input_ids = torch.stack(input_id_chunks)
    attention_mask = torch.stack(mask_chunks)
    input_dict = {'input_ids': input_ids.long(), 'attention_mask': attention_mask.int()}

    outputs = model.generate(**input_dict)
    # this doesn't work - following error comes to the console --> "host_softmax" not implemented for 'Long'
    # probs = torch.nn.functional.softmax(outputs[0], dim=-1)
    # probs
    # probs = probs.mean(dim=0)
    # probs

else:
    tokens["input_ids"] = tokens["input_ids"][:, :512]  # truncating normally not necessary
    tokens["attention_mask"] = tokens["attention_mask"][:, :512]
    outputs = model.generate(**tokens)

decoded = tokenizer.decode(outputs[0], skip_special_tokens=True)
print('\033[94m' + str(str_len))
print('\033[92m' + decoded)
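As a reference point for the hard-coded IDs 101 and 102 used above (those are BERT's [CLS]/[SEP] conventions), the special tokens a checkpoint actually defines can be printed straight from its tokenizer. A minimal sketch; the exact values depend on the checkpoint:

from transformers import AutoTokenizer

# sketch: inspect which special tokens this checkpoint actually uses
tokenizer = AutoTokenizer.from_pretrained('Helsinki-NLP/opus-mt-de-en')
print(tokenizer.all_special_tokens, tokenizer.all_special_ids)
print('pad:', tokenizer.pad_token_id, 'eos:', tokenizer.eos_token_id)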

Comment: the following libraries are required:


pip3 install torch==1.9.0+cu102 torchvision==0.10.0+cu102 torchaudio===0.9.0 -f https://download.pytorch.org/whl/torch_stable.html


pip install transformers


pip install sentencepiece
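After installing, a quick sanity check of the environment can save time; this is just a sketch:

import torch
import transformers

# print library versions and whether CUDA is usable
print(torch.__version__, transformers.__version__)
print('CUDA available:', torch.cuda.is_available())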


Ros*_*nyi 6

To translate long texts with transformers, you can split the text into paragraphs, split the paragraphs into sentences, and then feed the sentences to the model in batches. With MarianMT it is best to translate sentence by sentence anyway, because it may drop parts of the text if you feed it a long text as a whole.

from transformers import MarianMTModel, MarianTokenizer
from nltk.tokenize import sent_tokenize
from nltk.tokenize import LineTokenizer
import math
import torch

if torch.cuda.is_available():
    dev = "cuda"
else:
    dev = "cpu"
device = torch.device(dev)

mname = 'Helsinki-NLP/opus-mt-de-en'
tokenizer = MarianTokenizer.from_pretrained(mname)
model = MarianMTModel.from_pretrained(mname)
model.to(device)

lt = LineTokenizer()
batch_size = 8

text_short = "Nach nur sieben Seiten appellierte man an die Wählerinnen und Wähler, sich richtig zu entscheiden, nämlich für Frieden, Freiheit, Sozialismus. "
text_long = text_short * 30

paragraphs = lt.tokenize(text_long)
translated_paragraphs = []

for paragraph in paragraphs:
    sentences = sent_tokenize(paragraph)
    batches = math.ceil(len(sentences) / batch_size)
    translated = []
    for i in range(batches):
        sent_batch = sentences[i*batch_size:(i+1)*batch_size]
        model_inputs = tokenizer(sent_batch, return_tensors="pt", padding=True, truncation=True, max_length=500).to(device)
        with torch.no_grad():
            translated_batch = model.generate(**model_inputs)
        translated += translated_batch
    translated = [tokenizer.decode(t, skip_special_tokens=True) for t in translated]
    translated_paragraphs += [" ".join(translated)]

translated_text = "\n".join(translated_paragraphs)
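Note that sent_tokenize needs the NLTK Punkt data on first use (nltk.download('punkt')). For the original database use case, the same logic could be wrapped in a small helper and called per row; the following is only a sketch, and translate_text is an illustrative name, not part of the answer's code:

import math
import torch
from nltk.tokenize import LineTokenizer, sent_tokenize
from transformers import MarianMTModel, MarianTokenizer

def translate_text(text, model, tokenizer, device, batch_size=8):
    # same approach as above: paragraphs -> sentences -> sentence batches
    lt = LineTokenizer()
    translated_paragraphs = []
    for paragraph in lt.tokenize(text):
        sentences = sent_tokenize(paragraph)
        translated = []
        for i in range(math.ceil(len(sentences) / batch_size)):
            batch = sentences[i * batch_size:(i + 1) * batch_size]
            inputs = tokenizer(batch, return_tensors="pt", padding=True,
                               truncation=True, max_length=500).to(device)
            with torch.no_grad():
                translated += model.generate(**inputs)
        translated_paragraphs.append(" ".join(
            tokenizer.decode(t, skip_special_tokens=True) for t in translated))
    return "\n".join(translated_paragraphs)

# usage with the same checkpoint as above; "rows" stands in for the database texts
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
tokenizer = MarianTokenizer.from_pretrained('Helsinki-NLP/opus-mt-de-en')
model = MarianMTModel.from_pretrained('Helsinki-NLP/opus-mt-de-en').to(device)
rows = ["Nach nur sieben Seiten appellierte man an die Wählerinnen und Wähler, sich richtig zu entscheiden."]
for row in rows:
    print(translate_text(row, model, tokenizer, device))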