dev*_*1ce 5 nlp huggingface-transformers
I am trying to generate summaries of long PDFs. First, I convert the PDF to text with the pdfminer.six library. Then I use the two functions provided in the discussion here.
Code:
import nltk
from transformers import BartModel, BartTokenizer

bart_tokenizer = BartTokenizer.from_pretrained("facebook/bart-large")
bart_model = BartModel.from_pretrained("facebook/bart-large", return_dict=True)

# generate chunks of sentences, each under 1024 characters
# (len() counts characters, a rough proxy for the 1024-token limit)
def nest_sentences(document):
    nested = []
    sent = []
    length = 0
    for sentence in nltk.sent_tokenize(document):
        length += len(sentence)
        if length < 1024:
            sent.append(sentence)
        else:
            nested.append(sent)
            sent = [sentence]
            length = len(sentence)
    if sent:
        nested.append(sent)
    return nested

# generate a summary for each chunk; the tokenizer truncates anything over the model limit
def generate_summary(nested_sentences):
    device = 'cuda'
    summaries = []
    for nested in nested_sentences:
        input_tokenized = bart_tokenizer.encode(' '.join(nested), truncation=True, return_tensors='pt')
        input_tokenized = input_tokenized.to(device)
        summary_ids = bart_model.to(device).generate(
            input_tokenized,
            length_penalty=3.0,
            min_length=30,
            max_length=100,
        )
        output = [bart_tokenizer.decode(g, skip_special_tokens=True, clean_up_tokenization_spaces=False) for g in summary_ids]
        summaries.append(output)
    # flatten the per-chunk lists into a single list of summary strings
    summaries = [sentence for sublist in summaries for sentence in sublist]
    return summaries
Then, to get the summaries, I do this:
nested_sentences = nest_sentences(text)
where text is a string about 10K long that I extracted from the PDF with the library above.
summary = generate_summary(nested_sentences)
I then get the following error:
---------------------------------------------------------------------------
AttributeError Traceback (most recent call last)
<ipython-input-15-d5aa7709bb5f> in <module>()
----> 1 summary = generate_summary(nested_sentences)
3 frames
<ipython-input-11-8554509269e0> in generate_summary(nested_sentences)
28 length_penalty=3.0,
29 min_length=30,
---> 30 max_length=100,
31 )
32 output = [bart_tokenizer.decode(g, skip_special_tokens=True, clean_up_tokenization_spaces=False) for g in summary_ids]
/usr/local/lib/python3.7/dist-packages/torch/autograd/grad_mode.py in decorate_context(*args, **kwargs)
26 def decorate_context(*args, **kwargs):
27 with self.__class__():
---> 28 return func(*args, **kwargs)
29 return cast(F, decorate_context)
30
/usr/local/lib/python3.7/dist-packages/transformers/generation_utils.py in generate(self, input_ids, max_length, min_length, do_sample, early_stopping, num_beams, temperature, top_k, top_p, repetition_penalty, bad_words_ids, bos_token_id, pad_token_id, eos_token_id, length_penalty, no_repeat_ngram_size, encoder_no_repeat_ngram_size, num_return_sequences, max_time, max_new_tokens, decoder_start_token_id, use_cache, num_beam_groups, diversity_penalty, prefix_allowed_tokens_fn, output_attentions, output_hidden_states, output_scores, return_dict_in_generate, forced_bos_token_id, forced_eos_token_id, remove_invalid_values, synced_gpus, **model_kwargs)
1061 return_dict_in_generate=return_dict_in_generate,
1062 synced_gpus=synced_gpus,
-> 1063 **model_kwargs,
1064 )
1065
/usr/local/lib/python3.7/dist-packages/transformers/generation_utils.py in beam_search(self, input_ids, beam_scorer, logits_processor, stopping_criteria, max_length, pad_token_id, eos_token_id, output_attentions, output_hidden_states, output_scores, return_dict_in_generate, synced_gpus, **model_kwargs)
1799 continue # don't waste resources running the code we don't need
1800
-> 1801 next_token_logits = outputs.logits[:, -1, :]
1802
1803 # hack: adjust tokens for Marian. For Marian we have to make sure that the `pad_token_id`
AttributeError: 'Seq2SeqModelOutput' object has no attribute 'logits'
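I could not find anything related to this error, so I would be very grateful if someone could help, or point me to a better way of generating summaries for long texts.
Thank you in advance!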
小智 10
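The problem here is the BartModel class. Switch it to the BartForConditionalGeneration class and the error goes away. Essentially, the generation utilities assume a model that can be used for language generation, and BartModel is just the base model without the LM head, so its output has no logits.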
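To illustrate the difference, here is a minimal sketch. It is not the asker's setup: it assumes a tiny, randomly initialised BartConfig with made-up sizes and token ids, so that it runs without downloading facebook/bart-large. The generated ids are therefore meaningless; the point is only which output object carries `logits`.

```python
import torch
from transformers import BartConfig, BartForConditionalGeneration, BartModel

torch.manual_seed(0)

# Assumption: hypothetical tiny config, chosen only so the sketch runs quickly
# and offline. A real run would load pretrained weights instead.
config = BartConfig(
    vocab_size=100,
    d_model=16,
    encoder_layers=1,
    decoder_layers=1,
    encoder_attention_heads=2,
    decoder_attention_heads=2,
    encoder_ffn_dim=32,
    decoder_ffn_dim=32,
    max_position_embeddings=64,
)

# Made-up token ids: 0 and 2 are the default BART bos/eos ids.
input_ids = torch.tensor([[0, 10, 11, 12, 2]])

base_model = BartModel(config).eval()                    # base model, no LM head
lm_model = BartForConditionalGeneration(config).eval()   # base model + LM head

with torch.no_grad():
    base_out = base_model(input_ids=input_ids)   # Seq2SeqModelOutput: hidden states only
    lm_out = lm_model(input_ids=input_ids)       # Seq2SeqLMOutput: includes logits
    # Beam search reads outputs.logits at every step, so it needs the LM head.
    summary_ids = lm_model.generate(input_ids, num_beams=2, max_length=10)

print(hasattr(base_out, "logits"))   # whether the base output exposes logits
print(tuple(lm_out.logits.shape))    # (batch, sequence length, vocab size)
print(tuple(summary_ids.shape))      # (batch, generated length)
```

In the question's code, the equivalent change would be to load the checkpoint with `BartForConditionalGeneration.from_pretrained("facebook/bart-large")` in place of `BartModel.from_pretrained(...)` and leave the rest of `generate_summary` as it is.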