Huggingface summarization of long documents

Question

Mit*_*ops 4 python huggingface-transformers

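I would expect summarization tasks to generally assume long documents. However, following the documentation here, every simple summarization call I make reports that my document is too long: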

>>> summarizer = pipeline("summarization")
>>> summarizer(fulltext)
Token indices sequence length is longer than the specified maximum sequence length for this model (5620 > 1024). Running this sequence through the model will result in indexing errors

>>> summarizer = pipeline("summarization", model="facebook/bart-large-cnn")
>>> summary = summarizer(fulltext)
Token indices sequence length is longer than the specified maximum sequence length for this model (8084 > 1024). Running this sequence through the model will result in indexing errors

>>> summarizer = pipeline("summarization", model="t5-base", tokenizer="t5-base")
>>> summary = summarizer(fulltext)
Token indices sequence length is longer than the specified maximum sequence length for this model (5971 > 512). Running this sequence through the model will result in indexing errors


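Which model or configuration choice makes this most automatic? I have read other questions that suggest manually chunking the data or truncating it, but the choice of boundaries and chunk length seems likely to affect the summaries. What is the best practice for an arbitrarily long document? (Unbounded would be great, but let's say at least 50,000 tokens.)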

Answer 1

cod*_*ord 11

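I am assuming that a minimum length of 50k tokens means you are trying to summarize something as big as a novel. Unfortunately, we do not yet have a model that can process that much data at once, mostly because the memory footprint of such a model would be too high to use in production. That said, Pegasus (Google), Longformer, and Reformer are all viable options for summarizing long documents, and research on models that can handle longer sequences without consuming a lot of resources is ongoing. Reformer, for example, is highly optimized to handle a large number of tokens: https://huggingface.co/blog/reformer.

For now, the best practice is a "divide and conquer" approach: chunk your data, using the model's maximum length as the reference for the chunk size. You can even do this iteratively, summarizing the summaries, until you reach the summary length you want. You can also explore different summarization methods, such as extractive and abstractive summarization, and get creative in combining them, for example extractive summarization followed by abstractive summarization.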
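To make the "divide and conquer" idea concrete, here is a minimal sketch of it, not a definitive implementation. The names `chunk_tokens`, `summarize_long`, and `hf_backend` are hypothetical helpers made up for illustration. The core loop is model-agnostic and takes the summarizer as a plain callable; `hf_backend` shows one way it might be wired to a `transformers` summarization pipeline, assuming `facebook/bart-large-cnn` and its 1024-token limit from the question.

```python
from typing import Callable, List


def chunk_tokens(tokens: List[str], max_tokens: int) -> List[List[str]]:
    """Split a token list into consecutive chunks of at most max_tokens."""
    if max_tokens <= 0:
        raise ValueError("max_tokens must be positive")
    return [tokens[i:i + max_tokens] for i in range(0, len(tokens), max_tokens)]


def summarize_long(
    text: str,
    summarize_fn: Callable[[str], str],
    max_tokens: int,
    target_tokens: int,
    tokenize: Callable[[str], List[str]] = str.split,
    detokenize: Callable[[List[str]], str] = " ".join,
    max_rounds: int = 10,
) -> str:
    """Summarize chunk by chunk, then repeat on the joined summaries
    until the text fits in target_tokens (or stops shrinking)."""
    for _ in range(max_rounds):
        tokens = tokenize(text)
        if len(tokens) <= target_tokens:
            break
        chunks = chunk_tokens(tokens, max_tokens)
        new_text = " ".join(summarize_fn(detokenize(chunk)) for chunk in chunks)
        if len(tokenize(new_text)) >= len(tokens):
            break  # the summarizer is no longer shortening the text
        text = new_text
    return text


def hf_backend(model_name: str = "facebook/bart-large-cnn"):
    """Hypothetical wiring to a Hugging Face pipeline (downloads the model)."""
    from transformers import AutoTokenizer, pipeline  # imported lazily

    tokenizer = AutoTokenizer.from_pretrained(model_name)
    summarizer = pipeline("summarization", model=model_name, tokenizer=tokenizer)

    def summarize_fn(chunk: str) -> str:
        # truncation=True is a guard in case a re-tokenized chunk
        # ends up slightly over the model limit.
        return summarizer(chunk, truncation=True)[0]["summary_text"]

    return summarize_fn, tokenizer.tokenize, tokenizer.convert_tokens_to_string
```

Usage might look like the following, with `max_tokens` kept a little below the model limit to leave room for special tokens:

    summarize_fn, tok, detok = hf_backend()
    summary = summarize_long(fulltext, summarize_fn, max_tokens=1000,
                             target_tokens=300, tokenize=tok, detokenize=detok)

Note that this cuts chunks at fixed token offsets, so a boundary can fall in the middle of a sentence. As the question points out, boundary choice can affect the result; splitting on sentence or paragraph boundaries and packing those into chunks is a reasonable refinement.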
