Identifying page breaks from a byte[]

Roh*_*dar 0 java arrays pdfbox

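I have a use case where I download a large file by writing its bytes to a ServletOutputStream, and I would like to return only certain specified pages, without loading the whole file into memory and without using a library.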

  1. Is it possible to identify the page break from the byte stream?
  2. If so, what would be the correct approach?

Edit: The file is created and stored using Apache PDFBox.

mkl*_*mkl 5

Is it possible to identify the page break from the byte stream?

No. For the simple reason that there is no page break in the byte stream.

PDF files contain numerous objects (fonts, colorspaces, bitmaps, ...) which can be used on multiple pages. In some PDFs all pages even share all resources. Thus, you don't have a section in the PDF byte array used for a page and only that page.

Furthermore, those objects are referenced via cross reference streams or tables by their offset in the file. So only serving the regions of the byte stream that are needed for some given pages cannot work to start with as the offsets would be wrong then.
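The offset problem can be illustrated with a minimal Java sketch. It assumes a classic cross reference table whose position is recorded after the `startxref` keyword, and it uses a toy, PDF-like byte array rather than a real PDF. The class and method names (`XrefOffsetDemo`, `readStartXref`, `pointsAtXref`, `toyFile`) are made up for this illustration and are not part of PDFBox or any other library.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

public class XrefOffsetDemo {

    /** Returns the byte offset recorded after the last "startxref" keyword, or -1 if none is found. */
    public static long readStartXref(byte[] pdf) {
        // ISO-8859-1 maps every byte to exactly one char, so string indices equal byte offsets.
        String text = new String(pdf, StandardCharsets.ISO_8859_1);
        int keyword = text.lastIndexOf("startxref");
        if (keyword < 0) {
            return -1;
        }
        int i = keyword + "startxref".length();
        while (i < text.length() && Character.isWhitespace(text.charAt(i))) {
            i++;
        }
        int start = i;
        while (i < text.length() && Character.isDigit(text.charAt(i))) {
            i++;
        }
        return start == i ? -1 : Long.parseLong(text.substring(start, i));
    }

    /** True if the bytes at the given offset spell the "xref" keyword. */
    public static boolean pointsAtXref(byte[] pdf, long offset) {
        byte[] keyword = "xref".getBytes(StandardCharsets.ISO_8859_1);
        if (offset < 0 || offset + keyword.length > pdf.length) {
            return false;
        }
        for (int k = 0; k < keyword.length; k++) {
            if (pdf[(int) offset + k] != keyword[k]) {
                return false;
            }
        }
        return true;
    }

    /** Builds a toy, PDF-like byte array (NOT a valid PDF) whose trailer records the xref offset. */
    public static byte[] toyFile() {
        String body = "%PDF-1.4\n1 0 obj\n<< /Type /Catalog >>\nendobj\n";
        int xrefOffset = body.length(); // ASCII only, so char count equals byte count
        String tail = "xref\n0 1\n0000000000 65535 f \ntrailer\n<< /Size 1 >>\n"
                + "startxref\n" + xrefOffset + "\n%%EOF\n";
        return (body + tail).getBytes(StandardCharsets.ISO_8859_1);
    }

    public static void main(String[] args) {
        byte[] whole = toyFile();
        System.out.println("complete file: " + pointsAtXref(whole, readStartXref(whole)));

        // Drop the first 9 bytes, as if only a "region" of the file were served.
        byte[] cut = Arrays.copyOfRange(whole, 9, whole.length);
        System.out.println("cut file:      " + pointsAtXref(cut, readStartXref(cut)));
    }
}
```

In the complete toy file the recorded offset lands on `xref`; once leading bytes are dropped, the same recorded offset points at unrelated bytes. Real PDFs store such an offset for every object, so the same effect applies throughout the file.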

Theoretically one could determine the regions in a PDF byte stream which are not used by those given pages and transfer 0s instead. If you employ some transport compression, these regions would compress quite well. But to determine those regions, you'd need a PDF library, which you don't want to use.
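As a rough illustration of why zeroed regions are cheap to transfer, here is a small Java sketch. It uses GZIP from the JDK as a stand-in for whatever transport compression is actually in place, and pseudo-random bytes as a stand-in for already-compressed PDF content. The region boundaries are arbitrary; in practice they would have to come from a PDF library. The class and method names are hypothetical.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.util.Arrays;
import java.util.Random;
import java.util.zip.GZIPOutputStream;

public class ZeroRegionCompressionDemo {

    /** Returns the number of bytes the data occupies after GZIP compression. */
    public static int gzipSize(byte[] data) {
        ByteArrayOutputStream sink = new ByteArrayOutputStream();
        try (GZIPOutputStream gzip = new GZIPOutputStream(sink)) {
            gzip.write(data);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return sink.size();
    }

    /** Returns a copy of the data with bytes in [from, to) replaced by zeros; the length is unchanged. */
    public static byte[] zeroOut(byte[] data, int from, int to) {
        byte[] copy = data.clone();
        Arrays.fill(copy, from, to, (byte) 0);
        return copy;
    }

    public static void main(String[] args) {
        // Assumption: pseudo-random bytes approximate incompressible file content.
        byte[] file = new byte[1_000_000];
        new Random(42).nextBytes(file);

        // Assumption: bytes 100,000 to 900,000 are not needed for the requested pages.
        byte[] masked = zeroOut(file, 100_000, 900_000);

        System.out.println("original, gzipped: " + gzipSize(file) + " bytes");
        System.out.println("masked, gzipped:   " + gzipSize(masked) + " bytes");
    }
}
```

Because the masked array has the same length as the original, all offsets in the remaining regions stay valid, while the zeroed span shrinks to almost nothing under compression.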

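Alternatively, there is a special way to save PDF files that is optimized for partial file access (files saved this way are called "linearized"). This does not help you, though, because PDFBox does not offer to save PDFs like that, and because making use of this optimization requires support for HTTP ranges, which servlet containers and servlets themselves seldom provide.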


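IMO your best option is to change the generation of the large files so that it produces the smaller files you want instead of (or in addition to) the large file.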