Is there a better way to extract text from a PDF byte array using PDFTextStripper?

Question

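I have the byte array of a PDF file and want to get the text out of it. My code below works, but it has to write the bytes to an actual file first. Do you know a better way, so that I don't have to create this file?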

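try {
  // Write the decoded bytes to a temporary file so PDFBox can load them
  File temp = File.createTempFile("temp-pdf", ".tmp");
  try (OutputStream out = new FileOutputStream(temp)) {
    out.write(Base64.decodeBase64(testObject.getPdfAsDoc().getContent()));
  }
  // Close the document once the text has been extracted
  try (PDDocument document = PDDocument.load(temp)) {
    PDFTextStripper pdfStripper = new PDFTextStripper();
    String text = pdfStripper.getText(document);
    log.info(text);
  }
} catch (IOException e) {
  // Report the failure instead of silently swallowing it
  log.info("Could not extract text from PDF: " + e.getMessage());
}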

Answer 1

mkl*_*mkl 6

The answer depends on which PDFBox version you are using.

PDFBox 2.0.x

Whenever you have a byte[] (which you seem to get from Base64.decodeBase64), you can load it directly:

byte[] documentBytes = Base64.decodeBase64(testObject.getPdfAsDoc().getContent());
PDDocument document = PDDocument.load(documentBytes);
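
Since the bytes never touch the disk in this approach, it can help to sanity-check them in memory before handing them to PDFBox. The following is only a minimal sketch using the JDK alone: the class name PdfBytes and both method names are hypothetical, it uses java.util.Base64 rather than the Base64.decodeBase64 helper from the question, and it assumes a well-formed PDF that begins with the "%PDF-" header marker.

```java
import java.nio.charset.StandardCharsets;
import java.util.Base64;

// Hypothetical helper, not part of PDFBox.
class PdfBytes {
    // Assumption: a well-formed PDF begins with "%PDF-" followed by a version number.
    private static final byte[] PDF_MARKER = "%PDF-".getBytes(StandardCharsets.US_ASCII);

    /** Decodes Base64 text into the raw document bytes, entirely in memory. */
    static byte[] decode(String base64Content) {
        return Base64.getDecoder().decode(base64Content);
    }

    /** Returns true if the bytes begin with the PDF header marker. */
    static boolean looksLikePdf(byte[] bytes) {
        if (bytes == null || bytes.length < PDF_MARKER.length) {
            return false;
        }
        for (int i = 0; i < PDF_MARKER.length; i++) {
            if (bytes[i] != PDF_MARKER[i]) {
                return false;
            }
        }
        return true;
    }
}
```

A check like this only catches obviously wrong input, such as an empty or mis-decoded payload; it says nothing about whether the rest of the file is a valid PDF.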

PDFBox 1.8.x

Whenever you have a byte[], you can load it via a ByteArrayInputStream:

byte[] documentBytes = Base64.decodeBase64(testObject.getPdfAsDoc().getContent());
InputStream documentStream = new ByteArrayInputStream(documentBytes);
PDDocument document = PDDocument.load(documentStream);

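By the way: when using PDFBox 1.8.x you should use the loadNonSeq overloads instead of the load overloads, because load does not load PDFs as specified and can therefore be tricked into reading the wrong content. If a PDF is broken, though, you can still try load as a fallback.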
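
The "try one loader, fall back to another" idea can be sketched without depending on PDFBox itself. The names PdfLoadFallback, StreamLoader and loadWithFallback below are hypothetical, and the sketch assumes a failed load is signalled by an IOException. Each attempt gets a fresh ByteArrayInputStream over the same bytes, because the first attempt may already have consumed part of its stream.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;

// Hypothetical helper, not part of PDFBox.
class PdfLoadFallback {
    /** Anything that turns a stream into a loaded document of type T. */
    interface StreamLoader<T> {
        T load(InputStream in) throws IOException;
    }

    /**
     * Tries the preferred loader first. If it fails with an IOException,
     * retries with the fallback loader on a fresh stream over the same bytes.
     */
    static <T> T loadWithFallback(byte[] bytes, StreamLoader<T> preferred, StreamLoader<T> fallback)
            throws IOException {
        try {
            return preferred.load(new ByteArrayInputStream(bytes));
        } catch (IOException preferredFailure) {
            try {
                return fallback.load(new ByteArrayInputStream(bytes));
            } catch (IOException fallbackFailure) {
                // Keep the first failure visible for diagnosis
                fallbackFailure.addSuppressed(preferredFailure);
                throw fallbackFailure;
            }
        }
    }
}
```

With PDFBox 1.8.x on the classpath, a call might look like loadWithFallback(documentBytes, in -> PDDocument.loadNonSeq(in, null), in -> PDDocument.load(in)). Passing null for the scratch-file argument of loadNonSeq is an assumption here, so check the Javadoc of your exact 1.8.x version.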
