How to extract text from a PDF file with Apache PDFBox

Question

I would like to extract text from a given PDF file with Apache PDFBox.

I wrote this code:

PDFTextStripper pdfStripper = null;
PDDocument pdDoc = null;
COSDocument cosDoc = null;
File file = new File(filepath);

PDFParser parser = new PDFParser(new FileInputStream(file));
parser.parse();
cosDoc = parser.getDocument();
pdfStripper = new PDFTextStripper();
pdDoc = new PDDocument(cosDoc);
pdfStripper.setStartPage(1);
pdfStripper.setEndPage(5);
String parsedText = pdfStripper.getText(pdDoc);
System.out.println(parsedText);

However, I get the following error:

Exception in thread "main" java.lang.NullPointerException
at org.apache.fontbox.afm.AFMParser.main(AFMParser.java:304)

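I added pdfbox-1.8.5.jar and fontbox-1.8.5.jar to the classpath.

Edit

I added System.out.println("program starts"); at the very beginning of the program.

I ran it and got the same error as above, and "program starts" never appeared in the console.

So I think something is wrong with my classpath.

Thanks.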
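
The only frame in the stack trace is AFMParser.main, and "program starts" is never printed, which suggests the JVM is launching FontBox's AFMParser as the main class rather than the asker's own class. A minimal sketch for checking this, assuming a hypothetical class name ClasspathCheck and the PDFBox/FontBox 1.8.x package names: run it with the same classpath and launch settings as the failing program and see what it reports.

```java
import java.security.CodeSource;

public class ClasspathCheck {
    /** Returns where the named class was loaded from, or a short note if that cannot be determined. */
    static String locationOf(String className) {
        try {
            // initialize=false: only locate the class, do not run its static initializers
            Class<?> cls = Class.forName(className, false, ClasspathCheck.class.getClassLoader());
            CodeSource src = cls.getProtectionDomain().getCodeSource();
            return src == null ? "(JDK/bootstrap class)" : String.valueOf(src.getLocation());
        } catch (ClassNotFoundException | LinkageError e) {
            return "(not found on classpath)";
        }
    }

    public static void main(String[] args) {
        // If this line does not show up, a different main class is being launched.
        System.out.println("program starts");
        System.out.println("java.class.path = " + System.getProperty("java.class.path"));
        // Package names as used by PDFBox/FontBox 1.8.x, the versions named in the question.
        System.out.println("PDFTextStripper: " + locationOf("org.apache.pdfbox.util.PDFTextStripper"));
        System.out.println("AFMParser:       " + locationOf("org.apache.fontbox.afm.AFMParser"));
    }
}
```

If "program starts" is still missing here, the run configuration (the main class passed to java or selected in the IDE) is the first thing to look at, not the PDFBox code.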

Answer 1

Mat*_*aun 35

Using PDFBox 2.0.7, this is how I get the text of a PDF:

static String getText(File pdfFile) throws IOException {
    // try-with-resources closes the document even if extraction fails
    try (PDDocument doc = PDDocument.load(pdfFile)) {
        return new PDFTextStripper().getText(doc);
    }
}

Call it like this:

try {
    String text = getText(new File("/home/me/test.pdf"));
    System.out.println("Text in PDF: " + text);
} catch (IOException e) {
    e.printStackTrace();
}

As user oivemaria brought up in the comments:

You can use PDFBox in your application by adding it to the dependencies in build.gradle:

dependencies {
  // 'compile' was removed in Gradle 7.0; use 'implementation' there
  compile group: 'org.apache.pdfbox', name: 'pdfbox', version: '2.0.7'
}

Here is more on dependency management with Gradle.

If you want to keep the PDF's layout in the parsed text, give PDFLayoutTextStripper a try.

Answer 2

小智 34

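I ran your code and it works fine. Maybe your problem is related to the file path you are passing in. I put my PDF on the C drive and hard-coded the file path. Here is my code: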

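// Targets PDFBox 2.0.x, where PDFParser requires an org.apache.pdfbox.io.RandomAccessRead
// (the PDFParser(InputStream) constructor from 1.8.x no longer exists).
import java.io.File;
import java.io.IOException;
import org.apache.pdfbox.cos.COSDocument;
import org.apache.pdfbox.io.RandomAccessFile;
import org.apache.pdfbox.pdfparser.PDFParser;
import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.text.PDFTextStripper;

public class PDFReader {
    public static void main(String[] args) throws IOException {
        File file = new File("C:/my.pdf");
        PDFParser parser = new PDFParser(new RandomAccessFile(file, "r"));
        parser.parse();
        // Closing the COSDocument releases the underlying file.
        try (COSDocument cosDoc = parser.getDocument()) {
            PDDocument pdDoc = new PDDocument(cosDoc);
            PDFTextStripper pdfStripper = new PDFTextStripper();
            pdfStripper.setStartPage(1);   // pages are 1-based
            pdfStripper.setEndPage(5);     // inclusive
            String parsedText = pdfStripper.getText(pdDoc);
            System.out.println(parsedText);
        }
    }
}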

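With pdfbox 2.0.5, passing a FileInputStream to PDFParser fails to compile: java.io.FileInputStream cannot be converted to org.apache.pdfbox.io.RandomAccessRead (5 upvotes)
The constructor PDFParser(FileInputStream) is undefined, and casting the stream to org.apache.pdfbox.io.RandomAccessRead gives an error (2 upvotes)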
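
Since this answer suspects the file path, it can help to rule that out before any PDFBox code runs. A minimal sketch using only the JDK; the class name PdfPathCheck and the method describeProblem are made up for illustration, and "C:/my.pdf" is just the path from the answer above.

```java
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

public class PdfPathCheck {
    /** Returns null if the path looks usable, otherwise a short description of the problem. */
    static String describeProblem(String filepath) {
        // Resolve against the working directory so relative paths are reported unambiguously
        Path path = Paths.get(filepath).toAbsolutePath();
        if (!Files.exists(path)) {
            return "no such file: " + path;
        }
        if (Files.isDirectory(path)) {
            return "is a directory: " + path;
        }
        if (!Files.isReadable(path)) {
            return "not readable: " + path;
        }
        return null;
    }

    public static void main(String[] args) {
        String filepath = args.length > 0 ? args[0] : "C:/my.pdf";
        String problem = describeProblem(filepath);
        System.out.println(problem == null ? "OK: " + filepath : problem);
    }
}
```

A bad path would normally surface as an IOException such as FileNotFoundException rather than the NullPointerException in the question, so this only eliminates one possible cause.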

Answer 3

son*_*s21 5

PDFBox 2.0.3 also comes with a command-line tool.

Download the jar file, then run:
java -jar pdfbox-app-2.0.3.jar ExtractText [OPTIONS] <inputfile> [output-text-file]

Options:
  -password  <password>        : Password to decrypt document
  -encoding  <output encoding> : UTF-8 (default) or ISO-8859-1, UTF-16BE, UTF-16LE, etc.
  -console                     : Send text to console instead of file
  -html                        : Output in HTML format instead of raw text
  -sort                        : Sort the text before writing
  -ignoreBeads                 : Disables the separation by beads
  -debug                       : Enables debug output about the time consumption of every stage
  -startPage <number>          : The first page to start extraction(1 based)
  -endPage <number>            : The last page to extract(inclusive)
  <inputfile>                  : The PDF document to use
  [output-text-file]           : The file to write the text to
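
If the tool needs to be driven from Java rather than a shell, the usage line above can be assembled as an argument list. This is only a sketch: the class ExtractTextCommand and its build method are hypothetical helpers, the file names are placeholders, and only the -startPage and -endPage flags from the listing above are used.

```java
import java.util.ArrayList;
import java.util.List;

public class ExtractTextCommand {
    /** Builds the argument list for the ExtractText tool; page numbers are 1-based and inclusive. */
    static List<String> build(String jarPath, String inputPdf, String outputTxt,
                              int startPage, int endPage) {
        List<String> cmd = new ArrayList<>();
        cmd.add("java");
        cmd.add("-jar");
        cmd.add(jarPath);
        cmd.add("ExtractText");
        cmd.add("-startPage");
        cmd.add(Integer.toString(startPage));
        cmd.add("-endPage");
        cmd.add(Integer.toString(endPage));
        cmd.add(inputPdf);   // options go before the input file, as in the usage line
        cmd.add(outputTxt);
        return cmd;
    }

    public static void main(String[] args) {
        List<String> cmd = build("pdfbox-app-2.0.3.jar", "test.pdf", "test.txt", 1, 5);
        System.out.println(String.join(" ", cmd));
        // To actually run it (assumes java and the jar are reachable from the working directory):
        // new ProcessBuilder(cmd).inheritIO().start().waitFor();
    }
}
```

Passing the arguments as a list avoids quoting problems with file names that contain spaces.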

Asked:	11 years, 8 months ago
Viewed:	65564 times
Last active:	6 years, 8 months ago