Extracting text from scanned PDF files with Apache Tika

Question

Lor*_*ert 9 java pdf ocr tesseract apache-tika

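I'm having some trouble with Apache Tika (version 1.10). I have some PDF files that are just scanned sheets of paper, which means every page is simply an image. My goal is to extract the text from these PDF files.

My Tesseract setup is correct, and extracting text from JPG and PNG files works like a charm. The code I use looks like this (never mind the missing exception handling):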

public String extractText(InputStream stream) {
    AutoDetectParser parser = new AutoDetectParser();
    BodyContentHandler handler = new BodyContentHandler(Integer.MAX_VALUE);
    Metadata metadata = new Metadata();
    ParseContext context = new ParseContext();
    parser.parse(stream, handler, metadata, context);
    String text = handler.toString();
    return text;
}


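I have searched a lot, but I could not find any solution that works for me. I already tried the setExtractInlineImages method of the PDFParserConfig class, but that did not change anything. Extracting embedded documents with a custom ParsingEmbeddedDocumentExtractor did extract the embedded resources of doc files, but not those of my PDF files.

If any of you could offer some help, that would be awesome :)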

Answer 1

Lor*_*ert 14

Tim Allison provided the solution:

Parser parser = new AutoDetectParser();
BodyContentHandler handler = new BodyContentHandler(Integer.MAX_VALUE);

TesseractOCRConfig config = new TesseractOCRConfig();
PDFParserConfig pdfConfig = new PDFParserConfig();
pdfConfig.setExtractInlineImages(true);

ParseContext parseContext = new ParseContext();
parseContext.set(TesseractOCRConfig.class, config);
parseContext.set(PDFParserConfig.class, pdfConfig);
parseContext.set(Parser.class, parser); //need to add this to make sure recursive parsing happens!

parser.parse(stream, handler, new Metadata(), parseContext);


This works for me :)

Edit: here is the complete solution:

import org.apache.tika.exception.TikaException;
import org.apache.tika.metadata.Metadata;
import org.apache.tika.parser.AutoDetectParser;
import org.apache.tika.parser.ParseContext;
import org.apache.tika.parser.Parser;
import org.apache.tika.parser.ocr.TesseractOCRConfig;
import org.apache.tika.parser.pdf.PDFParserConfig;
import org.apache.tika.sax.BodyContentHandler;
import org.xml.sax.SAXException;

import java.io.FileInputStream;
import java.io.IOException;

/**
 * @since 8/26/16
 */
public class Sample {
    public static void main(String[] args)
            throws IOException, TikaException, SAXException {
        Parser parser = new AutoDetectParser();
        BodyContentHandler handler = new BodyContentHandler(Integer.MAX_VALUE);

        TesseractOCRConfig config = new TesseractOCRConfig();
        PDFParserConfig pdfConfig = new PDFParserConfig();
        pdfConfig.setExtractInlineImages(true);

        ParseContext parseContext = new ParseContext();
        parseContext.set(TesseractOCRConfig.class, config);
        parseContext.set(PDFParserConfig.class, pdfConfig);
        //need to add this to make sure recursive parsing happens!
        parseContext.set(Parser.class, parser);

        Metadata metadata = new Metadata();
        // try-with-resources closes the file even if parsing throws
        try (FileInputStream stream = new FileInputStream("samplepdf.pdf")) {
            parser.parse(stream, handler, metadata, parseContext);
        }
        System.out.println(metadata);
        String content = handler.toString();
        System.out.println("===============");
        System.out.println(content);
        System.out.println("Done");
    }
}
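The solution above only yields text if Tika can actually launch the Tesseract executable, so when the output stays empty it can help to rule that out first. Below is a minimal sketch of such a check using only the JDK. The class name `TesseractCheck` and the method `isOnPath` are made up for illustration and are not part of Tika. It assumes the binary is called `tesseract`, that it accepts `--version`, and that you run Java 9 or later (for `Redirect.DISCARD`).

```java
import java.io.IOException;
import java.util.concurrent.TimeUnit;

// Hypothetical helper, not part of Tika: checks whether an executable can be started.
final class TesseractCheck {

    /** Returns true if "<executable> --version" starts and exits with code 0. */
    static boolean isOnPath(String executable) {
        ProcessBuilder pb = new ProcessBuilder(executable, "--version");
        pb.redirectErrorStream(true);
        // Throw the output away so the child can never block on a full pipe (Java 9+).
        pb.redirectOutput(ProcessBuilder.Redirect.DISCARD);
        try {
            Process process = pb.start();
            if (!process.waitFor(10, TimeUnit.SECONDS)) {
                process.destroyForcibly(); // hung process: treat as unavailable
                return false;
            }
            return process.exitValue() == 0;
        } catch (IOException e) {
            // Typically means the executable was not found on the PATH.
            return false;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return false;
        }
    }

    public static void main(String[] args) {
        System.out.println("tesseract available: " + isOnPath("tesseract"));
    }
}
```

If this prints `false`, the problem is likely in the Tesseract installation or the PATH of the Java process rather than in the Tika configuration.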

Maven dependencies:

<dependencies>
    <dependency>
        <groupId>org.apache.tika</groupId>
        <artifactId>tika-parsers</artifactId>
        <version>1.13</version>
    </dependency>
    <dependency>
        <groupId>com.levigo.jbig2</groupId>
        <artifactId>levigo-jbig2-imageio</artifactId>
        <version>1.6.5</version>
    </dependency>
</dependencies>
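The second dependency is probably there because scanned PDFs often store their page images with JBIG2 compression, and the JDK's ImageIO ships no JBIG2 reader of its own. The sketch below checks at runtime whether a reader is registered. `ImageReaderCheck` and `hasReaderFor` are illustrative names, and the format name `"JBIG2"` is an assumption about how the plugin registers itself.

```java
import javax.imageio.ImageIO;

// Hypothetical helper: reports whether any ImageIO plugin can read a given format.
final class ImageReaderCheck {

    /** Returns true if at least one registered ImageIO reader claims the format name. */
    static boolean hasReaderFor(String formatName) {
        return ImageIO.getImageReadersByFormatName(formatName).hasNext();
    }

    public static void main(String[] args) {
        // "JBIG2" is the assumed format name of the levigo plugin.
        System.out.println("JBIG2 reader available: " + hasReaderFor("JBIG2"));
        // PNG support is built into the JDK, so this serves as a sanity check.
        System.out.println("PNG reader available:   " + hasReaderFor("png"));
    }
}
```

Run it with the same classpath as the extraction code. If the JBIG2 line reports `false` even though the dependency is declared, the jar is probably not reaching the runtime classpath.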

