如何使用java从pdf文件中获取原始文本

use*_*542 25 java pdf pdfbox

我有一些pdf文件,使用pdfbox我已将它们转换为文本并存储到文本文件中,现在从我要删除的文本文件

  1. 超链接
  2. 所有特殊字符
  3. 空白行
  4. 标题页脚的pdf文件
  5. "1)","2)","a)","子弹"等

我希望逐行获得有效的文本,如下所示:

我们提出了OntoGain,一种从纯文本中提取的多词概念术语进行本体学习的方法.OntoGain遵循由不同处理层定义的本体学习过程.在简单术语提取的基础上,通过聚类提取的概念来形成概念层次结构.然后,衍生的术语分类法充满了非分类关系.已经研究了几种不同的最先进的方法来实现每一层.OntoGain基于多词术语概念,因为多词或复合词具有比普通单词词更加坚实和独特的语义.我们选择了层次聚类方法和形式概念分析(FCA)算法来构建术语分类法.此外,应用关联规则算法来揭示非分类关系.还实现了一种尝试在关系概念之间执行最合适的泛化级别的方法.为了显示概念证明,实现了系统原型.OntoGain允许使用Jena Semantic Web Frame-work1将派生的本体转换为OWL.OntoGain应用于医学和计算机语料库两个独立的数据源,并将其结果与Text2Onto(一种最先进的本体学习方法)获得的类似结果进行比较.对11.5 CCD1.1结果的分析表明,OntoGain在精度方面比Text20nto表现更好,提取更正确的概念,而更有选择性地提取更少但更合理的概念.

我怎样才能做到这一点?

SAN*_*NN3 33

使用pdfbox我们可以实现这个目标

示例:

public static void main(String args[]) {

    PDFParser parser = null;
    PDDocument pdDoc = null;
    COSDocument cosDoc = null;
    PDFTextStripper pdfStripper;

    String parsedText;
    String fileName = "E:\\Files\\Small Files\\PDF\\JDBC.pdf";
    File file = new File(fileName);
    try {
        parser = new PDFParser(new FileInputStream(file));
        parser.parse();
        cosDoc = parser.getDocument();
        pdfStripper = new PDFTextStripper();
        pdDoc = new PDDocument(cosDoc);
        parsedText = pdfStripper.getText(pdDoc);
        System.out.println(parsedText.replaceAll("[^A-Za-z0-9. ]+", ""));
    } catch (Exception e) {
        e.printStackTrace();
        try {
            if (cosDoc != null)
                cosDoc.close();
            if (pdDoc != null)
                pdDoc.close();
        } catch (Exception e1) {
            e1.printStackTrace();
        }

    }
}
Run Code Online (Sandbox Code Playgroud)

  • parser = new PDFParser(new FileInputStream(file)); 给出错误 (3认同)
  • 对于较新版本:/sf/answers/4115903571/ (3认同)
  • 对于那些看这个的人来说,加载文档的最新方法是`pdDoc = PDDocument.load(new File(fileName)); pdfStripper = new PDFTextStripper();` (2认同)

Joh*_*lip 19

您好我们可以使用Apache Tika提取pdf文件

例子是:

import java.io.IOException;
import java.io.InputStream;
import java.util.HashMap;
import java.util.Map;
import org.apache.http.HttpEntity;
import org.apache.http.HttpResponse;
import org.apache.http.client.methods.HttpGet;
import org.apache.http.impl.client.DefaultHttpClient;
import org.apache.tika.metadata.Metadata;
import org.apache.tika.metadata.TikaCoreProperties;
import org.apache.tika.parser.AutoDetectParser;
import org.apache.tika.parser.ParseContext;
import org.apache.tika.sax.BodyContentHandler;

public class WebPagePdfExtractor {

    public Map<String, Object> processRecord(String url) {
        DefaultHttpClient httpclient = new DefaultHttpClient();
        Map<String, Object> map = new HashMap<String, Object>();
        try {
            HttpGet httpGet = new HttpGet(url);
            HttpResponse response = httpclient.execute(httpGet);
            HttpEntity entity = response.getEntity();
            InputStream input = null;
            if (entity != null) {
                try {
                    input = entity.getContent();
                    BodyContentHandler handler = new BodyContentHandler();
                    Metadata metadata = new Metadata();
                    AutoDetectParser parser = new AutoDetectParser();
                    ParseContext parseContext = new ParseContext();
                    parser.parse(input, handler, metadata, parseContext);
                    map.put("text", handler.toString().replaceAll("\n|\r|\t", " "));
                    map.put("title", metadata.get(TikaCoreProperties.TITLE));
                    map.put("pageCount", metadata.get("xmpTPg:NPages"));
                    map.put("status_code", response.getStatusLine().getStatusCode() + "");
                } catch (Exception e) {
                    e.printStackTrace();
                } finally {
                    if (input != null) {
                        try {
                            input.close();
                        } catch (IOException e) {
                            e.printStackTrace();
                        }
                    }
                }
            }
        } catch (Exception exception) {
            exception.printStackTrace();
        }
        return map;
    }

    public static void main(String arg[]) {
        WebPagePdfExtractor webPagePdfExtractor = new WebPagePdfExtractor();
        Map<String, Object> extractedMap = webPagePdfExtractor.processRecord("http://math.about.com/library/q20.pdf");
        System.out.println(extractedMap.get("text"));
    }

}
Run Code Online (Sandbox Code Playgroud)

  • 刚刚尝试了 Tika,我可以轻松解析 pdf、powerpoint、文本文件、excel、csv 等,所有这些都只需几行适用于所有内容的代码。A+ 库,绝对推荐。 (3认同)
  • 注意:Apache Tika的PDF功能是Apache pdfbox的精简包装。 (2认同)

Ula*_*aga 13

您可以使用iText来做这些事情

//iText imports

import com.itextpdf.text.pdf.PdfReader;
import com.itextpdf.text.pdf.parser.PdfTextExtractor;
Run Code Online (Sandbox Code Playgroud)

例如:

try {     
    PdfReader reader = new PdfReader(INPUTFILE);
    int n = reader.getNumberOfPages(); 
    String str=PdfTextExtractor.getTextFromPage(reader, 2); //Extracting the content from a particular page.
    System.out.println(str);
    reader.close();
} catch (Exception e) {
    System.out.println(e);
}
Run Code Online (Sandbox Code Playgroud)

另一个

try {

    PdfReader reader = new PdfReader("c:/temp/test.pdf");
    System.out.println("This PDF has "+reader.getNumberOfPages()+" pages.");
    String page = PdfTextExtractor.getTextFromPage(reader, 2);
    System.out.println("Page Content:\n\n"+page+"\n\n");
    System.out.println("Is this document tampered: "+reader.isTampered());
    System.out.println("Is this document encrypted: "+reader.isEncrypted());
} catch (IOException e) {
    e.printStackTrace();
}
Run Code Online (Sandbox Code Playgroud)

上面的示例只能提取文本,但您需要做更多工作来删除超链接,项目符号,标题和数字.


Jat*_*tin 10

对于较新版本的Apache pdfbox这是原始来源的示例

/*
 * Licensed to the Apache Software Foundation (ASF) under one or more
 * contributor license agreements.  See the NOTICE file distributed with
 * this work for additional information regarding copyright ownership.
 * The ASF licenses this file to You under the Apache License, Version 2.0
 * (the "License"); you may not use this file except in compliance with
 * the License.  You may obtain a copy of the License at
 *
 *      http://www.apache.org/licenses/LICENSE-2.0
 *
 * Unless required by applicable law or agreed to in writing, software
 * distributed under the License is distributed on an "AS IS" BASIS,
 * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
 * See the License for the specific language governing permissions and
 * limitations under the License.
 */
package org.apache.pdfbox.examples.util;

import java.io.File;
import java.io.IOException;
import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.pdmodel.encryption.AccessPermission;
import org.apache.pdfbox.text.PDFTextStripper;

/**
 * This is a simple text extraction example to get started. For more advance usage, see the
 * ExtractTextByArea and the DrawPrintTextLocations examples in this subproject, as well as the
 * ExtractText tool in the tools subproject.
 *
 * @author Tilman Hausherr
 */
public class ExtractTextSimple
{
    private ExtractTextSimple()
    {
        // example class should not be instantiated
    }

    /**
     * This will print the documents text page by page.
     *
     * @param args The command line arguments.
     *
     * @throws IOException If there is an error parsing or extracting the document.
     */
    public static void main(String[] args) throws IOException
    {
        if (args.length != 1)
        {
            usage();
        }

        try (PDDocument document = PDDocument.load(new File(args[0])))
        {
            AccessPermission ap = document.getCurrentAccessPermission();
            if (!ap.canExtractContent())
            {
                throw new IOException("You do not have permission to extract text");
            }

            PDFTextStripper stripper = new PDFTextStripper();

            // This example uses sorting, but in some cases it is more useful to switch it off,
            // e.g. in some files with columns where the PDF content stream respects the
            // column order.
            stripper.setSortByPosition(true);

            for (int p = 1; p <= document.getNumberOfPages(); ++p)
            {
                // Set the page interval to extract. If you don't, then all pages would be extracted.
                stripper.setStartPage(p);
                stripper.setEndPage(p);

                // let the magic happen
                String text = stripper.getText(document);

                // do some nice output with a header
                String pageStr = String.format("page %d:", p);
                System.out.println(pageStr);
                for (int i = 0; i < pageStr.length(); ++i)
                {
                    System.out.print("-");
                }
                System.out.println();
                System.out.println(text.trim());
                System.out.println();

                // If the extracted text is empty or gibberish, please try extracting text
                // with Adobe Reader first before asking for help. Also read the FAQ
                // on the website: 
                // https://pdfbox.apache.org/2.0/faq.html#text-extraction
            }
        }
    }

    /**
     * This will print the usage for this document.
     */
    private static void usage()
    {
        System.err.println("Usage: java " + ExtractTextSimple.class.getName() + " <input-pdf>");
        System.exit(-1);
    }
}
Run Code Online (Sandbox Code Playgroud)