Posts by Jon*_*han

Remove invisible text from a PDF using PDFBox

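When I try to extract text from the PDF above, I get a mix of text that is invisible in the evince viewer and text that is visible. In addition, some of the text I want is missing characters that are not missing in the viewer, such as the "S" in "FALCONS" and many "½" characters. I believe this is caused by interference from the invisible text, because when the PDF is highlighted in the viewer, the invisible text can be seen overlapping the visible text.

Is there a way to remove the invisible text? Or is there another solution?

Code: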

import java.io.File;
import java.io.IOException;

import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.text.PDFTextStripper;


public class App {

    public static String getPdfText(String pdfPath) throws IOException {
        File file = new File(pdfPath);
        PDDocument document = null;
        PDFTextStripper textStripper = null;
        String text = null;

        try {
            document = PDDocument.load(file);
            textStripper = new PDFTextStripper();
            textStripper.setEndPage(1);
            text = textStripper.getText(document);
        } catch (IOException e) {
            throw new IOException("Could not load file and strip text.", e);
        } finally {
            try {
                if (document != null)
                    document.close();
            } …

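One possible direction, sketched under assumptions: if the unwanted text is invisible because it is drawn with PDF text rendering mode 3 ("neither fill nor stroke"), it could be dropped before the extracted text is assembled. In PDFBox 2.x the natural place for such a check would be an override of `PDFTextStripper.processTextPosition(TextPosition)` that inspects `getGraphicsState().getTextState().getRenderingMode()` and skips glyphs whose mode is `RenderingMode.NEITHER`. The block below does not call PDFBox at all. It only models that filtering decision with a hypothetical `TextRun` type so it runs on its own, and it would not catch text hidden by clipping paths or covered by other drawing, which may well be what is happening in this PDF.

```java
import java.util.ArrayList;
import java.util.List;

public class VisibleTextFilter {

    /** PDF text rendering mode 3: glyphs are neither filled nor stroked, so nothing is painted. */
    public static final int RENDERING_MODE_INVISIBLE = 3;

    /**
     * Hypothetical stand-in (not a PDFBox class) for one run of extracted text
     * together with the text rendering mode that was active when it was shown.
     */
    public static final class TextRun {
        final String text;
        final int renderingMode;

        public TextRun(String text, int renderingMode) {
            this.text = text;
            this.renderingMode = renderingMode;
        }
    }

    /** Assumption: only rendering mode 3 is treated as invisible in this sketch. */
    public static boolean isVisible(TextRun run) {
        return run.renderingMode != RENDERING_MODE_INVISIBLE;
    }

    /** Concatenates the visible runs in order and silently skips the invisible ones. */
    public static String visibleText(List<TextRun> runs) {
        StringBuilder sb = new StringBuilder();
        for (TextRun run : runs) {
            if (isVisible(run)) {
                sb.append(run.text);
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        List<TextRun> runs = new ArrayList<>();
        runs.add(new TextRun("FALCON", 0)); // mode 0: filled, visible
        runs.add(new TextRun("xx", 3));     // mode 3: invisible, overlaps the visible text
        runs.add(new TextRun("S", 0));
        System.out.println(visibleText(runs)); // prints "FALCONS"
    }
}
```

With PDFBox itself, the same check would presumably sit inside a `PDFTextStripper` subclass, with `super.processTextPosition(text)` called only for glyphs that pass it. Whether that is enough depends on how this particular PDF hides its text.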

java pdfbox

Jon*_*han

lucky-day

1
Votes

1
Answers

1065
Views
