从 PDFBox 中剥离时的文本坐标

Sam*_*lla 5 java pdfbox

我正在尝试使用 PDFBox 从 pdf 文件中提取带有坐标的文本。

我混合了在互联网上找到的一些方法/信息(也是stackoverflow),但是我的坐标问题似乎不正确。例如,当我尝试使用坐标在 tex 顶部绘制矩形时,该矩形会在其他地方绘制。

这是我的代码(请不要判断风格,写得很快只是为了测试)

文本行.java

    import java.util.List;
    import org.apache.pdfbox.text.TextPosition;

    /**
     *
     * @author samue
     */
    public class TextLine {
        public List<TextPosition> textPositions = null;
        public String text = "";
    }
Run Code Online (Sandbox Code Playgroud)

myStripper.java

    import java.io.IOException;
    import java.util.ArrayList;
    import java.util.List;
    import org.apache.pdfbox.pdmodel.PDDocument;
    import org.apache.pdfbox.pdmodel.PDPage;
    import org.apache.pdfbox.text.PDFTextStripper;
    import org.apache.pdfbox.text.TextPosition;

    /*
     * To change this license header, choose License Headers in Project Properties.
     * To change this template file, choose Tools | Templates
     * and open the template in the editor.
     */

    /**
     *
     * @author samue
     */
    public class myStripper extends PDFTextStripper {
        public myStripper() throws IOException
        {
        }

        @Override
        protected void startPage(PDPage page) throws IOException
        {
            startOfLine = true;
            super.startPage(page);
        }

        @Override
        protected void writeLineSeparator() throws IOException
        {
            startOfLine = true;
            super.writeLineSeparator();
        }

        @Override
        public String getText(PDDocument doc) throws IOException
        {
            lines = new ArrayList<TextLine>();
            return super.getText(doc);
        }

        @Override
        protected void writeWordSeparator() throws IOException
        {
            TextLine tmpline = null;

            tmpline = lines.get(lines.size() - 1);
            tmpline.text += getWordSeparator();

            super.writeWordSeparator();
        }


        @Override
        protected void writeString(String text, List<TextPosition> textPositions) throws IOException
        {
            TextLine tmpline = null;

            if (startOfLine) {
                tmpline = new TextLine();
                tmpline.text = text;
                tmpline.textPositions = textPositions;
                lines.add(tmpline);
            } else {
                tmpline = lines.get(lines.size() - 1);
                tmpline.text += text;
                tmpline.textPositions.addAll(textPositions);
            }

            if (startOfLine)
            {
                startOfLine = false;
            }
            super.writeString(text, textPositions);
        }

        boolean startOfLine = true;
        public ArrayList<TextLine> lines = null;

    }
Run Code Online (Sandbox Code Playgroud)

AWT 按钮上的单击事件

 private void jButton1MouseClicked(java.awt.event.MouseEvent evt) {                                      
    // TODO add your handling code here:
    try {
        File file = new File("C:\\Users\\samue\\Desktop\\mwb_I_201711.pdf");
        PDDocument doc = PDDocument.load(file);

        myStripper stripper = new myStripper();

        stripper.setStartPage(1); // fix it to first page just to test it
        stripper.setEndPage(1);
        stripper.getText(doc);

        TextLine line = stripper.lines.get(1); // the line i want to paint on

        float minx = -1;
        float maxx = -1;

        for (TextPosition pos: line.textPositions)
        {
            if (pos == null)
                continue;

            if (minx == -1 || pos.getTextMatrix().getTranslateX() < minx) {
                minx = pos.getTextMatrix().getTranslateX();
            }
            if (maxx == -1 || pos.getTextMatrix().getTranslateX() > maxx) {
                maxx = pos.getTextMatrix().getTranslateX();
            }
        }

        TextPosition firstPosition = line.textPositions.get(0);
        TextPosition lastPosition = line.textPositions.get(line.textPositions.size() - 1);

        float x = minx;
        float y = firstPosition.getTextMatrix().getTranslateY();
        float w = (maxx - minx) + lastPosition.getWidth();
        float h = lastPosition.getHeightDir();

        PDPageContentStream contentStream = new PDPageContentStream(doc, doc.getPage(0), PDPageContentStream.AppendMode.APPEND, false);

        contentStream.setNonStrokingColor(Color.RED);
        contentStream.addRect(x, y, w, h);
        contentStream.fill();
        contentStream.close();

        File fileout = new File("C:\\Users\\samue\\Desktop\\pdfbox.pdf");
        doc.save(fileout);
        doc.close();
    } catch (Exception ex) {

    }
}                                     
Run Code Online (Sandbox Code Playgroud)

有什么建议吗?我究竟做错了什么?

mkl*_*mkl 5

这只是过度PdfTextStripper坐标归一化的另一种情况。就像你一样,我曾认为通过使用TextPosition.getTextMatrix()(而不是getX()and getY)可以获得实际坐标,但是不,即使这些矩阵值也必须纠正(至少在 PDFBox 2.0.x 中,我还没有检查 1.8.x)因为矩阵乘以平移,使得裁剪框的左下角成为原点。

因此,在您的情况下(其中裁剪框的左下角不是原点),您必须更正这些值,例如通过替换

        float x = minx;
        float y = firstPosition.getTextMatrix().getTranslateY();
Run Code Online (Sandbox Code Playgroud)

经过

        PDRectangle cropBox = doc.getPage(0).getCropBox();

        float x = minx + cropBox.getLowerLeftX();
        float y = firstPosition.getTextMatrix().getTranslateY() + cropBox.getLowerLeftY();
Run Code Online (Sandbox Code Playgroud)

代替

不加修正

你现在得到

带 x,y 校正

但显然,您还必须稍微修正高度。PdfTextStripper这是由于确定文本高度的方式所致:

    // 1/2 the bbox is used as the height todo: why?
    float glyphHeight = bbox.getHeight() / 2;
Run Code Online (Sandbox Code Playgroud)

(来自showGlyph(...)LegacyPDFStreamEngine父类PdfTextStripper

虽然字体边界框通常确实太大,但一半通常是不够的。