Extracting PDF text from table rows of differing heights (Java, using the PDFBox library)

A. *_*oza 3 java rectangles pdfbox

The black shapes are the text that needs to be extracted:

[image: table page in which black shapes mark the text to extract]

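So far I have extracted the text from the columns, but I did it manually because there are only 5 of them (using the Rectangle class for the regions). My question is: is there a way to do this for the rows as well, given that the rectangles differ in size (height) and doing it by hand for 50+ rows would be an atrocity? More specifically, can I use a function that adjusts the rectangle to the height of each row? Or are there any other suggestions that might help?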

mkl*_*mkl 6

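As suggested in the comments, you can recognize the table cell regions of your sample PDF automatically by parsing the vector graphics instructions of the page.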

For such a task you can extend the PDFBox PDFGraphicsStreamEngine, which provides abstract methods that are called for path building and drawing instructions.

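Beware: the stream engine class shown here is specialized in recognizing table cell frame lines drawn as long, thin rectangles filled with black, as used in your sample document. For a general solution you should at least also recognize frame lines drawn as vector graphics line segments or as stroked rectangles.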
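To illustrate the more general case mentioned in the note, here is a minimal sketch of how a frame line drawn as a stroked segment could be classified from its end points. Everything in it is an assumption for illustration: the class `SegmentClassifier`, its methods, and the factor of 10 used to decide whether a segment is "fairly parallel" to an axis are hypothetical and not part of PDFBox. In a real stream engine the end points would come from the `moveTo` and `lineTo` callbacks and the line width from the graphics state; that wiring is omitted here.

```java
import java.awt.geom.Point2D;

/** Hypothetical helper; not part of PDFBox or of PdfBoxFinder. */
final class SegmentClassifier {
    enum Orientation { HORIZONTAL, VERTICAL, ASKEW }

    /**
     * A segment counts as horizontal if its extent along y is less than a
     * tenth of its extent along x, and as vertical in the opposite case.
     * Zero-length segments are reported as ASKEW.
     */
    static Orientation classify(Point2D start, Point2D end) {
        double dx = Math.abs(end.getX() - start.getX());
        double dy = Math.abs(end.getY() - start.getY());
        if (dy * 10 < dx) {
            return Orientation.HORIZONTAL;
        }
        if (dx * 10 < dy) {
            return Orientation.VERTICAL;
        }
        return Orientation.ASKEW;
    }

    /**
     * Returns {from, to}: the y range covered by a horizontal stroked
     * segment, or the x range covered by a vertical one, widened by half
     * the line width on each side.
     */
    static double[] crossRange(Point2D start, Point2D end, double lineWidth) {
        Orientation orientation = classify(start, end);
        if (orientation == Orientation.ASKEW) {
            throw new IllegalArgumentException("segment is not axis-parallel");
        }
        boolean horizontal = orientation == Orientation.HORIZONTAL;
        double a = horizontal ? start.getY() : start.getX();
        double b = horizontal ? end.getY() : end.getX();
        double half = lineWidth / 2;
        return new double[] { Math.min(a, b) - half, Math.max(a, b) + half };
    }
}
```

The resulting range plays the same role as the coordinate ranges collected from thin filled rectangles, so both kinds of frame lines could feed the same horizontal and vertical lists.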

The stream engine class PdfBoxFinder

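This stream engine class collects the y coordinate ranges of horizontal lines and the x coordinate ranges of vertical lines, and then provides the boxes of the grid defined by these coordinate ranges. In particular, this means that row spans and column spans are not supported; in the case at hand that is fine because there are no such spans.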
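The grid idea can be sketched in isolation, without PDFBox. The sketch below assumes the line ranges are already sorted (horizontal lines bottom to top, vertical lines left to right) and builds one box per pair of neighbouring lines, so each row gets whatever height its two bounding lines imply. The class `GridSketch` and the `R1C1` style names are hypothetical choices for this illustration only.

```java
import java.awt.geom.Rectangle2D;
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical, PDFBox-free illustration of building grid boxes. */
final class GridSketch {
    /**
     * @param ys {from, to} y ranges of the horizontal lines, sorted bottom to top
     * @param xs {from, to} x ranges of the vertical lines, sorted left to right
     * @return boxes in PDF coordinates (origin bottom left), rows numbered
     *         from the top, columns from the left
     */
    static Map<String, Rectangle2D> cells(float[][] ys, float[][] xs) {
        Map<String, Rectangle2D> result = new LinkedHashMap<>();
        int row = 1;
        for (int i = ys.length - 1; i > 0; i--, row++) {
            float[] top = ys[i];
            float[] bottom = ys[i - 1];
            for (int j = 1; j < xs.length; j++) {
                float[] left = xs[j - 1];
                float[] right = xs[j];
                // The box spans from the outer edge of one line to the
                // outer edge of the next, so it includes the frame lines.
                Rectangle2D box = new Rectangle2D.Float(
                        left[0], bottom[0],
                        right[1] - left[0], top[1] - bottom[0]);
                result.put("R" + row + "C" + j, box);
            }
        }
        return result;
    }
}
```

With horizontal lines at y ranges [100, 101], [150, 151] and [300, 301], the upper row comes out 151 units high and the lower row 51 units high, which is exactly the per-row height adjustment the question asks for.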

import java.awt.geom.Point2D;
import java.awt.geom.Rectangle2D;
import java.io.IOException;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

import org.apache.pdfbox.contentstream.PDFGraphicsStreamEngine;
import org.apache.pdfbox.cos.COSName;
import org.apache.pdfbox.pdmodel.PDPage;
import org.apache.pdfbox.pdmodel.common.PDRectangle;
import org.apache.pdfbox.pdmodel.graphics.color.PDColor;
import org.apache.pdfbox.pdmodel.graphics.image.PDImage;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

public class PdfBoxFinder extends PDFGraphicsStreamEngine {
    /**
     * Supply the page to analyze here; to analyze multiple pages
     * create multiple {@link PdfBoxFinder} instances.
     */
    public PdfBoxFinder(PDPage page) {
        super(page);
    }

    /**
     * The boxes ({@link Rectangle2D} instances with coordinates according to
     * the PDF coordinate system, e.g. for decorating the table cells) the
     * {@link PdfBoxFinder} has recognized on the current page.
     */
    public Map<String, Rectangle2D> getBoxes() {
        consolidateLists();
        Map<String, Rectangle2D> result = new HashMap<>();
        if (!horizontalLines.isEmpty() && !verticalLines.isEmpty())
        {
            // Walk the horizontal lines from the top of the page downwards;
            // each pair of neighbouring lines bounds one table row.
            Interval top = horizontalLines.get(horizontalLines.size() - 1);
            char rowLetter = 'A';
            for (int i = horizontalLines.size() - 2; i >= 0; i--, rowLetter++) {
                Interval bottom = horizontalLines.get(i);
                Interval left = verticalLines.get(0);
                int column = 1;
                for (int j = 1; j < verticalLines.size(); j++, column++) {
                    Interval right = verticalLines.get(j);
                    // Cells are named like spreadsheet cells: A1, A2, ..., B1, ...
                    String name = String.format("%s%s", rowLetter, column);
                    Rectangle2D rectangle = new Rectangle2D.Float(left.from, bottom.from, right.to - left.from, top.to - bottom.from);
                    result.put(name, rectangle);
                    left = right;
                }
                top = bottom;
            }
        }
        return result;
    }

    /**
     * The regions ({@link Rectangle2D} instances with coordinates according
     * to the PDFBox text extraction API, e.g. for initializing the regions of
     * a {@link PDFTextStripperByArea}) the {@link PdfBoxFinder} has recognized
     * on the current page.
     */
    public Map<String, Rectangle2D> getRegions() {
        PDRectangle cropBox = getPage().getCropBox();
        float xOffset = cropBox.getLowerLeftX();
        float yOffset = cropBox.getUpperRightY();
        Map<String, Rectangle2D> result = getBoxes();
        for (Map.Entry<String, Rectangle2D> entry : result.entrySet()) {
            Rectangle2D box = entry.getValue();
            // Flip the y axis: text extraction regions are measured from the top.
            Rectangle2D region = new Rectangle2D.Float(xOffset + (float)box.getX(), yOffset - (float)(box.getY() + box.getHeight()), (float)box.getWidth(), (float)box.getHeight());
            entry.setValue(region);
        }
        return result;
    }

    /**
     * <p>
     * Processes the path elements currently in the {@link #path} list and
     * eventually clears the list.
     * </p>
     * <p>
     * Currently only elements are considered which
     * </p>
     * <ul>
     * <li>are {@link Rectangle} instances;
     * <li>are filled fairly black;
     * <li>have a thin and long form; and
     * <li>have sides fairly parallel to the coordinate axis.
     * </ul>
     */
    void processPath() throws IOException {
        PDColor color = getGraphicsState().getNonStrokingColor();
        if (!isBlack(color)) {
            logger.debug("Dropped path due to non-black fill-color.");
            // Clear the path here, too, so that dropped elements are not
            // picked up again by a later fill operation.
            path.clear();
            return;
        }

        for (PathElement pathElement : path) {
            if (pathElement instanceof Rectangle) {
                Rectangle rectangle = (Rectangle) pathElement;

                double p0p1 = rectangle.p0.distance(rectangle.p1);
                double p1p2 = rectangle.p1.distance(rectangle.p2);
                boolean p0p1small = p0p1 < 3;
                boolean p1p2small = p1p2 < 3;

                if (p0p1small) {
                    if (p1p2small) {
                        logger.debug("Dropped rectangle too small on both sides.");
                    } else {
                        processThinRectangle(rectangle.p0, rectangle.p1, rectangle.p2, rectangle.p3);
                    }
                } else if (p1p2small) {
                    processThinRectangle(rectangle.p1, rectangle.p2, rectangle.p3, rectangle.p0);
                } else {
                    logger.debug("Dropped rectangle too large on both sides.");
                }
            }
        }
        path.clear();
    }

    /**
     * The argument points shall be sorted to have (p0, p1) and (p2, p3) be the small
     * edges and (p1, p2) and (p3, p0) the long ones.
     */
    void processThinRectangle(Point2D p0, Point2D p1, Point2D p2, Point2D p3) {
        float longXDiff = (float)Math.abs(p2.getX() - p1.getX());
        float longYDiff = (float)Math.abs(p2.getY() - p1.getY());
        boolean longXDiffSmall = longXDiff * 10 < longYDiff;
        boolean longYDiffSmall = longYDiff * 10 < longXDiff;

        if (longXDiffSmall) {
            verticalLines.add(new Interval(p0.getX(), p1.getX(), p2.getX(), p3.getX()));
        } else if (longYDiffSmall) {
            horizontalLines.add(new Interval(p0.getY(), p1.getY(), p2.getY(), p3.getY()));
        } else {
            logger.debug("Dropped rectangle too askew.");
        }
    }

    /**
     * Sorts the {@link #horizontalLines} and {@link #verticalLines} lists and
     * merges fairly identical entries.
     */
    void consolidateLists() {
        for (List<Interval> intervals : Arrays.asList(horizontalLines, verticalLines)) {
            intervals.sort(null);
            for (int i = 1; i < intervals.size();) {
                if (intervals.get(i-1).combinableWith(intervals.get(i))) {
                    Interval interval = intervals.get(i-1).combineWith(intervals.get(i));
                    intervals.set(i-1, interval);
                    intervals.remove(i);
                } else {
                    i++;
                }
            }
        }
    }

    /**
     * Checks whether the given color is black'ish.
     */
    boolean isBlack(PDColor color) throws IOException {
        int value = color.toRGB();
        // Check all three components (blue, green, red) of the packed RGB value.
        for (int i = 0; i < 3; i++) {
            int component = value & 0xff;
            if (component > 5)
                return false;
            value /= 256;
        }
        return true;
    }

    //
    // PDFGraphicsStreamEngine overrides
    //
    @Override
    public void appendRectangle(Point2D p0, Point2D p1, Point2D p2, Point2D p3) throws IOException {
        path.add(new Rectangle(p0, p1, p2, p3));
    }

    @Override
    public void endPath() throws IOException {
        path.clear();
    }

    @Override
    public void strokePath() throws IOException {
        path.clear();
    }

    @Override
    public void fillPath(int windingRule) throws IOException {
        processPath();
    }

    @Override
    public void fillAndStrokePath(int windingRule) throws IOException {
        processPath();
    }

    @Override public void drawImage(PDImage pdImage) throws IOException { }
    @Override public void clip(int windingRule) throws IOException { }
    @Override public void moveTo(float x, float y) throws IOException { }
    @Override public void lineTo(float x, float y) throws IOException { }
    @Override public void curveTo(float x1, float y1, float x2, float y2, float x3, float y3) throws IOException { }
    @Override public Point2D getCurrentPoint() throws IOException { return null; }
    @Override public void closePath() throws IOException { }
    @Override public void shadingFill(COSName shadingName) throws IOException { }

    //
    // inner classes
    //
    class Interval implements Comparable<Interval> {
        final float from;
        final float to;

        Interval(float... values) {
            Arrays.sort(values);
            this.from = values[0];
            this.to = values[values.length - 1];
        }

        Interval(double... values) {
            Arrays.sort(values);
            this.from = (float) values[0];
            this.to = (float) values[values.length - 1];
        }

        boolean combinableWith(Interval other) {
            if (this.from > other.from)
                return other.combinableWith(this);
            if (this.to < other.from)
                return false;
            // Combinable if the overlap covers at least 90% of either interval.
            float intersectionLength = Math.min(this.to, other.to) - other.from;
            float thisLength = this.to - this.from;
            float otherLength = other.to - other.from;
            return (intersectionLength >= thisLength * .9f) || (intersectionLength >= otherLength * .9f);
        }

        Interval combineWith(Interval other) {
            return new Interval(this.from, this.to, other.from, other.to);
        }

        @Override
        public int compareTo(Interval o) {
            return this.from == o.from ? Float.compare(this.to, o.to) : Float.compare(this.from, o.from);
        }

        @Override
        public String toString() {
            return String.format("[%3.2f, %3.2f]", from, to);
        }
    }

    interface PathElement {
    }

    class Rectangle implements PathElement {
        final Point2D p0, p1, p2, p3;

        Rectangle(Point2D p0, Point2D p1, Point2D p2, Point2D p3) {
            this.p0 = p0;
            this.p1 = p1;
            this.p2 = p2;
            this.p3 = p3;
        }
    }

    //
    // members
    //
    final List<PathElement> path = new ArrayList<>();
    final List<Interval> horizontalLines = new ArrayList<>();
    final List<Interval> verticalLines = new ArrayList<>();
    final Logger logger = LoggerFactory.getLogger(PdfBoxFinder.class);
}

PdfBoxFinder.java
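The difference between the two coordinate conventions is easiest to see with concrete numbers. The following sketch covers only the simple case of a crop box whose lower left corner is at the origin, which is an assumption made for this illustration. The class `RegionFlip` and its method are hypothetical names. PDF boxes are measured from the bottom of the page, while text extraction regions are measured from the top, so the y value is flipped around the upper edge of the crop box.

```java
import java.awt.geom.Rectangle2D;

/** Hypothetical illustration of the y flip; assumes the crop box starts at (0, 0). */
final class RegionFlip {
    /**
     * @param box                a box in PDF coordinates (origin bottom left, y up)
     * @param cropBoxUpperRightY the upper edge of the crop box
     * @return the same area with y measured downwards from the top edge
     */
    static Rectangle2D toRegion(Rectangle2D box, float cropBoxUpperRightY) {
        float top = (float) (box.getY() + box.getHeight());
        return new Rectangle2D.Float(
                (float) box.getX(),
                cropBoxUpperRightY - top,
                (float) box.getWidth(),
                (float) box.getHeight());
    }
}
```

For example, on a page 842 units high, a box at y = 150 with a height of 151 has its upper edge at 301, so the corresponding region starts at y = 842 - 301 = 541 from the top.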

Example usage

You can use the PdfBoxFinder like this to extract the text from the table cells of the sample document located at FILE_PATH:

// FILE_PATH is assumed to be a java.io.File here (PDFBox 2.x API).
try (PDDocument document = PDDocument.load(FILE_PATH)) {
    for (PDPage page : document.getDocumentCatalog().getPages()) {
        // Find the table cell boxes on this page.
        PdfBoxFinder boxFinder = new PdfBoxFinder(page);
        boxFinder.processPage(page);

        // Register each cell as a named text extraction region.
        PDFTextStripperByArea stripperByArea = new PDFTextStripperByArea();
        for (Map.Entry<String, Rectangle2D> entry : boxFinder.getRegions().entrySet()) {
            stripperByArea.addRegion(entry.getKey(), entry.getValue());
        }

        // Extract the text and print it cell by cell, sorted by region name.
        stripperByArea.extractRegions(page);
        List<String> names = stripperByArea.getRegions();
        names.sort(null);
        for (String name : names) {
            System.out.printf("[%s] %s\n", name, stripperByArea.getTextForRegion(name));
        }
    }
}

(ExtractBoxedText test testExtractBoxedTexts)

The start of the output:

[A1] Nr.
crt.

[A2] Nume şi prenume

[A3] Titlul lucrării

[A4] Coordonator ştiinţific

[A5] Ora

[B1] 1.

[B2] SFETCU I. JESSICA-
LARISA

[B3] Analiza fluxurilor de date twitter

[B4] Conf. univ. dr. Frîncu Marc
Eduard


[B5] 8:00

[C1] 2.

[C2] TARBA V. IONUȚ-
ADRIAN

[C3] Test me - rest api folosind java şi
play framework

[C4] Conf.univ.dr. Fortiş Teodor
Florin


[C5] 8:12
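Since the question is about working row by row, the per-cell texts can be regrouped into rows afterwards. The sketch below assumes region names of the form shown in the output, that is, one character for the row followed by the column number. The class `RowGrouper` is a hypothetical helper for this illustration, and the whitespace normalization is an added convenience.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

/** Hypothetical helper that regroups cell texts such as "B3" by row. */
final class RowGrouper {
    /**
     * @param textByRegion cell text keyed by region name; the first character
     *                     is taken as the row, the rest as the column number
     * @return for each row, the cell texts ordered by column, with line
     *         breaks and runs of whitespace collapsed to single spaces
     */
    static Map<Character, List<String>> byRow(Map<String, String> textByRegion) {
        Map<Character, TreeMap<Integer, String>> sorted = new TreeMap<>();
        for (Map.Entry<String, String> entry : textByRegion.entrySet()) {
            String name = entry.getKey();
            char row = name.charAt(0);
            int column = Integer.parseInt(name.substring(1));
            String text = entry.getValue().replaceAll("\\s+", " ").trim();
            sorted.computeIfAbsent(row, r -> new TreeMap<>()).put(column, text);
        }
        Map<Character, List<String>> rows = new TreeMap<>();
        for (Map.Entry<Character, TreeMap<Integer, String>> entry : sorted.entrySet()) {
            rows.put(entry.getKey(), new ArrayList<>(entry.getValue().values()));
        }
        return rows;
    }
}
```

The input map could be filled in the usage loop with `name` as key and `stripperByArea.getTextForRegion(name)` as value; each resulting list then holds one table row in column order.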

The first page of the document:

[screenshot]