如何提取超链接信息PDFBox

Question

如何提取超链接信息PDFBox

kab*_*hra 4 java pdf text hyperlink pdfbox

我正在尝试使用PDFBox从PDF中提取超链接信息，但是我不确定如何获取

for( Object p : pages ) {
    PDPage page = (PDPage)p;

    List<?> annotations = page.getAnnotations();
    for( Object a : annotations ) {
        PDAnnotation annotation = (PDAnnotation)a;

        if( annotation instanceof PDAnnotationLink ) {
            PDAnnotationLink link = (PDAnnotationLink)annotation;
            System.out.println(link.toString());
            System.out.println(link.getDestination());

        }
    }

}

Run Code Online (Sandbox Code Playgroud)

我想提取超链接目标的网址和超链接的文本。一个人怎么能做到呢？

谢谢

Answer 1

Til*_*err 5

从源代码下载的PrintURLs示例代码中使用以下代码：

for( PDPage page : doc.getPages() )
{
    pageNum++;
    PDFTextStripperByArea stripper = new PDFTextStripperByArea();
    List<PDAnnotation> annotations = page.getAnnotations();
    //first setup text extraction regions
    for( int j=0; j<annotations.size(); j++ )
    {
        PDAnnotation annot = annotations.get(j);
        if( annot instanceof PDAnnotationLink )
        {
            PDAnnotationLink link = (PDAnnotationLink)annot;
            PDRectangle rect = link.getRectangle();
            //need to reposition link rectangle to match text space
            float x = rect.getLowerLeftX();
            float y = rect.getUpperRightY();
            float width = rect.getWidth();
            float height = rect.getHeight();
            int rotation = page.getRotation();
            if( rotation == 0 )
            {
                PDRectangle pageSize = page.getMediaBox();
                y = pageSize.getHeight() - y;
            }
            else if( rotation == 90 )
            {
                //do nothing
            }

            Rectangle2D.Float awtRect = new Rectangle2D.Float( x,y,width,height );
            stripper.addRegion( "" + j, awtRect );
        }
    }

    stripper.extractRegions( page );

    for( int j=0; j<annotations.size(); j++ )
    {
        PDAnnotation annot = annotations.get(j);
        if( annot instanceof PDAnnotationLink )
        {
            PDAnnotationLink link = (PDAnnotationLink)annot;
            PDAction action = link.getAction();
            String urlText = stripper.getTextForRegion( "" + j );
            if( action instanceof PDActionURI )
            {
                PDActionURI uri = (PDActionURI)action;
                System.out.println( "Page " + pageNum +":'" + urlText.trim() + "'=" + uri.getURI() );
            }
        }
    }
}

Run Code Online (Sandbox Code Playgroud)

它分为两个部分，一个是获取简单的URL，另一个是获取URL文本，这是通过在注释的矩形处提取文本来完成的。

归档时间：	9 年，5 月前
查看次数：	2717 次
最近记录：	8 年前