kab*_*hra 4 java pdf text hyperlink pdfbox
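I am trying to extract hyperlink information from a PDF using PDFBox, but I am not sure how to get at the details. This is what I have so far: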
for( Object p : pages ) {
    PDPage page = (PDPage)p;
    List<?> annotations = page.getAnnotations();
    for( Object a : annotations ) {
        PDAnnotation annotation = (PDAnnotation)a;
        if( annotation instanceof PDAnnotationLink ) {
            PDAnnotationLink link = (PDAnnotationLink)annotation;
            System.out.println(link.toString());
            System.out.println(link.getDestination());
        }
    }
}
I want to extract both the URL that each hyperlink points to and the text of the hyperlink. How can I do that?
Thanks
Use the following code, taken from the PrintURLs example that ships with the PDFBox source download:
for( PDPage page : doc.getPages() )
{
    pageNum++;
    PDFTextStripperByArea stripper = new PDFTextStripperByArea();
    List<PDAnnotation> annotations = page.getAnnotations();
    //first setup text extraction regions
    for( int j=0; j<annotations.size(); j++ )
    {
        PDAnnotation annot = annotations.get(j);
        if( annot instanceof PDAnnotationLink )
        {
            PDAnnotationLink link = (PDAnnotationLink)annot;
            PDRectangle rect = link.getRectangle();
            //need to reposition link rectangle to match text space
            float x = rect.getLowerLeftX();
            float y = rect.getUpperRightY();
            float width = rect.getWidth();
            float height = rect.getHeight();
            int rotation = page.getRotation();
            if( rotation == 0 )
            {
                PDRectangle pageSize = page.getMediaBox();
                y = pageSize.getHeight() - y;
            }
            else if( rotation == 90 )
            {
                //do nothing
            }
            Rectangle2D.Float awtRect = new Rectangle2D.Float( x,y,width,height );
            stripper.addRegion( "" + j, awtRect );
        }
    }
    stripper.extractRegions( page );
    for( int j=0; j<annotations.size(); j++ )
    {
        PDAnnotation annot = annotations.get(j);
        if( annot instanceof PDAnnotationLink )
        {
            PDAnnotationLink link = (PDAnnotationLink)annot;
            PDAction action = link.getAction();
            String urlText = stripper.getTextForRegion( "" + j );
            if( action instanceof PDActionURI )
            {
                PDActionURI uri = (PDActionURI)action;
                System.out.println( "Page " + pageNum +":'" + urlText.trim() + "'=" + uri.getURI() );
            }
        }
    }
}
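It works in two parts. Getting the URL is the simple part: it comes from the link's PDActionURI. Getting the link text takes more work: the code extracts whatever text lies inside the annotation's rectangle.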
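The step that tends to confuse people is the rectangle repositioning, so here is a minimal standalone sketch of just that arithmetic, using only java.awt so it runs without PDFBox on the classpath. The class name LinkRegion, the method toTextSpace, and the sample numbers (a 792 point tall page, a link whose top edge is at y = 700) are made up for illustration; they are not part of PDFBox. It assumes, as the snippet above does, that the annotation rectangle is measured from the bottom-left of the page while the text region is measured from the top-left, and that only rotation 0 needs the flip.

```java
import java.awt.geom.Rectangle2D;

class LinkRegion {
    /**
     * Hypothetical helper (not a PDFBox API): turns a link rectangle given in
     * page coordinates (origin at the bottom-left) into the top-left-origin
     * rectangle that is passed to the area stripper in the snippet above.
     */
    static Rectangle2D.Float toTextSpace(float lowerLeftX, float upperRightY,
                                         float width, float height,
                                         float pageHeight, int rotation) {
        float y = upperRightY;
        if (rotation == 0) {
            // Distance from the top of the page to the top edge of the link.
            y = pageHeight - upperRightY;
        }
        // Other rotations are left untouched, mirroring the example code.
        return new Rectangle2D.Float(lowerLeftX, y, width, height);
    }

    public static void main(String[] args) {
        // Assumed sample values: link at x=72, top edge y=700, 100x12, page height 792.
        Rectangle2D.Float r = toTextSpace(72f, 700f, 100f, 12f, 792f, 0);
        System.out.println(r.x + "," + r.y + "," + r.width + "," + r.height);
        // prints 72.0,92.0,100.0,12.0
    }
}
```

If the extracted link text comes back empty or belongs to a neighbouring line, this conversion is the first thing I would check, particularly on rotated pages, which the example does not adjust for.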