How to extract bold text from a PDF using PDFBox?

Question

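I am using Apache PDFBox to extract text. I can extract the text from a PDF, but I don't know how to tell whether a word is bold. (A code suggestion would be nice!) Here is the code that extracts plain text from the PDF.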

PDDocument document = PDDocument
    .load("/home/lipu/workspace/MRCPTester/test.pdf");
document.getClass();
if (document.isEncrypted()) {
    try {
        document.decrypt("");
    } catch (InvalidPasswordException e) {
        System.err.println("Error: Document is encrypted with a password.");
        System.exit(1);
    }
}

// PDFTextStripperByArea stripper = new PDFTextStripperByArea();
// stripper.setSortByPosition(true);
PDFTextStripper stripper = new PDFTextStripper();
stripper.setStartPage(1);
stripper.setEndPage(2);
stripper.setSortByPosition(true);
String st = stripper.getText(document);

Answer 1

mkl*_*mkl 19

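The result of PDFTextStripper is plain text. After extraction, therefore, it is too late. But you can override some of its methods and only let through text that is formatted the way you want.

In the case of PDFTextStripper you have to override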

protected void processTextPosition( TextPosition text )

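In your override you can check whether the text in question fulfills your requirements (a TextPosition contains a lot of information about the text, not only the text itself), and if it does, forward the TextPosition text to the super implementation.

The main problem, though, is recognizing which text is bold.

One criterion for boldness may be the word bold in the font name, e.g. Courier-BoldOblique - you access the font of the text using text.getFont() and the PostScript name of the font using the font's getBaseFont() method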

String postscriptName = text.getFont().getBaseFont();
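
As a rough illustration of the font-name criterion, the check below looks for "bold" in a PostScript name such as the one returned by getBaseFont(). The class and method names are hypothetical and only the JDK is used. It also assumes that embedded subset fonts carry a six-letter tag prefix (for example ABCDEF+), which is dropped before matching. Treat it as a sketch under these assumptions, not as part of PDFBox.

```java
// Hypothetical helper, not part of PDFBox: decides from a PostScript font name
// whether the font is probably a bold face.
final class BoldFontNames {
    private BoldFontNames() {}

    /** Returns true if the name, without any subset tag, contains "bold" (case-insensitive). */
    static boolean looksBold(String postscriptName) {
        if (postscriptName == null) {
            return false; // assumption: a missing name is treated as "not bold"
        }
        String name = postscriptName;
        // Subset fonts are named like "ABCDEF+Arial-BoldMT"; drop the six-letter tag
        // so that a tag such as "BOLDAB+" cannot cause a false match.
        if (name.length() > 6 && name.charAt(6) == '+') {
            name = name.substring(7);
        }
        return name.toLowerCase(java.util.Locale.ROOT).contains("bold");
    }
}
```

Inside a processTextPosition override, one would presumably pass text.getFont().getBaseFont() to this check. Names such as Semibold or DemiBold also match, while heavy faces named Black or Heavy do not, so the word list may need adjusting for the documents at hand.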

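A criterion may also come from the font descriptor - you get the font descriptor of a font using the getFontDescriptor method, and a font descriptor has an optional font weight value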

float fontWeight = text.getFont().getFontDescriptor().getFontWeight();

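This value is defined as

(Optional; PDF 1.5; should be used for Type 3 fonts in Tagged PDF documents) The weight (thickness) component of the fully-qualified font name or font specifier. The possible values shall be 100, 200, 300, 400, 500, 600, 700, 800, or 900, where each number indicates a weight that is at least as dark as its predecessor. A value of 400 shall indicate a normal weight; 700 shall indicate bold.

The specific interpretation of these values varies from font to font.

EXAMPLE 300 in one font may appear most similar to 500 in another.

(Table 122, Section 9.8.1, ISO 32000-1)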
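
The definition above suggests a simple weight test, sketched here with JDK-only code. The class and method names are hypothetical, and treating every weight of 700 or more as bold is an assumption: the specification itself notes that the interpretation varies from font to font. Because the entry is optional, the sketch also assumes that a value outside 100 to 900 means the weight was not set.

```java
// Hypothetical helper, not part of PDFBox: interprets a FontWeight value
// as described in Table 122 of ISO 32000-1.
final class FontWeights {
    /** The value that "shall indicate bold" according to the specification. */
    static final float BOLD_WEIGHT = 700f;

    private FontWeights() {}

    /** Returns true if the value lies in the defined range 100..900. */
    static boolean isSpecified(float fontWeight) {
        // Assumption: anything outside the range means the optional entry is absent.
        return fontWeight >= 100f && fontWeight <= 900f;
    }

    /** Returns true if the weight is specified and at least 700 (assumed threshold). */
    static boolean isBold(float fontWeight) {
        return isSpecified(fontWeight) && fontWeight >= BOLD_WEIGHT;
    }
}
```

When isSpecified returns false, the weight gives no information either way, so one of the other criteria has to decide.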

There may be further hints of boldness to check, e.g. a large line width

double lineWidth = getGraphicsState().getLineWidth();

when the rendering mode also draws the outline:

int renderingMode = getGraphicsState().getTextState().getRenderingMode();

You may have to try out, with the documents you have at hand, which of these criteria suffice.
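
The line-width and rendering-mode hints can be combined into one check, sketched below with JDK-only code. The class, the method names and the minLineWidth parameter are hypothetical. The sketch assumes that "also draws the outline" means the PDF text rendering modes that both fill and stroke glyphs, which are 2 (fill, then stroke) and 6 (fill, then stroke, and add to the clipping path). A sensible threshold depends on the font size and the current transformation, so it is left to the caller.

```java
// Hypothetical helper, not part of PDFBox: detects "poor man's bold", i.e. text
// that is filled and additionally stroked with a noticeable line width.
final class FauxBold {
    private FauxBold() {}

    /** Returns true for text rendering modes 2 and 6, which fill and stroke glyphs. */
    static boolean fillsAndStrokes(int renderingMode) {
        return renderingMode == 2 || renderingMode == 6;
    }

    /**
     * Returns true if glyphs are filled and stroked and the stroke is at least
     * minLineWidth wide (a caller-chosen, document-specific threshold).
     */
    static boolean looksBold(int renderingMode, double lineWidth, double minLineWidth) {
        return fillsAndStrokes(renderingMode) && lineWidth >= minLineWidth;
    }
}
```

The arguments would presumably come from the getRenderingMode() and getLineWidth() calls shown above. Stroke-only text (mode 1) is deliberately not counted here, which is a design choice that may not suit every document.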

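@lujop When I wrote the answer, `processTextPosition` was the only method that could appropriately be overridden, and transferring the results to the output was difficult. Meanwhile (in 1.8.11 and 2.0.x) `writeString` has also become available for overriding, and as that method is closer to the final output, it can be used to inject tags for bold and the like, cf. [this answer](http://stackoverflow.com/a/40039407/1729265). Keep in mind, though, that there are many ways to create bold text, cf. [this answer](http://stackoverflow.com/a/26642060/1729265). For a generic solution you have to check all of them. (2 upvotes)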
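
To illustrate the idea of injecting tags close to the output, here is a JDK-only sketch of the tagging step on its own. The Run class is a hypothetical stand-in for a piece of text together with the result of whatever bold check is used. It is not a PDFBox type, and the linked answer may structure this differently.

```java
// Hypothetical helper, not part of PDFBox: wraps consecutive bold runs in <b>...</b>.
final class BoldTagger {
    /** Hypothetical stand-in for a piece of extracted text plus its bold flag. */
    static final class Run {
        final String text;
        final boolean bold;

        Run(String text, boolean bold) {
            this.text = text;
            this.bold = bold;
        }
    }

    private BoldTagger() {}

    /** Concatenates the runs, opening a tag where bold starts and closing it where bold ends. */
    static String tag(java.util.List<Run> runs) {
        StringBuilder out = new StringBuilder();
        boolean open = false;
        for (Run run : runs) {
            if (run.bold && !open) {
                out.append("<b>");
                open = true;
            } else if (!run.bold && open) {
                out.append("</b>");
                open = false;
            }
            out.append(run.text);
        }
        if (open) {
            out.append("</b>"); // close a tag left open at the end of the input
        }
        return out.toString();
    }
}
```

Merging adjacent bold runs into a single tag keeps the output readable when a bold phrase is split into several pieces.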
