PDFBox 2.0 RC3 - Find and replace text

Question

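How do I find and replace text in a PDF document using PDFBox 2.0? The old example has been pulled and its syntax no longer works, so I am wondering whether this is still possible and, if so, what the best way to go about it is. Thanks!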

Answer 1

You can try something like this:

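import java.io.IOException;
import java.io.OutputStream;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

import org.apache.pdfbox.contentstream.operator.Operator;
import org.apache.pdfbox.cos.COSArray;
import org.apache.pdfbox.cos.COSString;
import org.apache.pdfbox.pdfparser.PDFStreamParser;
import org.apache.pdfbox.pdfwriter.ContentStreamWriter;
import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.pdmodel.PDPage;
import org.apache.pdfbox.pdmodel.PDPageTree;
import org.apache.pdfbox.pdmodel.common.PDStream;

public static PDDocument replaceText(PDDocument document, String searchString, String replacement) throws IOException {
    if (searchString == null || searchString.isEmpty() || replacement == null || replacement.isEmpty()) {
        return document;
    }
    // Quote both arguments so the search text is matched literally, not as a regex.
    String pattern = Pattern.quote(searchString);
    String quotedReplacement = Matcher.quoteReplacement(replacement);
    PDPageTree pages = document.getDocumentCatalog().getPages();
    for (PDPage page : pages) {
        PDFStreamParser parser = new PDFStreamParser(page);
        parser.parse();
        List<Object> tokens = parser.getTokens();
        for (int j = 1; j < tokens.size(); j++) {
            Object next = tokens.get(j);
            if (!(next instanceof Operator)) {
                continue;
            }
            String opName = ((Operator) next).getName();
            Object operand = tokens.get(j - 1);
            // Tj and TJ are the two operators that display strings in a PDF.
            if (opName.equals("Tj") && operand instanceof COSString) {
                // Tj takes a single operand: the string to display.
                COSString previous = (COSString) operand;
                String string = previous.getString().replaceFirst(pattern, quotedReplacement);
                // getBytes() uses the platform charset and ignores the font's encoding.
                previous.setValue(string.getBytes());
            } else if (opName.equals("TJ") && operand instanceof COSArray) {
                // TJ takes an array that mixes string fragments and kerning numbers.
                COSArray previous = (COSArray) operand;
                for (int k = 0; k < previous.size(); k++) {
                    Object arrElement = previous.getObject(k);
                    if (arrElement instanceof COSString) {
                        COSString cosString = (COSString) arrElement;
                        String string = cosString.getString().replaceFirst(pattern, quotedReplacement);
                        cosString.setValue(string.getBytes());
                    }
                }
            }
        }
        // Now that the tokens are updated, replace the page content stream.
        PDStream updatedStream = new PDStream(document);
        try (OutputStream out = updatedStream.createOutputStream()) {
            new ContentStreamWriter(out).writeTokens(tokens);
        }
        page.setContents(updatedStream);
    }
    return document;
}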
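
A detail worth noting when the search text is meant literally: `String.replaceFirst` interprets its first argument as a regular expression, so characters such as `(`, `.` or `$` change what gets matched. A minimal sketch of the difference, using only the JDK (the class and helper names here are hypothetical, not part of PDFBox):

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class LiteralReplaceDemo {

    // Replaces the first literal occurrence of search in text.
    // Pattern.quote disables regex metacharacters in the search text;
    // Matcher.quoteReplacement disables '$' and '\' in the replacement.
    public static String replaceFirstLiteral(String text, String search, String replacement) {
        return text.replaceFirst(Pattern.quote(search), Matcher.quoteReplacement(replacement));
    }

    public static void main(String[] args) {
        String text = "Total (net): 10";
        // Unquoted: "(net)" is a regex group that matches just "net",
        // so the parentheses survive. Prints: Total (gross): 10
        System.out.println(text.replaceFirst("(net)", "gross"));
        // Quoted: the whole literal "(net)" is replaced. Prints: Total gross: 10
        System.out.println(replaceFirstLiteral(text, "(net)", "gross"));
    }
}
```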

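This code only works for very simple PDFs; with more complex PDFs it either changes nothing or, even worse, corrupts them. (3 upvotes)
https://pdfbox.apache.org/2.0/migration.html Why was the ReplaceText example removed? (3 upvotes)
That is explained in the last section of the link you mentioned: https://pdfbox.apache.org/2.0/migration.html#why-was-the-replacetext-example-removed It is mainly due to character encoding and font issues. (2 upvotes)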
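
To illustrate the "changes nothing" case from the comments: besides encoding and font issues, a TJ array often splits a visible word into several string fragments separated by kerning adjustments, so a per-fragment replacement never sees the whole word. The plain-Java sketch below models a TJ operand as a list of strings and numbers. It does not use PDFBox, and the class name, method names and sample data are hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class TjFragmentDemo {

    // Replaces the search text inside each string fragment independently,
    // mirroring a per-element loop over a TJ array. Numbers (kerning
    // adjustments) are copied through unchanged.
    public static List<Object> replacePerFragment(List<Object> tjArray, String search, String replacement) {
        List<Object> result = new ArrayList<>();
        for (Object element : tjArray) {
            if (element instanceof String) {
                result.add(((String) element).replace(search, replacement));
            } else {
                result.add(element);
            }
        }
        return result;
    }

    // Joins only the string fragments: the text a reader actually sees.
    public static String visibleText(List<Object> tjArray) {
        StringBuilder sb = new StringBuilder();
        for (Object element : tjArray) {
            if (element instanceof String) {
                sb.append((String) element);
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // Hypothetical TJ operand: "Hello World" split by kerning values.
        List<Object> tjArray = Arrays.<Object>asList("Hel", -20, "lo Wor", 15, "ld");
        List<Object> replaced = replacePerFragment(tjArray, "World", "PDFBox");
        // The visible text contains "World" ...
        System.out.println(visibleText(tjArray).contains("World")); // true
        // ... but no single fragment does, so nothing is replaced.
        System.out.println(visibleText(replaced)); // Hello World
    }
}
```

A more robust approach would have to match against the concatenated text and then rewrite the affected fragments, which is where the font and encoding problems mentioned above come in.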

Answer 2

小智 5

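I spent a lot of time trying to come up with a solution and eventually got an Acrobat DC subscription so that I could create fields as placeholders for the text to be replaced. In my case the fields hold customer information and order details, so the data is not very complex, but the document is full of pages of business terms and conditions and has a very complex layout.

Then I simply did the following, which may work for you too.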

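import java.io.File;
import java.io.IOException;
import java.util.HashMap;
import java.util.Map;

import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.pdmodel.interactive.form.PDAcroForm;
import org.apache.pdfbox.pdmodel.interactive.form.PDField;

private void update() throws IOException {
    // Maps each fully qualified field name to the value to fill in.
    Map<String, String> map = new HashMap<>();
    map.put("fieldname", "value to update");
    // try-with-resources closes the document even if filling a field fails.
    try (PDDocument document = PDDocument.load(new File("template.pdf"))) {
        PDAcroForm acroForm = document.getDocumentCatalog().getAcroForm();
        if (acroForm == null) {
            throw new IOException("template.pdf contains no AcroForm");
        }
        for (Map.Entry<String, String> entry : map.entrySet()) {
            // getField looks a field up by its fully qualified name, including nested fields.
            PDField field = acroForm.getField(entry.getKey());
            if (field != null) {
                field.setValue(entry.getValue());
                field.setReadOnly(true);
            }
        }
        document.save(new File("out.pdf"));
    }
}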

YMMV

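Using AcroForm fields really is the right way to fill in a PDF. But you don't need Acrobat to create the fields; you can also create them with PDFBox... (There is no nice GUI, though.) (4 upvotes)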

Archived:	9 years, 9 months ago
Viewed:	9877 times
Last activity:	8 years ago