I am merging two PDF files into one with PDFBox version 2. The first file contains these fonts:
name                                 type              encoding         emb sub uni object ID
------------------------------------ ----------------- ---------------- --- --- --- ---------
XXMGEM+Arial-BoldMT                  TrueType          WinAnsi          yes yes yes     15  0
XXMGEM+ArialMT                       TrueType          WinAnsi          yes yes yes     19  0
XXMGEM+ArialMT                       CID TrueType      Identity-H       yes yes yes     27  0
XXMGEM+ArialNarrow-Bold              TrueType          WinAnsi          yes yes yes     40  0
XXMGEM+ArialNarrow                   TrueType          WinAnsi          yes yes yes     44  0
The second file:
name                                 type              encoding         emb sub uni object ID
------------------------------------ ----------------- ---------------- --- --- --- ---------
UNTWVR+HelveticaLTCom-Roman          CID TrueType      Identity-H       yes yes yes     25  0
UNTYID+HelveticaLTCom-Bold           CID TrueType      Identity-H       yes yes yes     26  0
UNTZUP+ArialMT                       CID TrueType      Identity-H       yes yes yes     27  0
UNUBHB+Arial-BoldMT                  CID TrueType      Identity-H       yes yes yes     28  0
Helvetica-Bold                       Type 1            WinAnsi          no  no  no      29  0
UNXPUH+HelveticaLTCom-Roman          CID TrueType      Identity-H       yes yes yes     50  0
UNXRGT+HelveticaLTCom-Bold           CID TrueType      Identity-H       yes yes yes     51  0
UNXSTF+ArialMT                       CID TrueType      Identity-H       yes yes yes     52  0
UNXUFR+Arial-BoldMT                  CID TrueType      Identity-H       yes yes yes     53  0
After merging, the resulting file contains the following fonts:
name                                 type              encoding         emb sub uni object ID
------------------------------------ ----------------- ---------------- --- --- --- ---------
SRWYVL+HelveticaLTCom-Roman          CID TrueType      Identity-H       yes yes yes    420  0
SRXAHX+HelveticaLTCom-Bold           CID TrueType      Identity-H       yes yes yes    421  0
SRXBUJ+ArialMT                       CID TrueType      Identity-H       yes yes yes    422  0
SRXDGV+Arial-BoldMT                  CID TrueType      Identity-H       yes yes yes    423  0
Helvetica-Bold                       Type 1            WinAnsi          no  no  no     424  0
SRWYVL+HelveticaLTCom-Roman          CID TrueType      Identity-H       yes yes yes    425  0
SRXAHX+HelveticaLTCom-Bold           CID TrueType      Identity-H       yes yes yes    426  0
SRXBUJ+ArialMT                       CID TrueType      Identity-H       yes yes yes    427  0
SRXDGV+Arial-BoldMT                  CID TrueType      Identity-H       yes yes yes    428  0
SRWYVL+ArialMT                       CID TrueType      Identity-H       yes yes yes    429  0
SRXAHX+HelveticaLTCom-Roman          CID TrueType      Identity-H       yes yes yes    430  0
SRXBUJ+HelveticaLTCom-Bold           CID TrueType      Identity-H       yes yes yes    431  0
SRXDGV+Arial-BoldMT                  CID TrueType      Identity-H       yes yes yes    432  0
WDEGAT+Arial-BoldMT                  TrueType          WinAnsi          yes yes yes    436  0
GSEDXU+ArialMT                       TrueType          WinAnsi          yes yes yes    437  0
Arial                                TrueType          WinAnsi          yes no  no     416  0
ZapfDingbats                         TrueType          WinAnsi          yes no  yes    419  0
ArialNarrow                          TrueType          WinAnsi          yes no  no     417  0
ACHRDX+ZapfDingbats                  TrueType          WinAnsi          yes yes yes    618  0
ACHRDX+ZapfDingbats                  TrueType          WinAnsi          yes yes yes    619  0
ACHRDX+ZapfDingbats                  TrueType          WinAnsi          yes yes yes    620  0
ACHRDX+ZapfDingbats                  TrueType          WinAnsi          yes yes yes    621  0
ACHRDX+ZapfDingbats                  TrueType          WinAnsi          yes yes yes    622  0
GSEDXU+ArialNarrow-Bold              TrueType          WinAnsi          yes yes yes    560  0
NVGLHQ+ArialNarrow                   TrueType          WinAnsi          yes yes yes    561  0
KWHHMM+ArialMT                       CID TrueType      Identity-H       yes yes yes    578  0
My Java code:
final PDFMergerUtility pdfMerger = new PDFMergerUtility();
pdfMerger.setDestinationStream(outputStream);
pdfMerger.addSources(additionalPdfStreams);
pdfMerger.addSource(inputStreamPdDocument);
pdfMerger.mergeDocuments(MemoryUsageSetting.setupMainMemoryOnly());
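The problem is that an API from a third-party vendor has trouble with these fonts. So: what am I doing wrong, and how can I remove the unused and duplicated fonts?
The "duplicates" seem to come from the multiple pages, because each page carries its own font metadata. If you iterate over the pages and collect the font names, a font that is used on several pages shows up several times in the output.
Still, some details in the question seem off. Neither source file contains a ZapfDingbats font, so where does it come from in the merged document?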
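One way to check which fonts appear only after the merge is to compare the listings by base name. In the listings above, subset fonts carry a tag of six uppercase letters and a plus sign (for example `XXMGEM+`), and the tag differs between files, so it has to be stripped before comparing. The following is only a minimal sketch in plain Java: the class `SubsetTagDemo` and its methods are hypothetical helpers, not part of PDFBox, and the input lists are abbreviated copies of the font names from the question.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical helper, not part of PDFBox: compares font listings by base name.
class SubsetTagDemo {

    // Removes a leading subset tag (six uppercase letters and '+'), if present.
    static String baseName(String fontName) {
        return fontName.replaceFirst("^[A-Z]{6}\\+", "");
    }

    // Collects the distinct base names of a listing in alphabetical order.
    static Set<String> baseNames(List<String> fontNames) {
        Set<String> result = new TreeSet<>();
        for (String name : fontNames) {
            result.add(baseName(name));
        }
        return result;
    }

    // Base names that occur in the merged listing but in neither source listing.
    static Set<String> unexpected(List<String> merged, List<String> src1, List<String> src2) {
        Set<String> result = baseNames(merged);
        result.removeAll(baseNames(src1));
        result.removeAll(baseNames(src2));
        return result;
    }

    public static void main(String[] args) {
        // Font names copied from the question; repeated rows are omitted.
        List<String> first = Arrays.asList("XXMGEM+Arial-BoldMT", "XXMGEM+ArialMT",
                "XXMGEM+ArialNarrow-Bold", "XXMGEM+ArialNarrow");
        List<String> second = Arrays.asList("UNTWVR+HelveticaLTCom-Roman",
                "UNTYID+HelveticaLTCom-Bold", "UNTZUP+ArialMT", "UNUBHB+Arial-BoldMT",
                "Helvetica-Bold");
        List<String> merged = Arrays.asList("SRWYVL+HelveticaLTCom-Roman",
                "SRXAHX+HelveticaLTCom-Bold", "SRXBUJ+ArialMT", "SRXDGV+Arial-BoldMT",
                "Helvetica-Bold", "WDEGAT+Arial-BoldMT", "GSEDXU+ArialMT", "Arial",
                "ZapfDingbats", "ArialNarrow", "ACHRDX+ZapfDingbats",
                "GSEDXU+ArialNarrow-Bold", "NVGLHQ+ArialNarrow", "KWHHMM+ArialMT");
        System.out.println(unexpected(merged, first, second)); // prints [Arial, ZapfDingbats]
    }
}
```

With these inputs the sketch reports `Arial` and `ZapfDingbats` as base names that are in the merged listing but in neither source listing. It compares names only, so it cannot tell where those fonts were introduced.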
First, I wrote a couple of helper methods:
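import java.io.File;
import java.io.IOException;
import java.io.InputStream;
import java.util.ArrayList;
import java.util.List;
import org.apache.pdfbox.cos.COSName;
import org.apache.pdfbox.io.MemoryUsageSetting;
import org.apache.pdfbox.multipdf.PDFMergerUtility;
import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.pdmodel.PDPage;
import org.apache.pdfbox.pdmodel.PDResources;

// Merges two PDFs into a file in the temp directory and returns its path.
static String mergePdfs(InputStream is1, InputStream is2) throws IOException {
    PDFMergerUtility pdfMerger = new PDFMergerUtility();
    pdfMerger.addSource(is1);
    pdfMerger.addSource(is2);
    // new File(dir, name) inserts the path separator if java.io.tmpdir lacks one
    String destFile = new File(System.getProperty("java.io.tmpdir"), System.nanoTime() + ".pdf").getPath();
    pdfMerger.setDestinationFileName(destFile);
    pdfMerger.mergeDocuments(MemoryUsageSetting.setupMainMemoryOnly());
    return destFile;
}

// Lists the fonts in each page's resources: one entry per page and font.
static List<String> getFontNames(PDDocument doc) throws IOException {
    List<String> result = new ArrayList<>();
    for (PDPage page : doc.getPages()) {
        PDResources res = page.getResources();
        if (res == null) continue; // page without a resource dictionary
        for (COSName fontName : res.getFontNames()) {
            result.add(res.getFont(fontName).toString());
        }
    }
    return result;
}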
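Then I created 3 test PDF documents. The first two, test-pdf-1.pdf and test-pdf-2.pdf, contain one page each and use the same two fonts: PDTrueTypeFont BAAAAA+ArialMT and PDTrueTypeFont CAAAAA+Roboto-Black. The third, test-pdf-3.pdf, contains the 2 pages from the first two documents and was created with a text editor, not with PDFBox.
Then I added the following test code: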
Class<?> clazz = Test.class;
String src1 = "/test-pdf-1.pdf";
String src2 = "/test-pdf-2.pdf";
String src3 = "/test-pdf-3.pdf";

// Merge the first two documents into a file in the temp directory
String merged;
try (InputStream is1 = clazz.getResourceAsStream(src1);
     InputStream is2 = clazz.getResourceAsStream(src2)) {
    merged = mergePdfs(is1, is2);
}

// Load the three sources and the merged result, then list the fonts of each
try (PDDocument doc1 = PDDocument.load(clazz.getResourceAsStream(src1));
     PDDocument doc2 = PDDocument.load(clazz.getResourceAsStream(src2));
     PDDocument doc3 = PDDocument.load(clazz.getResourceAsStream(src3));
     PDDocument doc4 = PDDocument.load(new File(merged))) {
    System.out.println(src1 + " >\n\t" + getFontNames(doc1));
    System.out.println(src2 + " >\n\t" + getFontNames(doc2));
    System.out.println(src3 + " >\n\t" + getFontNames(doc3));
    System.out.println(merged + " >\n\t" + getFontNames(doc4));
}
The output is as follows (I truncated the last file name to make it easier to read and compare):
/test-pdf-1.pdf >
    [PDTrueTypeFont BAAAAA+ArialMT, PDTrueTypeFont CAAAAA+Roboto-Black]
/test-pdf-2.pdf >
    [PDTrueTypeFont BAAAAA+ArialMT, PDTrueTypeFont CAAAAA+Roboto-Black]
/test-pdf-3.pdf >
    [PDTrueTypeFont BAAAAA+ArialMT, PDTrueTypeFont CAAAAA+Roboto-Black, PDTrueTypeFont BAAAAA+ArialMT, PDTrueTypeFont CAAAAA+Roboto-Black]
C:\Temp\..9.pdf >
    [PDTrueTypeFont BAAAAA+ArialMT, PDTrueTypeFont CAAAAA+Roboto-Black, PDTrueTypeFont BAAAAA+ArialMT, PDTrueTypeFont CAAAAA+Roboto-Black]
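You can see that the file created by the PDFBox merge, "C:\temp\7193671804393899.pdf" (abbreviated in the output for readability), and the file created with an editor, "test-pdf-3.pdf", produce the same font output: each font is shown twice, once per page.
Opening the merged file in Acrobat Reader confirms that only one copy of each font exists: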
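To make the "once per page" pattern explicit, the per-page listing can be collapsed into distinct names with a count. This is a minimal sketch in plain Java with no PDFBox dependency. The class `FontUsageCount` and its method are hypothetical, and the input mimics the two pages of the test documents above.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper, not part of PDFBox: collapses per-page font listings.
class FontUsageCount {

    // Counts how often each font name is listed across all pages,
    // keeping the order in which the names were first seen.
    static Map<String, Integer> countPerName(List<List<String>> fontsPerPage) {
        Map<String, Integer> counts = new LinkedHashMap<>();
        for (List<String> page : fontsPerPage) {
            for (String name : page) {
                counts.merge(name, 1, Integer::sum);
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        // Both pages reference the same two fonts, as in the test documents.
        List<String> page = Arrays.asList("BAAAAA+ArialMT", "CAAAAA+Roboto-Black");
        System.out.println(countPerName(Arrays.asList(page, page)));
        // prints {BAAAAA+ArialMT=2, CAAAAA+Roboto-Black=2}
    }
}
```

A count above one here only means that the name is listed on more than one page. The sketch compares names, so it says nothing about whether the pages share one font object or each reference their own.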