Converting doc to pdf using Apache POI

Question

5 java pdf pdf-generation doc apache-poi

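I am trying to convert a doc to pdf using Apache POI, but the resulting pdf contains only text; none of the formatting, such as images, tables and alignment, survives.

How can I convert a doc to pdf and keep all the formatting, such as tables, images and alignment?

Here is my code: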

import java.io.File;
import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.io.InputStream;
import java.io.OutputStream;

import com.lowagie.text.Document;
import com.lowagie.text.Paragraph;
import com.lowagie.text.pdf.PdfWriter;

import org.apache.poi.hwpf.HWPFDocument;
import org.apache.poi.hwpf.extractor.WordExtractor;

public class Demo {
    public static void main(String[] args) {
        Document document = new Document();

        // try-with-resources closes both files even if the conversion fails
        try (InputStream in = new FileInputStream("Resume.doc");
             OutputStream out = new FileOutputStream(new File("test.pdf"))) {
            System.out.println("Starting the test");

            HWPFDocument doc = new HWPFDocument(in);
            WordExtractor we = new WordExtractor(doc);

            PdfWriter.getInstance(document, out);
            document.open();

            String[] paragraphs = we.getParagraphText();
            for (int i = 0; i < paragraphs.length; i++) {
                // strip line terminators (optional CR followed by LF)
                paragraphs[i] = paragraphs[i].replaceAll("\\cM?\r?\n", "");
                System.out.println("Length:" + paragraphs[i].length());
                System.out.println("Paragraph" + i + ": " + paragraphs[i]);
                // add the paragraph to the document
                document.add(new Paragraph(paragraphs[i]));
            }

            // close the document while the output stream is still open
            document.close();
            System.out.println("Document testing completed");
        } catch (Exception e) {
            System.out.println("Exception during test");
            e.printStackTrace();
        }
    }
}

Answer 1

mkl*_*mkl 8

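The task at hand is to convert doc to pdf with all the formatting, such as tables, images and alignment.

Create your own converter class

Apache POI already contains Word-to-XXX converter classes, namely WordToFoConverter, WordToHtmlConverter and WordToTextConverter. The last one is most likely too lossy to serve as a model for your requirements, but the first two are adequate.

All of these converter classes derive from the common base class AbstractWordConverter, which provides the basic framework for Word conversion classes. In addition, each of them uses a matching *DocumentFacade class that wraps the creation of the concrete target (or some intermediate) format: FoDocumentFacade, HtmlDocumentFacade or TextDocumentFacade.

So, to convert doc to pdf with all the formatting, you should likewise derive a converter class from AbstractWordConverter and, when implementing its abstract methods, take inspiration from the three concrete implementations. As in the other converter classes, it seems a good idea to concentrate the code specific to your PDF library in a PdfDocumentFacade class.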
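To make the facade idea concrete, here is a minimal sketch of what such a class might look like. Everything in it is hypothetical: `PdfDocumentFacade`, its method names and `RecordingPdfFacade` are illustrative only and are not part of Apache POI or of any PDF library. The stand-in implementation just records calls; a real one would translate each call into calls on the PDF library you choose.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical facade: the only place that would know about the PDF library. */
interface PdfDocumentFacade {
    void addParagraph(String text, String alignment);
    void beginTable(int columns);
    void addTableCell(String text);
    void endTable();
    void addImage(byte[] imageData);
}

/** Stand-in implementation that records calls instead of producing a PDF. */
class RecordingPdfFacade implements PdfDocumentFacade {
    final List<String> calls = new ArrayList<>();

    @Override
    public void addParagraph(String text, String alignment) {
        calls.add("paragraph[" + alignment + "]: " + text);
    }

    @Override
    public void beginTable(int columns) {
        calls.add("table start, columns=" + columns);
    }

    @Override
    public void addTableCell(String text) {
        calls.add("cell: " + text);
    }

    @Override
    public void endTable() {
        calls.add("table end");
    }

    @Override
    public void addImage(byte[] imageData) {
        calls.add("image, bytes=" + imageData.length);
    }
}
```

A converter written against this interface never touches PDF classes directly, so swapping the PDF library later would only mean writing another implementation of the facade.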

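If you want to start simple and add more complicated details later, you could begin with much of the WordToTextConverter implementation code and, once that works at least at proof-of-concept level, extend it to cover more and more formatting information.

Unfortunately, this converter framework is somewhat DOM-element-centric: the AbstractWordConverter callbacks expect and forward DOM elements as indicators of the current target-document context. At first glance the base class does not seem to use that context as DOM elements itself, so you might copy the base class and replace the DOM element parameters with a more apt type or, better still, a generic class parameter.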
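The generic-parameter idea can be sketched as follows. This is not the actual AbstractWordConverter API: `GenericWordConverterSketch`, `PlainTextConverterSketch` and the two simplified callbacks are hypothetical, and the real base class has many more callbacks with richer signatures. The sketch only shows the context type becoming a type parameter `C` instead of a DOM element.

```java
import java.util.List;

/**
 * Hypothetical, heavily simplified copy of a converter base class in which
 * the DOM element context is replaced by a generic type parameter C.
 */
abstract class GenericWordConverterSketch<C> {
    /** Opens a paragraph under the parent context and returns its context. */
    protected abstract C startParagraph(C parent);

    /** Emits a run of text into the given context. */
    protected abstract void outputCharacters(C context, String text);

    /** Drives the callbacks; a real converter would walk the Word document here. */
    public void processParagraphs(C root, List<String> paragraphTexts) {
        for (String text : paragraphTexts) {
            C paragraph = startParagraph(root);
            outputCharacters(paragraph, text);
        }
    }
}

/** Example subclass whose context is a StringBuilder rather than a DOM element. */
class PlainTextConverterSketch extends GenericWordConverterSketch<StringBuilder> {
    @Override
    protected StringBuilder startParagraph(StringBuilder parent) {
        // separate paragraphs with a newline, but do not start with one
        if (parent.length() > 0) {
            parent.append('\n');
        }
        return parent;
    }

    @Override
    protected void outputCharacters(StringBuilder context, String text) {
        context.append(text);
    }
}
```

A PDF-producing subclass would pick whatever context type its PDF library uses for containers (a table cell, a paragraph object and so on), with no DOM involved.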

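Combine an existing Word-to-XXX converter with an existing XXX-to-PDF converter

If this seems too complicated or time-consuming for your resources, you can try a different approach: use the output of one of the existing converters mentioned above as the input to another conversion to PDF.

Using the existing converter classes will produce results sooner, but a multi-step conversion is often more lossy than a single-step one. The decision is yours.

In the code you posted in your question, you used iText classes. iText does support conversion from HTML to PDF, with certain restrictions, using the XMLWorker provided in the iText XML Worker sub-project. Ancient iText versions also had the HTMLWorker class, which is now deprecated. So WordToHtmlConverter in combination with iText XMLWorker may be an option for you.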
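The glue between the two steps is plain JDK code: WordToHtmlConverter hands back an org.w3c.dom.Document from getDocument(), while XMLWorker reads well-formed XHTML from a stream. The sketch below covers only that serialization step. `DomToXhtml` and `sampleDocument()` are hypothetical names, and the sample DOM stands in for real converter output. The POI and iText calls named in the comments are given from memory and should be checked against the versions you use; note also that XML Worker belongs to iText 5 (com.itextpdf), not to the com.lowagie classes in the question.

```java
import java.io.ByteArrayOutputStream;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Document;
import org.w3c.dom.Element;

class DomToXhtml {
    /**
     * Serializes a DOM tree as well-formed XML so that an XHTML parser can read it.
     * Assumed wiring (verify against your library versions):
     *   input  = the Document returned by WordToHtmlConverter.getDocument()
     *   output = wrapped in a ByteArrayInputStream and passed to
     *            XMLWorkerHelper.getInstance().parseXHtml(writer, document, stream)
     */
    static byte[] serialize(Document html) throws Exception {
        Transformer serializer = TransformerFactory.newInstance().newTransformer();
        // "xml" rather than "html" keeps empty tags closed, which XHTML parsing needs
        serializer.setOutputProperty(OutputKeys.METHOD, "xml");
        serializer.setOutputProperty(OutputKeys.ENCODING, "UTF-8");
        serializer.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        serializer.transform(new DOMSource(html), new StreamResult(out));
        return out.toByteArray();
    }

    /** Hand-built stand-in for the DOM that the Word-to-HTML step would produce. */
    static Document sampleDocument() throws Exception {
        Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder().newDocument();
        Element html = doc.createElement("html");
        Element body = doc.createElement("body");
        Element p = doc.createElement("p");
        p.setTextContent("Hello");
        body.appendChild(p);
        html.appendChild(body);
        doc.appendChild(html);
        return doc;
    }
}
```

Keeping the intermediate XHTML as bytes also makes it easy to dump it to a file and inspect what formatting survived the first step before blaming the PDF step.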

Apache also offers XSL FO processing to PDF. Applying it to the output of WordToFoConverter might be an option as well.

Asked: 12 years, 7 months ago
Viewed: 26965 times
Last active: 10 years ago