How to convert HTML to PDF using iTextSharp

Chr*_*aas 66 c# pdf-generation itextsharp xmlworker

I want to convert the following HTML to PDF using iTextSharp, but I don't know where to start:

<style>
.headline{font-size:200%}
</style>
<p>
  This <em>is </em>
  <span class="headline" style="text-decoration: underline;">some</span>
  <strong>sample<em> text</em></strong>
  <span style="color: red;">!!!</span>
</p>

Chr*_*aas 144

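First, HTML and PDF are not related, although they were created around the same time. HTML is intended to convey higher-level information such as paragraphs and tables. Although there are ways to control it, it is ultimately up to the browser to draw these higher-level concepts. PDF is intended to convey documents, and those documents must "look" the same wherever they are rendered.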

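In an HTML document you might have a paragraph that is 100% wide. Depending on the width of your monitor it might take 2 lines or 10 lines, when you print it it might be 7 lines, and when you look at it on your phone it might take 20 lines. A PDF file, however, must be independent of the rendering device, so regardless of your screen size it must always render exactly the same.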

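Because of the "musts" above, PDF doesn't support abstract things like "tables" or "paragraphs". PDF supports three basic kinds of content: text, lines/shapes, and images. (There are other things like annotations and movies, but I'm trying to keep it simple here.) In a PDF you don't say "here's a paragraph, browser, do your thing!". Instead you say, "draw this text at this exact X,Y location using this exact font, and don't worry, I've previously calculated the width of the text so I know that it will all fit on this line". You also don't say "here's a table"; instead you say "draw this text at this exact location and then draw a rectangle at this other exact location that I've previously calculated, so I know it will appear to be around the text".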
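To make the "draw this text at this exact X,Y location" idea concrete, here is a minimal sketch using the low-level PdfContentByte API of iTextSharp 5. The coordinates, font, and text are arbitrary illustration values, and the class and method names (AbsolutePositionSketch, Build) are hypothetical, not part of the library.

```csharp
using System.IO;
using iTextSharp.text;
using iTextSharp.text.pdf;

public static class AbsolutePositionSketch
{
    //Hypothetical helper: returns a one-page PDF containing one boxed line of text
    public static byte[] Build()
    {
        using (var ms = new MemoryStream())
        {
            var doc = new Document(PageSize.A4);
            var writer = PdfWriter.GetInstance(doc, ms);
            doc.Open();

            //One of the standard PDF fonts, referenced but not embedded
            var font = BaseFont.CreateFont(BaseFont.HELVETICA, BaseFont.CP1252, BaseFont.NOT_EMBEDDED);
            const float size = 12f;
            const float x = 72f;   //points from the left edge of the page
            const float y = 720f;  //points from the bottom edge of the page
            var text = "This text sits at an exact position";

            //We measure the text ourselves; nothing in the PDF will wrap it for us
            float width = font.GetWidthPoint(text, size);

            //Draw the text at the exact position
            PdfContentByte cb = writer.DirectContent;
            cb.BeginText();
            cb.SetFontAndSize(font, size);
            cb.ShowTextAligned(PdfContentByte.ALIGN_LEFT, text, x, y, 0);
            cb.EndText();

            //A "table cell" is just a rectangle drawn where we calculated the text to be
            cb.Rectangle(x - 4, y - 4, width + 8, size + 8);
            cb.Stroke();

            //Closing the document also closes the stream; ToArray() still works afterwards
            doc.Close();
            return ms.ToArray();
        }
    }
}
```

Nothing in this sketch knows about paragraphs or tables. The text and the rectangle only line up because the width was measured before drawing, which is the bookkeeping an HTML parser has to do for you.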

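Second, iText and iTextSharp parse HTML and CSS. That's it. ASP.Net, MVC, Razor, Struts, Spring, etc. are all HTML frameworks, but iText/iTextSharp is 100% unaware of them. The same goes for DataGridViews, Repeaters, Templates, Views, etc., which are all framework-specific abstractions. It is your responsibility to get the HTML out of your framework of choice; iText won't help you. If you get an exception saying The document has no pages, or you think that "iText isn't parsing my HTML", it is almost certain that you don't actually have HTML, you only think you do.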

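Third, the built-in class that has been around for years is HTMLWorker, but it has been replaced by XMLWorker (Java / .NET). Zero work is being done on HTMLWorker, which doesn't support CSS files, has only limited support for the most basic CSS properties, and actually breaks on certain tags. If you do not see an HTML attribute or a CSS property and value in this file, then it probably isn't supported by HTMLWorker. XMLWorker can sometimes be more complicated, but those complications also make it more extensible.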

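Below is C# code that shows how to parse HTML tags into iText abstractions that get automatically added to the document you are working on. C# and Java are very similar, so it should be relatively easy to convert this. Example #1 uses the built-in HTMLWorker to parse the HTML string. Since only inline styles are supported, the class="headline" gets ignored, but everything else should work. Example #2 is the same as the first, except that it uses XMLWorker. Example #3 also parses the simple CSS example.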

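//Namespaces this snippet relies on (XMLWorker ships in the separate itextsharp.xmlworker NuGet package):
//using System; using System.IO; using iTextSharp.text; using iTextSharp.text.pdf;

//Create a byte array that will eventually hold our final PDF
Byte[] bytes;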

//Boilerplate iTextSharp setup here
//Create a stream that we can write to, in this case a MemoryStream
using (var ms = new MemoryStream()) {

    //Create an iTextSharp Document which is an abstraction of a PDF but **NOT** a PDF
    using (var doc = new Document()) {

        //Create a writer that's bound to our PDF abstraction and our stream
        using (var writer = PdfWriter.GetInstance(doc, ms)) {

            //Open the document for writing
            doc.Open();

            //Our sample HTML and CSS
            var example_html = @"<p>This <em>is </em><span class=""headline"" style=""text-decoration: underline;"">some</span> <strong>sample <em> text</em></strong><span style=""color: red;"">!!!</span></p>";
            var example_css = @".headline{font-size:200%}";

            /**************************************************
             * Example #1                                     *
             *                                                *
             * Use the built-in HTMLWorker to parse the HTML. *
             * Only inline CSS is supported.                  *
             * ************************************************/

            //Create a new HTMLWorker bound to our document
            using (var htmlWorker = new iTextSharp.text.html.simpleparser.HTMLWorker(doc)) {

                //HTMLWorker doesn't read a string directly but instead needs a TextReader (which StringReader subclasses)
                using (var sr = new StringReader(example_html)) {

                    //Parse the HTML
                    htmlWorker.Parse(sr);
                }
            }

            /**************************************************
             * Example #2                                     *
             *                                                *
             * Use the XMLWorker to parse the HTML.           *
             * Only inline CSS and absolutely linked          *
             * CSS is supported                               *
             * ************************************************/

            //XMLWorker also reads from a TextReader and not directly from a string
            using (var srHtml = new StringReader(example_html)) {

                //Parse the HTML
                iTextSharp.tool.xml.XMLWorkerHelper.GetInstance().ParseXHtml(writer, doc, srHtml);
            }

            /**************************************************
             * Example #3                                     *
             *                                                *
             * Use the XMLWorker to parse HTML and CSS        *
             * ************************************************/

            //In order to read CSS as a string we need to switch to a different constructor
            //that takes Streams instead of TextReaders.
            //Below we convert the strings into UTF8 byte array and wrap those in MemoryStreams
            using (var msCss = new MemoryStream(System.Text.Encoding.UTF8.GetBytes(example_css))) {
                using (var msHtml = new MemoryStream(System.Text.Encoding.UTF8.GetBytes(example_html))) {

                    //Parse the HTML
                    iTextSharp.tool.xml.XMLWorkerHelper.GetInstance().ParseXHtml(writer, doc, msHtml, msCss);
                }
            }


            doc.Close();
        }
    }

    //After all of the PDF "stuff" above is done and closed but **before** we
    //close the MemoryStream, grab all of the active bytes from the stream
    bytes = ms.ToArray();
}

//Now we just need to do something with those bytes.
//Here I'm writing them to disk but if you were in ASP.Net you might Response.BinaryWrite() them.
//You could also write the bytes to a database in a varbinary() column (but please don't) or you
//could pass them to another function for further PDF processing.
var testFile = Path.Combine(Environment.GetFolderPath(Environment.SpecialFolder.Desktop), "test.pdf");
System.IO.File.WriteAllBytes(testFile, bytes);

2017 update

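There is good news for HTML-to-PDF needs. As this answer showed, the W3C standard css-break-3 will solve the problem... It is a Candidate Recommendation, with plans to become a final Recommendation this year, after testing.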

In the meantime there are less standard solutions, with plugins for C#, as shown by print-css.rocks.

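  • If anyone is looking for a fix for iTextSharp.tool, you have to run the NuGet command: Install-Package itextsharp.xmlworker (8 upvotes)
  • Very nice example. Thanks. (4 upvotes)
  • The `iTextSharp.tool` namespace gives me a "does not exist" error, and I also get `iTextSharp.text.html.simpleparser.HTMLWorker(doc)) is obsolete` with version (5.5.8.0) (4 upvotes)
  • The line with "iTextSharp.text.html.simpleparser.HTMLWorker(doc))" says "htmlWorker" is obsolete as of 5.5.10. What should be changed? (3 upvotes)
  • The code declares a "new Document()" and the comment says the Document type is an "iTextSharp Document". The reference should be fully qualified as "iTextSharp.text.Document()". The project where I am using iTextSharp already had a Document class, and I had to dig into the iTextSharp namespaces to correct the reference. (2 upvotes)
  • This code simply fails with a "Cannot access a closed Stream." error. I am using itextsharp 5.5.10 and itextsharp.xmlworker 5.5.10 (2 upvotes)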

Sam*_*Sam 9

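@Chris Haas has explained very well how to use iTextSharp to convert HTML to PDF, which was very helpful.
My addition is:
By using HtmlTextWriter and putting the HTML tags inside an HTML table with inline CSS, I got the PDF I wanted without using XMLWorker.
Edit: adding sample code:
ASPX page: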

<asp:Panel runat="server" ID="PendingOrdersPanel">
 <!-- to be shown on PDF-->
 <table style="border-spacing: 0;border-collapse: collapse;width:100%;display:none;" >
 <tr><td><img src="abc.com/webimages/logo1.png" style="display: none;" width="230" /></td></tr>
<tr style="line-height:10px;height:10px;"><td style="display:none;font-size:9px;color:#10466E;padding:0px;text-align:right;">blablabla.</td></tr>
 <tr style="line-height:10px;height:10px;"><td style="display:none;font-size:9px;color:#10466E;padding:0px;text-align:right;">blablabla.</td></tr>
 <tr style="line-height:10px;height:10px;"><td style="display:none;font-size:9px;color:#10466E;padding:0px;text-align:right;">blablabla</td></tr>
<tr style="line-height:10px;height:10px;"><td style="display:none;font-size:9px;color:#10466E;padding:0px;text-align:right;">blablabla</td></tr>
<tr style="line-height:10px;height:10px;"><td style="display:none;font-size:11px;color:#10466E;padding:0px;text-align:center;"><i>blablabla</i> Pending orders report<br /></td></tr>
 </table>
<asp:GridView runat="server" ID="PendingOrdersGV" RowStyle-Wrap="false" AllowPaging="true" PageSize="10" Width="100%" CssClass="Grid" AlternatingRowStyle-CssClass="alt" AutoGenerateColumns="false"
   PagerStyle-CssClass="pgr" HeaderStyle-ForeColor="White" PagerStyle-HorizontalAlign="Center" HeaderStyle-HorizontalAlign="Center" RowStyle-HorizontalAlign="Center" DataKeyNames="Document#" 
      OnPageIndexChanging="PendingOrdersGV_PageIndexChanging" OnRowDataBound="PendingOrdersGV_RowDataBound" OnRowCommand="PendingOrdersGV_RowCommand">
   <EmptyDataTemplate><div style="text-align:center;">no records found</div></EmptyDataTemplate>
    <Columns>                                           
     <asp:ButtonField CommandName="PendingOrders_Details" DataTextField="Document#" HeaderText="Document #" SortExpression="Document#" ItemStyle-ForeColor="Black" ItemStyle-Font-Underline="true"/>
      <asp:BoundField DataField="Order#" HeaderText="order #" SortExpression="Order#"/>
     <asp:BoundField DataField="Order Date" HeaderText="Order Date" SortExpression="Order Date" DataFormatString="{0:d}"></asp:BoundField> 
    <asp:BoundField DataField="Status" HeaderText="Status" SortExpression="Status"></asp:BoundField>
    <asp:BoundField DataField="Amount" HeaderText="Amount" SortExpression="Amount" DataFormatString="{0:C2}"></asp:BoundField> 
   </Columns>
    </asp:GridView>
</asp:Panel>

C# code:

protected void PendingOrdersPDF_Click(object sender, EventArgs e)
{
    if (PendingOrdersGV.Rows.Count > 0)
    {
        //to allow paging=false & change style.
        PendingOrdersGV.HeaderStyle.ForeColor = System.Drawing.Color.Black;
        PendingOrdersGV.BorderColor = System.Drawing.Color.Gray;
        PendingOrdersGV.Font.Name = "Tahoma";
        PendingOrdersGV.DataSource = clsBP.get_PendingOrders(lbl_BP_Id.Text);
        PendingOrdersGV.AllowPaging = false;
        PendingOrdersGV.Columns[0].Visible = false; //export won't work if there's a link in the gridview
        PendingOrdersGV.DataBind();

        //to PDF code --Sam
        string attachment = "attachment; filename=report.pdf";
        Response.ClearContent();
        Response.AddHeader("content-disposition", attachment);
        Response.ContentType = "application/pdf";
        StringWriter stw = new StringWriter();
        HtmlTextWriter htextw = new HtmlTextWriter(stw);
        htextw.AddStyleAttribute("font-size", "8pt");
        htextw.AddStyleAttribute("color", "Grey");

        PendingOrdersPanel.RenderControl(htextw); //Name of the Panel
        //Margins are left, right, top, bottom, in points
        Document document = new Document(PageSize.A4, 5, 5, 15, 5);
        PdfWriter.GetInstance(document, Response.OutputStream);
        document.Open();

        //HTMLWorker reads from a TextReader, so wrap the rendered HTML in a StringReader
        StringReader str = new StringReader(stw.ToString());
        HTMLWorker htmlworker = new HTMLWorker(document);
        htmlworker.Parse(str);

        document.Close();
        //The PDF bytes are already in Response.OutputStream; writing the Document object
        //itself would only append its type name to the output, so just end the response
        Response.End();
    }
}

And of course include the iTextSharp references in the .cs file:

using iTextSharp.text;
using iTextSharp.text.pdf;
using iTextSharp.text.html.simpleparser;
using iTextSharp.tool.xml;

Hope this helps!
Thanks


Geo*_*dze 7

As of 2018, there is also iText7 (the next iteration of the old iTextSharp library), and its HTML-to-PDF package is available: itext7.pdfhtml

Usage is straightforward:

HtmlConverter.ConvertToPdf(
    new FileInfo(@"Path\to\Html\File.html"),
    new FileInfo(@"Path\to\Pdf\File.pdf")
);

The method has many more overloads.
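As an illustration of another overload, the sketch below converts an HTML string held in memory to PDF bytes instead of going through files. It assumes the itext7.pdfhtml package and its HtmlConverter.ConvertToPdf(string, Stream) overload; the wrapper class and method names (HtmlStringToPdfSketch, Convert) are hypothetical.

```csharp
using System.IO;
using iText.Html2pdf;

public static class HtmlStringToPdfSketch
{
    //Hypothetical helper: converts an HTML string to PDF and returns the bytes
    public static byte[] Convert(string html)
    {
        using (var ms = new MemoryStream())
        {
            //Assumed overload: HTML as a string, PDF written to the given stream
            HtmlConverter.ConvertToPdf(html, ms);

            //MemoryStream.ToArray() is valid even if the converter closed the stream
            return ms.ToArray();
        }
    }
}
```

Called with the markup from the question, for example HtmlStringToPdfSketch.Convert("<p>This <em>is</em> sample text</p>"), this should hand back bytes that can be written to disk or to an HTTP response.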

Update: the iText* family of products has a dual licensing model: free for open source, paid for commercial use.