Getting text paragraphs from a PDF using iTextSharp

Question

Bib*_*tam 6 c# asp.net itextsharp pdf-parsing

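Is there any logic for extracting paragraph text from a PDF file using iTextSharp? I know that PDF only supports runs of text, that it is hard to determine which text runs belong to which paragraph, and that there is no <p> tag or any other tag that marks paragraphs in a PDF. I tried to get the coordinates of the text runs so that I could build paragraphs from them, but had no luck :(. My code snippet is here: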

    // Text accumulated for the line currently being read
    private StringBuilder result = new StringBuilder();
    private Vector lastBaseLine;
    // Text of each completed line (one entry per baseline)
    public List<string> strings = new List<string>();
    // Y coordinate of the baseline for each entry in strings
    public List<float> baselines = new List<float>();

    public void RenderText(iTextSharp.text.pdf.parser.TextRenderInfo renderInfo)
    {
        Vector curBaseline = renderInfo.GetBaseline().GetStartPoint();
        // A change in the Y coordinate means a new line has started
        if ((this.lastBaseLine != null) && (curBaseline[Vector.I2] != lastBaseLine[Vector.I2]))
        {
            // Store the finished line together with its baseline
            if (this.result.Length > 0)
            {
                this.baselines.Add(this.lastBaseLine[Vector.I2]);
                this.strings.Add(this.result.ToString());
            }
            result = new StringBuilder();
        }
        this.result.Append(renderInfo.GetText());
        this.lastBaseLine = curBaseline;
        // Note: the last line stays in result and is never added to strings here
    }


Does anybody have any logic related to this problem?

Answer 1

Vin*_*n M 1

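// Requires: using System.IO; using iTextSharp.text; using iTextSharp.text.pdf;
using (MemoryStream ms = new MemoryStream())
{
   Document document = new Document(PageSize.A4, 25, 25, 30, 30);
   PdfWriter writer = PdfWriter.GetInstance(document, ms);
   document.Open();
   document.Add(new Paragraph("Hello World"));
   document.Close();
   writer.Close();
   // Send the generated PDF to the browser as a download
   Response.ContentType = "application/pdf";
   Response.AddHeader("content-disposition",
   "attachment; filename=\"First PDF document.pdf\"");
   // ToArray() returns only the bytes written; GetBuffer() may include unused trailing bytes
   byte[] pdfBytes = ms.ToArray();
   Response.OutputStream.Write(pdfBytes, 0, pdfBytes.Length);
}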


Here are some examples that may help you...

This may not be exactly what you want, but it may help you.
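Closer to the question itself, one possible way to turn the `strings` and `baselines` lists collected in the question into paragraphs is to compare the vertical gap between consecutive lines with the typical line spacing. The following is only a sketch under stated assumptions: `ParagraphGrouper`, `Group`, and `gapFactor` are hypothetical names rather than part of iTextSharp, and the lines are assumed to arrive in reading order with the Y coordinate decreasing down the page. It is a heuristic and will not work for every layout (multiple columns, tables, and so on).

```csharp
using System;
using System.Collections.Generic;
using System.Text;

// Hypothetical helper, not part of iTextSharp.
public static class ParagraphGrouper
{
    // lines[i] is the text of one line, baselines[i] its baseline Y coordinate.
    // A gap larger than gapFactor times the typical line spacing starts a new paragraph.
    public static List<string> Group(IList<string> lines, IList<float> baselines,
                                     float gapFactor = 1.5f)
    {
        if (lines.Count != baselines.Count)
            throw new ArgumentException("lines and baselines must have the same length");

        var paragraphs = new List<string>();
        if (lines.Count == 0) return paragraphs;

        // Collect the positive gaps between consecutive baselines.
        var gaps = new List<float>();
        for (int i = 1; i < baselines.Count; i++)
        {
            float gap = baselines[i - 1] - baselines[i];
            if (gap > 0) gaps.Add(gap);
        }

        // Assumption: the lower median gap approximates the normal line spacing.
        float typical = 0f;
        if (gaps.Count > 0)
        {
            gaps.Sort();
            typical = gaps[(gaps.Count - 1) / 2];
        }

        var current = new StringBuilder(lines[0].Trim());
        for (int i = 1; i < lines.Count; i++)
        {
            float gap = baselines[i - 1] - baselines[i];
            // A non-positive gap means the text jumped upwards (new column or page).
            bool newParagraph = gap <= 0 || gap > typical * gapFactor;
            if (newParagraph)
            {
                paragraphs.Add(current.ToString());
                current = new StringBuilder(lines[i].Trim());
            }
            else
            {
                current.Append(' ').Append(lines[i].Trim());
            }
        }
        paragraphs.Add(current.ToString());
        return paragraphs;
    }
}
```

It could be called as `ParagraphGrouper.Group(strings, baselines)` once the listener has finished and the last pending line has been added to both lists. Comparing against a typical gap rather than a fixed number of points keeps the heuristic independent of font size, but `gapFactor` will probably need tuning per document.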

Archived:	12 years, 8 months ago
Views:	2216
Last activity:	12 years, 6 months ago