I'm trying to read a table out of a PDF document, but I've run into a problem.
When I open the PDF normally, it displays as:
item[tab]item[tab]item[tab]item[tab]item
item[tab]item[tab]item[tab]item[tab]item
item[tab]item[tab]item[tab]item[tab]item
I convert the PDF using the following method:
StringBuilder result = new StringBuilder();
PdfDocument pdfDoc = new PdfDocument(new PdfReader(SRC));
LocationTextExtractionStrategy strategy = new LocationTextExtractionStrategy();
PdfCanvasProcessor parser = new PdfCanvasProcessor(strategy);
for (int i = 1; i <= pdfDoc.GetNumberOfPages(); i++)
{
    result.AppendLine("INFO_START_PAGE");
    string output = PdfTextExtractor.GetTextFromPage(pdfDoc.GetPage(i));
    /* Note: in GetTextFromPage I replaced the method so that it outputs [tab]
       instead of a regular space for big gaps. */
    foreach (string data in output.Replace("\r\n", "\n").Replace("\n", "×").Split('×'))
    {
        result.AppendLine(data.Trim().Replace(" ", "[tab]"));
    }
    result.AppendLine("INFO_END_PAGE");
} …
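To check the per-line step of that loop on its own, here is a minimal sketch that needs no iText. `LineMarker` and `MarkGaps` are hypothetical names, and the sample string is made up to stand in for the text of one page. It mirrors the loop body as written (trim, then replace each space with the literal `[tab]`), except that it splits on `'\n'` directly instead of going through the `×` placeholder, which should be equivalent as long as the page text contains no `×` itself.

```csharp
using System;
using System.Collections.Generic;

public static class LineMarker
{
    // Hypothetical helper: same per-line processing as the loop body, without iText.
    public static List<string> MarkGaps(string pageText)
    {
        var lines = new List<string>();
        // Normalize Windows line endings, then split into one entry per text line.
        foreach (string line in pageText.Replace("\r\n", "\n").Split('\n'))
        {
            // Trim the line and replace each remaining space with the literal marker.
            lines.Add(line.Trim().Replace(" ", "[tab]"));
        }
        return lines;
    }

    public static void Main()
    {
        // Made-up sample standing in for the extracted text of one page.
        string sample = "item item item\r\nitem item item";
        foreach (string line in MarkGaps(sample))
        {
            Console.WriteLine(line);
        }
    }
}
```

Note that a cell whose own text contains a space would be split by this step as well, since every space is replaced, not only the wide gaps.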