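I am trying to read a table out of a PDF document, but I have run into a problem.
If I open the PDF in a regular viewer, the table is displayed as: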
item[tab]item[tab]item[tab]item[tab]item
item[tab]item[tab]item[tab]item[tab]item
item[tab]item[tab]item[tab]item[tab]item
I convert the PDF using the following method:
StringBuilder result = new StringBuilder();
PdfDocument pdfDoc = new PdfDocument(new PdfReader(SRC));
LocationTextExtractionStrategy strategy = new LocationTextExtractionStrategy();
PdfCanvasProcessor parser = new PdfCanvasProcessor(strategy);
for (int i = 1; i <= pdfDoc.GetNumberOfPages(); i++)
{
    result.AppendLine("INFO_START_PAGE");
    string output = PdfTextExtractor.GetTextFromPage(pdfDoc.GetPage(i));
    /* Note, in the GetTextFromPage i replaced the method to output [tab] instead of a regular space on
       big spaces */
    foreach (string data in output.Replace("\r\n", "\n").Replace("\n", "×").Split('×'))
    {
        result.AppendLine(data.Trim().Replace(" ", "[tab]"));
    }
    result.AppendLine("INFO_END_PAGE");
}
pdfDoc.Close();
return result.ToString();
In some cases, when I read this table with the PDF-to-text conversion, it comes out as:
item[tab]item[tab]item[tab]item[tab]item
item[tab]item[tab]item[tab]
item[tab]item
item[tab]item[tab]item[tab]item[tab]item
Is there a way to fix this?
The table in your example file is extracted as
Artikelnr. Omschrijving Aantal
Per stuk Kosten
VERHUUR L. GELEVERDE ARBEID PDC 8 € 43,70 € 349,60
VERHUUR O. GELEVERDE ARBEID PDC 3 € 60,95 € 182,85
VERHUUR L.L. GELEVERDE ARBEID EM 24
€ 32,20 € 772,80
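As conjectured in the comments to the question, there is indeed a small vertical step: in every row the first three columns are set at the same vertical position, while the last two columns are set at a slightly different one: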
Row          First columns y  Last columns y
Heading row  536              535.893
First row    516              516.229
Second row   495              495.478
Third row    475              474.788
In particular, the rows broken by text extraction are exactly those in which the integer parts of the y positions differ (536 vs 535, 475 vs 474), while the rows with equal integer parts are not broken.
The reason for this is that the class TextChunkLocationDefaultImp (which by default is used to store text chunk locations and methods to compare such locations) stores the y position of a chunk (actually an abstraction of it also working for text not written horizontally) in an integer variable (private readonly int distPerpendicular) and in the test method SameLine requires equality of the distPerpendicular values.
namespace iText.Kernel.Pdf.Canvas.Parser.Listener {
    internal class TextChunkLocationDefaultImp : ITextChunkLocation {
        ...
        /// <summary>Perpendicular distance to the orientation unit vector (i.e. the Y position in an unrotated coordinate system).
        /// </summary>
        /// <remarks>
        /// Perpendicular distance to the orientation unit vector (i.e. the Y position in an unrotated coordinate system).
        /// We round to the nearest integer to handle the fuzziness of comparing floats.
        /// </remarks>
        private readonly int distPerpendicular;
        ...
        /// <param name="as">the location to compare to</param>
        /// <returns>true is this location is on the the same line as the other</returns>
        public virtual bool SameLine(ITextChunkLocation @as) {
            ...
            float distPerpendicularDiff = DistPerpendicular() - @as.DistPerpendicular();
            if (distPerpendicularDiff == 0) {
                return true;
            }
            ...
        }
        ...
    }
}
(Actually SameLine further down allows a small deviation if one of the compared text chunks has a zero length. Apparently chunks with zero length sometimes are used for diacritical marks, and such marks sometimes are applied at different heights. This is of no concern in your example file, though.)
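The effect of the integer storage can be reproduced without iText. The following is a minimal sketch, not iText code: `LineKey` and `SameLineStrict` are hypothetical helpers, and the sketch assumes unrotated text, for which the perpendicular distance works out to the negated y coordinate and the `(int)` cast drops the fractional part.

```csharp
using System;

public static class TruncationDemo
{
    // Hypothetical stand-in for the integer distPerpendicular value.
    // Assumption: unrotated text, so the distance is -y; the cast to int
    // truncates toward zero instead of rounding.
    public static int LineKey(float y)
    {
        return (int)(-y);
    }

    // Strict test modelled on the default SameLine: the keys must be equal.
    public static bool SameLineStrict(float y1, float y2)
    {
        return LineKey(y1) == LineKey(y2);
    }

    public static void Main()
    {
        // y positions of the first three and the last two columns, per row.
        float[,] rows =
        {
            { 536f, 535.893f },
            { 516f, 516.229f },
            { 495f, 495.478f },
            { 475f, 474.788f }
        };
        for (int i = 0; i < rows.GetLength(0); i++)
        {
            bool same = SameLineStrict(rows[i, 0], rows[i, 1]);
            Console.WriteLine(rows[i, 0] + " vs " + rows[i, 1] + ": same line = " + same);
        }
    }
}
```

Under these assumptions the check fails for the heading row and the third row and succeeds for the first and second rows, which matches the rows that were broken in the extracted text.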
As we've seen above, the problem is due to the behavior of TextChunkLocationDefaultImp.SameLine. Thus, we have to change this behavior. Usually, though, we don't want to change the code of the iText classes themselves.
Fortunately, the LocationTextExtractionStrategy has a constructor that allows injecting an ITextChunkLocationStrategy implementation, i.e. a factory object for ITextChunkLocation instances.
Thus, for our task we have to write an alternative ITextChunkLocation implementation which is not so strict, and an ITextChunkLocationStrategy implementation that generates instances of our ITextChunkLocation implementation.
Unfortunately, though, TextChunkLocationDefaultImp is internal to iText and has numerous private variables. Thus, we cannot simply derive our implementation from it but have to copy and paste it as a whole and apply our changes to that copy.
Thus,
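using System;
using iText.Kernel.Geom;
using iText.Kernel.Pdf.Canvas.Parser.Data;
using iText.Kernel.Pdf.Canvas.Parser.Listener;

// Factory that makes LocationTextExtractionStrategy create lax location objects.
class LaxTextChunkLocationStrategy : LocationTextExtractionStrategy.ITextChunkLocationStrategy
{
    public LaxTextChunkLocationStrategy()
    {
    }

    public virtual ITextChunkLocation CreateLocation(TextRenderInfo renderInfo, LineSegment baseline)
    {
        return new TextChunkLocationLaxImp(baseline.GetStartPoint(), baseline.GetEndPoint(), renderInfo.GetSingleSpaceWidth());
    }
}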
class TextChunkLocationLaxImp : ITextChunkLocation
{
    private const float DIACRITICAL_MARKS_ALLOWED_VERTICAL_DEVIATION = 2;
    private readonly Vector startLocation;
    private readonly Vector endLocation;
    private readonly Vector orientationVector;
    private readonly int orientationMagnitude;
    private readonly int distPerpendicular;
    private readonly float distParallelStart;
    private readonly float distParallelEnd;
    private readonly float charSpaceWidth;

    public TextChunkLocationLaxImp(Vector startLocation, Vector endLocation, float charSpaceWidth)
    {
        this.startLocation = startLocation;
        this.endLocation = endLocation;
        this.charSpaceWidth = charSpaceWidth;
        Vector oVector = endLocation.Subtract(startLocation);
        if (oVector.Length() == 0)
        {
            oVector = new Vector(1, 0, 0);
        }
        orientationVector = oVector.Normalize();
        orientationMagnitude = (int)(Math.Atan2(orientationVector.Get(Vector.I2), orientationVector.Get(Vector.I1)) * 1000);
        Vector origin = new Vector(0, 0, 1);
        distPerpendicular = (int)(startLocation.Subtract(origin)).Cross(orientationVector).Get(Vector.I3);
        distParallelStart = orientationVector.Dot(startLocation);
        distParallelEnd = orientationVector.Dot(endLocation);
    }

    public virtual int OrientationMagnitude()
    {
        return orientationMagnitude;
    }

    public virtual int DistPerpendicular()
    {
        return distPerpendicular;
    }

    public virtual float DistParallelStart()
    {
        return distParallelStart;
    }

    public virtual float DistParallelEnd()
    {
        return distParallelEnd;
    }

    public virtual Vector GetStartLocation()
    {
        return startLocation;
    }

    public virtual Vector GetEndLocation()
    {
        return endLocation;
    }

    public virtual float GetCharSpaceWidth()
    {
        return charSpaceWidth;
    }

    public virtual bool SameLine(ITextChunkLocation @as)
    {
        if (OrientationMagnitude() != @as.OrientationMagnitude())
        {
            return false;
        }
        int distPerpendicularDiff = DistPerpendicular() - @as.DistPerpendicular();
        if (Math.Abs(distPerpendicularDiff) < 2)
        {
            return true;
        }
        LineSegment mySegment = new LineSegment(startLocation, endLocation);
        LineSegment otherSegment = new LineSegment(@as.GetStartLocation(), @as.GetEndLocation());
        return Math.Abs(distPerpendicularDiff) <= DIACRITICAL_MARKS_ALLOWED_VERTICAL_DEVIATION && (mySegment.GetLength() == 0 || otherSegment.GetLength() == 0);
    }

    public virtual float DistanceFromEndOf(ITextChunkLocation other)
    {
        return DistParallelStart() - other.DistParallelEnd();
    }

    public virtual bool IsAtWordBoundary(ITextChunkLocation previous)
    {
        if (startLocation.Equals(endLocation) || previous.GetEndLocation().Equals(previous.GetStartLocation()))
        {
            return false;
        }
        float dist = DistanceFromEndOf(previous);
        if (dist < 0)
        {
            dist = previous.DistanceFromEndOf(this);
            //The situation when the chunks intersect. We don't need to add space in this case
            if (dist < 0)
            {
                return false;
            }
        }
        return dist > GetCharSpaceWidth() / 2.0f;
    }

    internal static bool ContainsMark(ITextChunkLocation baseLocation, ITextChunkLocation markLocation)
    {
        return baseLocation.GetStartLocation().Get(Vector.I1) <= markLocation.GetStartLocation().Get(Vector.I1) &&
            baseLocation.GetEndLocation().Get(Vector.I1) >= markLocation.GetEndLocation().Get(Vector.I1) &&
            Math.Abs(baseLocation.DistPerpendicular() - markLocation.DistPerpendicular()) <= DIACRITICAL_MARKS_ALLOWED_VERTICAL_DEVIATION;
    }
}
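The only behavioral change compared to the default class is the relaxed comparison in SameLine. As a rough illustration of what that tolerance accepts, here is a small iText-free sketch; `LineKey` and `SameLineLax` are hypothetical helpers, and unrotated text is assumed so that the integer key is the truncated, negated y coordinate.

```csharp
using System;

public static class LaxLineDemo
{
    // Hypothetical stand-in for the integer distPerpendicular value
    // (assumption: unrotated text, distance = -y, truncated by the cast).
    public static int LineKey(float y)
    {
        return (int)(-y);
    }

    // Lax test modelled on TextChunkLocationLaxImp.SameLine:
    // the integer keys may differ by at most one.
    public static bool SameLineLax(float y1, float y2)
    {
        return Math.Abs(LineKey(y1) - LineKey(y2)) < 2;
    }

    public static void Main()
    {
        // The four table rows from the example file are now each one line.
        Console.WriteLine(SameLineLax(536f, 535.893f));
        Console.WriteLine(SameLineLax(516f, 516.229f));
        Console.WriteLine(SameLineLax(495f, 495.478f));
        Console.WriteLine(SameLineLax(475f, 474.788f));
        // Neighbouring table rows (about 20 units apart) stay separate.
        Console.WriteLine(SameLineLax(516f, 495f));
        // Side effect: baselines almost 2 units apart are merged as well.
        Console.WriteLine(SameLineLax(100f, 101.9f));
    }
}
```

The last call shows the trade-off of this design: because the comparison works on truncated integers, two baselines whose y positions differ by just under 2 units can still be treated as one line. For the table at hand, with rows roughly 20 units apart, that should not matter, but very tightly set text might need a smaller tolerance.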
Now to make your code use these classes, replace
string output = PdfTextExtractor.GetTextFromPage(pdfDoc.GetPage(i));
by
LocationTextExtractionStrategy laxStrategy = new LocationTextExtractionStrategy(new LaxTextChunkLocationStrategy());
string output = PdfTextExtractor.GetTextFromPage(pdfDoc.GetPage(i), laxStrategy);
and the text extraction result becomes
Artikelnr. Omschrijving Aantal Per stuk Kosten
VERHUUR L. GELEVERDE ARBEID PDC 8 € 43,70 € 349,60
VERHUUR O. GELEVERDE ARBEID PDC 3 € 60,95 € 182,85
VERHUUR L.L. GELEVERDE ARBEID EM 24 € 32,20 € 772,80
as was desired.
In a comment you asked
May i ask how you exemined the pdf to know the exact locations of the rows?
I inspected the page using iText RUPS:
In the contents of the stream selected in the screen shot I found:
q
...
q
1 0 0 1 60 536 cm
BT
8 0 0 8 0 0 Tm
/F3 1 Tf
(Artikelnr) Tj
8 0 0 8 31.84 0 Tm
(.) Tj
ET
Q
Q
q
...
q
1 0 0 1 147 536 cm
BT
8 0 0 8 0 0 Tm
/F3 1 Tf
(Omschrijving) Tj
ET
Q
Q
q
...
q
1 0 0 1 370 536 cm
BT
8 0 0 8 0 0 Tm
/F3 1 Tf
(Aantal) Tj
ET
Q
Q
q
...
q
1 0 0 1 433.404 535.893 cm
BT
8 0 0 8 0 0 Tm
/F3 1 Tf
(Per stuk) Tj
ET
Q
Q
q
...
q
1 0 0 1 504.878 535.893 cm
BT
8 0 0 8 0 0 Tm
/F3 1 Tf
(Kosten) Tj
ET
Q
Q
Before the first three headings you see
1 0 0 1 XXX 536 cm
while before the last two headings you see
1 0 0 1 XXX 535.893 cm
As the text matrix always is set with 8 0 0 8 XXX 0 Tm to have no translation part along the y axis, the cm instructions above set the coordinate system so that text is drawn at y position 536 or 535.893 respectively.
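The arithmetic behind this can be checked by multiplying the text matrix with the matrix from the cm instruction. The sketch below is plain C# and independent of iText; `Multiply` and `TextOrigin` are hypothetical helper names, and it assumes that the elided parts of the content stream (the `...`) do not change the coordinate system any further.

```csharp
using System;

public static class TextOriginDemo
{
    // A PDF matrix "a b c d e f" stands for the 3x3 matrix
    // [a b 0; c d 0; e f 1]. Returns the product m1 x m2 in the same form.
    public static double[] Multiply(double[] m1, double[] m2)
    {
        return new[]
        {
            m1[0] * m2[0] + m1[1] * m2[2],
            m1[0] * m2[1] + m1[1] * m2[3],
            m1[2] * m2[0] + m1[3] * m2[2],
            m1[2] * m2[1] + m1[3] * m2[3],
            m1[4] * m2[0] + m1[5] * m2[2] + m2[4],
            m1[4] * m2[1] + m1[5] * m2[3] + m2[5]
        };
    }

    // Position of the text origin: the translation part of Tm x cm.
    public static double[] TextOrigin(double[] tm, double[] cm)
    {
        double[] m = Multiply(tm, cm);
        return new[] { m[4], m[5] };
    }

    public static void Main()
    {
        // "Per stuk": 8 0 0 8 0 0 Tm inside 1 0 0 1 433.404 535.893 cm
        double[] perStuk = TextOrigin(
            new double[] { 8, 0, 0, 8, 0, 0 },
            new double[] { 1, 0, 0, 1, 433.404, 535.893 });
        Console.WriteLine(perStuk[0] + " / " + perStuk[1]);

        // The "." after "Artikelnr": 8 0 0 8 31.84 0 Tm inside 1 0 0 1 60 536 cm
        double[] dot = TextOrigin(
            new double[] { 8, 0, 0, 8, 31.84, 0 },
            new double[] { 1, 0, 0, 1, 60, 536 });
        Console.WriteLine(dot[0] + " / " + dot[1]);
    }
}
```

Since the y translation of every Tm here is 0 and the cm matrices are pure translations, the resulting y coordinate is simply the last operand of the cm instruction, i.e. 536 for the first three headings and 535.893 for the last two.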