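When I try to extract the text from the PDF above, I get a mix of text that is invisible in the evince viewer and text that is visible. In addition, some of the desired text is missing characters that are not missing in the viewer, such as the "S" in "FALCONS" and many of the "½" characters. I believe this is caused by interference from the invisible text, because when the PDF is highlighted in the viewer, the invisible text can be seen overlapping the visible text.
Is there a way to remove the invisible text? Or is there another solution?
Code: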
import java.io.File;
import java.io.IOException;

import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.text.PDFTextStripper;

public class App {
    public static String getPdfText(String pdfPath) throws IOException {
        File file = new File(pdfPath);
        PDDocument document = null;
        PDFTextStripper textStripper = null;
        String text = null;
        try {
            document = PDDocument.load(file);
            textStripper = new PDFTextStripper();
            textStripper.setEndPage(1);
            text = textStripper.getText(document);
        } catch (IOException e) {
            throw new IOException("Could not load file and strip text.", e);
        } finally {
            try {
                if (document != null)
                    document.close();
            } catch (IOException e) {
                System.out.println("Could not close document");
            }
        }
        return text;
    }

    public static void main(String[] args) {
        String filename = "RevTeaser09072016.pdf";
        String text = null;
        try {
            text = getPdfText(filename);
        } catch (IOException e) {
            e.printStackTrace();
            System.exit(1);
        }
        System.out.println(text);
    }
}
Output (the bold text is the desired text):
145 143 159 144 160 141 157155 156154150 153149 152148 151147 142 158 500 146 Selections Number of Teams Amount Bet REVERSE tEaSER caRd mark box as shown ? denotes home team PRO FOOTBALL - THURSDAY, NOVEMBER 15, 2012 1 BILLS ? NFL PM8:25 2 DOLPHINS7– ½ 6– ½ PRO FOOTBALL - SUNDAY, NOVEMBER 18, 2012 3 REDSKINS ? PM1:00 4 EAGLES10– ½ 3– ½ 5 PACKERS PM1:00 6 LIONS ?10– ½ 3– ½ 7 FALCONS ? PM1:00 8 CARDINALS17– ½ 3+ ½ 9 BUCCANEERS PM1:00 10 PANTHERS ?7– ½ 6– ½ 11 COWBOYS ? PM1:00 12 BROWNS14– ½ + ½ 13 RAMS ? PM1:00 14 JETS10– ½ 3– ½ 15 PATRIOTS ? PM4:25 16 COLTS17– ½ 3+ ½ 17 TEXANS ? PM1:00 18 JAGUARS23– ½ 9+ ½ 19 BENGALS PM1:00 20 CHIEFS ?10– ½ 3– ½ 21 SAINTS PM4:05 22 RAIDERS ?12– ½ 1– ½ 23 BRONCOS ? PM4:25 24 CHARGERS14– ½ + ½ 25 RAVENS NBC PM8:30 26 STEELERS ?7– ½ 6– ½ PRO FOOTBALL - MONDAY, NOVEMBER 19, 2012 27 49ERS ? ESPN PM8:40 28 BEARS10– ½ 3– ½ 1,000 145 143 159 144 160 141 157155 156154150 153149 152148 151147 142 158 500 146 Selections Number of Teams Amount Bet REVERSE tEaSER caRd mark box as hown ? denotes home team PRO FOOTBALL - THURSDAY, NOVEMBER 15, 2012 1 BILLS ? NFL PM8:25 2 DOLPHINS7– ½ 6– ½ PRO FOOTBALL - SUNDAY, NOVEMBER 18, 2012 3 REDSKINS ? PM1:00 4 EAGLES10– ½ 3– ½ 5 PACKERS PM1:00 6 LIONS ?10– ½ 3– ½ 7 FALCONS ? PM1:00 8 CARDINALS17– ½ 3+ ½ 9 BUCCANEERS PM1:00 10 PANTHERS ?7– ½ 6– ½ 11 COWBOYS ? PM1:00 12 BROWNS14– ½ + ½ 13 RAMS ? PM1:00 14 JETS10– ½ 3– ½ 15 PATRIOTS ? PM4:25 16 COLTS17– ½ 3+ ½ 17 TEXANS ? PM1:00 18 JAGUARS23– ½ 9+ ½ 19 BENGALS PM1:00 20 CHIEFS ?10– ½ 3– ½ 21 SAINTS PM4:05 22 RAIDERS ?12– ½ 1– ½ 23 BRONCOS ? PM4:25 24 CHARGERS14– ½ + ½ 25 RAVENS NBC PM8:30 26 STEEL RS ?7– ½ 6– ½ PRO FOOTBALL - MONDAY, NOVEMBER 19, 2012 27 49ERS ? ESPN PM8:40 28 BEARS10– ½ 3– ½ 1,000 145 143 159 14 160 41 15715 156154150 153149 152148 51147 142 158 50 146 S lections Number of Teams Amount Bet ark box as sho n ? denotes home team PRO F OTBALL - THURSDAY, NOVEMBER 15, 2012 1 BILLS ? 
NFL PM8:25 2 DOLPHINS7– ½ 6– ½ PRO F OTBALL - SUNDAY, NOVEMBER 18, 2012 3 REDSKINS ? PM1:0 4 EAGLES10– ½ 3– ½ 5 PACKERS PM1:0 6 LIONS ?10– ½ 3– ½ 7 FALCONS ? PM1:0 8 CARDINALS17– ½ 3+ ½ 9 BU CANEERS PM1:0 10 PANTHERS ?7– ½ 6– ½ 11 COWBOYS ? PM1:0 12 BROWNS14– ½ + ½ 13 RAMS ? PM1:0 14 JETS10– ½ 3– ½ 15 PATRIOTS ? PM4:25 16 COLTS17– ½ 3+ ½ 17 TEXANS ? PM1:0 18 JAGUARS23– ½ 9+ ½ 19 BENGALS PM1:0 20 CHIEFS ?10– ½ 3– ½ 21 SAINTS PM4:05 22 RAIDERS ?12– ½ 1– ½ 23 BRONCOS ? PM4:25 24 CHARGERS14– ½ + ½ 25 RAVENS NBC PM8:30 26 STEELERS ?7– ½ 6– ½ PRO F OTBALL - MONDAY, NOVEMBER 19, 2012 27 49ERS ? ESPN PM8:40 28 BEARS10– ½ 3– ½ 1,0 MARK BOX AS SHOWN ? ?DENOTES HOME TEAM PRO FOOTBALL - THURSDAY, SEPTEMBER 8, 2016 1 PANTHERS nbc - 10½ 8:30p 2 BRONCOS ? - 3½ PRO FOOTBALL - SUNDAY, SEPTEMBER 11, 2016 FALCON ? - 9 1:00p 4 BUCCANEERS - 4½ 5 VIKINGS - 9½ 1:00p 6 TITANS ? - 4½ 7 EAGLES ? - 10½ 1:00p 8 BROWNS - 3½ 9 BENGALS - 9½ 1:00p 10 JETS ? - 4½ 11 SAINTS ? - 7½ 1:00p 12 RAIDERS - 6½ 13 CHIEFS ? - 14½ 1:00p 14 CHARGERS + ½ 15 RAVENS ? - 10½ 1:00p 16 BILLS - 3½ 17 TEXANS ? - 14 1:00p 18 BEARS + ½ 19 PACKERS - 12 1:00p 20 JAGUARS ? - 1½ 21 SEAHAWKS ? - 17½ 4:05p 22 DOLPHINS + 3½ 23 COWBOYS ? - 7½ 4:25p 24 GIANTS - 6½ 25 COLTS ? - 10½ 4:25p 26 LIONS - 3½ 27 CARDINALS ? nbc - 14½ 8:30p 28 PATRIOTS + ½ PRO FOOTBALL - MONDAY, SEPTEMBER 12, 2016 29 STEELERS espn - 10½ 7:10p 30 REDSKINS ? - 3½ 31 RAMS espn - 9 10:20p 32 49ERS ? - 4½
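The invisible text in the OP's sample PDF is mostly made invisible in two ways: by defining clip paths (with the text lying outside their bounds) and by filling paths (which hides the text underneath). Thus, we have to take path-related instructions into account during text extraction in order to ignore that invisible text.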
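As a rough illustration of these two mechanisms (not code from the answer), the following sketch uses only `java.awt.geom` from the JDK. The class name `HiddenTextSketch`, its method names, and the coordinates are made up for this example; in real extraction the clip area and the filled path come from the page content stream.

```java
import java.awt.Shape;
import java.awt.geom.Area;
import java.awt.geom.Rectangle2D;

// Hypothetical helper, for illustration only; not part of PDFBox.
final class HiddenTextSketch {

    // Mechanism 1: the glyph origin lies outside the current clipping area,
    // so the glyph is never painted. A null clip means "nothing is clipped".
    static boolean clippedAway(Area clip, double x, double y) {
        return clip != null && !clip.contains(x, y);
    }

    // Mechanism 2: a path filled after the text was drawn covers the glyph origin.
    static boolean coveredByFill(Shape laterFill, double x, double y) {
        return laterFill.contains(x, y);
    }

    public static void main(String[] args) {
        Area clip = new Area(new Rectangle2D.Double(0, 0, 100, 100));
        Shape fill = new Rectangle2D.Double(10, 10, 20, 20);

        System.out.println(clippedAway(clip, 150, 50));  // true
        System.out.println(clippedAway(clip, 50, 50));   // false
        System.out.println(coveredByFill(fill, 15, 15)); // true
        System.out.println(coveredByFill(fill, 50, 50)); // false
    }
}
```

A text extractor that wants to skip invisible text therefore needs to know both the clipping area in effect when a glyph is drawn and the paths that are filled afterwards.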
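Unfortunately, callbacks for such instructions are not declared in PDFTextStripper, in its parent class LegacyPDFStreamEngine, or in that class's parent PDFStreamEngine.
But they are declared in the other main PDFStreamEngine subclass, PDFGraphicsStreamEngine, and they are sensibly implemented in PageDrawer.
To make use of this, we can copy, paste, and adapt the PageDrawer implementation into a subclass of PDFTextStripper, for example: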
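import java.awt.geom.Area;
import java.awt.geom.GeneralPath;
import java.awt.geom.Point2D;
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

import org.apache.pdfbox.contentstream.operator.MissingOperandException;
import org.apache.pdfbox.contentstream.operator.Operator;
import org.apache.pdfbox.contentstream.operator.OperatorProcessor;
import org.apache.pdfbox.cos.COSBase;
import org.apache.pdfbox.cos.COSNumber;
import org.apache.pdfbox.pdmodel.graphics.state.PDGraphicsState;
import org.apache.pdfbox.text.PDFTextStripper;
import org.apache.pdfbox.text.TextPosition;
import org.apache.pdfbox.util.Matrix;
import org.apache.pdfbox.util.Vector;

// The imports above assume the PDFBox 2.0.x package layout used by the question's code.
public class PDFVisibleTextStripper extends PDFTextStripper {
    public PDFVisibleTextStripper() throws IOException {
        // Register the inner operator classes defined below for the path operators.
        addOperator(new AppendRectangleToPath());
        addOperator(new ClipEvenOddRule());
        addOperator(new ClipNonZeroRule());
        addOperator(new ClosePath());
        addOperator(new CurveTo());
        addOperator(new CurveToReplicateFinalPoint());
        addOperator(new CurveToReplicateInitialPoint());
        addOperator(new EndPath());
        addOperator(new FillEvenOddAndStrokePath());
        addOperator(new FillEvenOddRule());
        addOperator(new FillNonZeroAndStrokePath());
        addOperator(new FillNonZeroRule());
        addOperator(new LineTo());
        addOperator(new MoveTo());
        addOperator(new StrokePath());
    }

    // Keep a character only if both the start and the end of its baseline
    // lie inside the current clipping path (or if there is no clipping path).
    @Override
    protected void processTextPosition(TextPosition text) {
        Matrix textMatrix = text.getTextMatrix();
        Vector start = textMatrix.transform(new Vector(0, 0));
        Vector end = new Vector(start.getX() + text.getWidth(), start.getY());
        PDGraphicsState gs = getGraphicsState();
        Area area = gs.getCurrentClippingPath();
        if (area == null || (area.contains(start.getX(), start.getY()) && area.contains(end.getX(), end.getY())))
            super.processTextPosition(text);
    }

    // The path currently under construction, in transformed coordinates.
    private GeneralPath linePath = new GeneralPath();

    // Remove already collected characters whose baseline start or end
    // lies inside the current path, i.e. characters covered by a fill.
    void deleteCharsInPath() {
        for (List<TextPosition> list : charactersByArticle) {
            List<TextPosition> toRemove = new ArrayList<>();
            for (TextPosition text : list) {
                Matrix textMatrix = text.getTextMatrix();
                Vector start = textMatrix.transform(new Vector(0, 0));
                Vector end = new Vector(start.getX() + text.getWidth(), start.getY());
                if (linePath.contains(start.getX(), start.getY()) || linePath.contains(end.getX(), end.getY())) {
                    toRemove.add(text);
                }
            }
            if (toRemove.size() != 0) {
                // Debug output: number of characters removed by this fill.
                System.out.println(toRemove.size());
                list.removeAll(toRemove);
            }
        }
    }
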
    public final class AppendRectangleToPath extends OperatorProcessor {
        @Override
        public void process(Operator operator, List<COSBase> operands) throws IOException {
            if (operands.size() < 4) {
                throw new MissingOperandException(operator, operands);
            }
            if (!checkArrayTypesClass(operands, COSNumber.class)) {
                return;
            }
            COSNumber x = (COSNumber) operands.get(0);
            COSNumber y = (COSNumber) operands.get(1);
            COSNumber w = (COSNumber) operands.get(2);
            COSNumber h = (COSNumber) operands.get(3);
            float x1 = x.floatValue();
            float y1 = y.floatValue();
            // create a pair of coordinates for the transformation
            float x2 = w.floatValue() + x1;
            float y2 = h.floatValue() + y1;
            Point2D p0 = context.transformedPoint(x1, y1);
            Point2D p1 = context.transformedPoint(x2, y1);
            Point2D p2 = context.transformedPoint(x2, y2);
            Point2D p3 = context.transformedPoint(x1, y2);
            // to ensure that the path is created in the right direction, we have to create
            // it by combining single lines instead of creating a simple rectangle
            linePath.moveTo((float) p0.getX(), (float) p0.getY());
            linePath.lineTo((float) p1.getX(), (float) p1.getY());
            linePath.lineTo((float) p2.getX(), (float) p2.getY());
            linePath.lineTo((float) p3.getX(), (float) p3.getY());
            // close the subpath instead of adding the last line so that a possible set line
            // cap style isn't taken into account at the "beginning" of the rectangle
            linePath.closePath();
        }

        @Override
        public String getName() {
            return "re";
        }
    }

    // "S": the path is only stroked, so it is discarded without removing any text.
    public final class StrokePath extends OperatorProcessor {
        @Override
        public void process(Operator operator, List<COSBase> operands) throws IOException {
            linePath.reset();
        }

        @Override
        public String getName() {
            return "S";
        }
    }

    // "f*": fill with the even-odd rule; characters inside the path are removed.
    public final class FillEvenOddRule extends OperatorProcessor {
        @Override
        public void process(Operator operator, List<COSBase> operands) throws IOException {
            linePath.setWindingRule(GeneralPath.WIND_EVEN_ODD);
            deleteCharsInPath();
            linePath.reset();
        }

        @Override
        public String getName() {
            return "f*";
        }
    }

    // "f": fill with the non-zero winding rule; characters inside the path are removed.
    public class FillNonZeroRule extends OperatorProcessor {
        @Override
        public final void process(Operator operator, List<COSBase> operands) throws IOException {
            linePath.setWindingRule(GeneralPath.WIND_NON_ZERO);
            deleteCharsInPath();
            linePath.reset();
        }

        @Override
        public String getName() {
            return "f";
        }
    }

    // "B*": fill (even-odd rule) and stroke; handled like "f*".
    public final class FillEvenOddAndStrokePath extends OperatorProcessor {
        @Override
        public void process(Operator operator, List<COSBase> operands) throws IOException {
            linePath.setWindingRule(GeneralPath.WIND_EVEN_ODD);
            deleteCharsInPath();
            linePath.reset();
        }

        @Override
        public String getName() {
            return "B*";
        }
    }

    // "B": fill (non-zero winding rule) and stroke; handled like "f".
    public class FillNonZeroAndStrokePath extends OperatorProcessor {
        @Override
        public void process(Operator operator, List<COSBase> operands) throws IOException {
            linePath.setWindingRule(GeneralPath.WIND_NON_ZERO);
            deleteCharsInPath();
            linePath.reset();
        }

        @Override
        public String getName() {
            return "B";
        }
    }

    // "W*": intersect the clipping path with the current path (even-odd rule).
    public final class ClipEvenOddRule extends OperatorProcessor {
        @Override
        public void process(Operator operator, List<COSBase> operands) throws IOException {
            linePath.setWindingRule(GeneralPath.WIND_EVEN_ODD);
            getGraphicsState().intersectClippingPath(linePath);
        }

        @Override
        public String getName() {
            return "W*";
        }
    }

    // "W": intersect the clipping path with the current path (non-zero winding rule).
    public class ClipNonZeroRule extends OperatorProcessor {
        @Override
        public void process(Operator operator, List<COSBase> operands) throws IOException {
            linePath.setWindingRule(GeneralPath.WIND_NON_ZERO);
            getGraphicsState().intersectClippingPath(linePath);
        }

        @Override
        public String getName() {
            return "W";
        }
    }

    public final class MoveTo extends OperatorProcessor {
        @Override
        public void process(Operator operator, List<COSBase> operands) throws IOException {
            if (operands.size() < 2) {
                throw new MissingOperandException(operator, operands);
            }
            COSBase base0 = operands.get(0);
            if (!(base0 instanceof COSNumber)) {
                return;
            }
            COSBase base1 = operands.get(1);
            if (!(base1 instanceof COSNumber)) {
                return;
            }
            COSNumber x = (COSNumber) base0;
            COSNumber y = (COSNumber) base1;
            Point2D.Float pos = context.transformedPoint(x.floatValue(), y.floatValue());
            linePath.moveTo(pos.x, pos.y);
        }

        @Override
        public String getName() {
            return "m";
        }
    }

    public class LineTo extends OperatorProcessor {
        @Override
        public void process(Operator operator, List<COSBase> operands) throws IOException {
            if (operands.size() < 2) {
                throw new MissingOperandException(operator, operands);
            }
            COSBase base0 = operands.get(0);
            if (!(base0 instanceof COSNumber)) {
                return;
            }
            COSBase base1 = operands.get(1);
            if (!(base1 instanceof COSNumber)) {
                return;
            }
            // append straight line segment from the current point to the point
            COSNumber x = (COSNumber) base0;
            COSNumber y = (COSNumber) base1;
            Point2D.Float pos = context.transformedPoint(x.floatValue(), y.floatValue());
            linePath.lineTo(pos.x, pos.y);
        }

        @Override
        public String getName() {
            return "l";
        }
    }

    public class CurveTo extends OperatorProcessor {
        @Override
        public void process(Operator operator, List<COSBase> operands) throws IOException {
            if (operands.size() < 6) {
                throw new MissingOperandException(operator, operands);
            }
            if (!checkArrayTypesClass(operands, COSNumber.class)) {
                return;
            }
            COSNumber x1 = (COSNumber) operands.get(0);
            COSNumber y1 = (COSNumber) operands.get(1);
            COSNumber x2 = (COSNumber) operands.get(2);
            COSNumber y2 = (COSNumber) operands.get(3);
            COSNumber x3 = (COSNumber) operands.get(4);
            COSNumber y3 = (COSNumber) operands.get(5);
            Point2D.Float point1 = context.transformedPoint(x1.floatValue(), y1.floatValue());
            Point2D.Float point2 = context.transformedPoint(x2.floatValue(), y2.floatValue());
            Point2D.Float point3 = context.transformedPoint(x3.floatValue(), y3.floatValue());
            linePath.curveTo(point1.x, point1.y, point2.x, point2.y, point3.x, point3.y);
        }

        @Override
        public String getName() {
            return "c";
        }
    }

    public final class CurveToReplicateFinalPoint extends OperatorProcessor {
        @Override
        public void process(Operator operator, List<COSBase> operands) throws IOException {
            if (operands.size() < 4) {
                throw new MissingOperandException(operator, operands);
            }
            if (!checkArrayTypesClass(operands, COSNumber.class)) {
                return;
            }
            COSNumber x1 = (COSNumber) operands.get(0);
            COSNumber y1 = (COSNumber) operands.get(1);
            COSNumber x3 = (COSNumber) operands.get(2);
            COSNumber y3 = (COSNumber) operands.get(3);
            Point2D.Float point1 = context.transformedPoint(x1.floatValue(), y1.floatValue());
            Point2D.Float point3 = context.transformedPoint(x3.floatValue(), y3.floatValue());
            linePath.curveTo(point1.x, point1.y, point3.x, point3.y, point3.x, point3.y);
        }

        @Override
        public String getName() {
            return "y";
        }
    }

    public class CurveToReplicateInitialPoint extends OperatorProcessor {
        @Override
        public void process(Operator operator, List<COSBase> operands) throws IOException {
            if (operands.size() < 4) {
                throw new MissingOperandException(operator, operands);
            }
            if (!checkArrayTypesClass(operands, COSNumber.class)) {
                return;
            }
            COSNumber x2 = (COSNumber) operands.get(0);
            COSNumber y2 = (COSNumber) operands.get(1);
            COSNumber x3 = (COSNumber) operands.get(2);
            COSNumber y3 = (COSNumber) operands.get(3);
            Point2D currentPoint = linePath.getCurrentPoint();
            Point2D.Float point2 = context.transformedPoint(x2.floatValue(), y2.floatValue());
            Point2D.Float point3 = context.transformedPoint(x3.floatValue(), y3.floatValue());
            linePath.curveTo((float) currentPoint.getX(), (float) currentPoint.getY(), point2.x, point2.y, point3.x, point3.y);
        }

        @Override
        public String getName() {
            return "v";
        }
    }

    public final class ClosePath extends OperatorProcessor {
        @Override
        public void process(Operator operator, List<COSBase> operands) throws IOException {
            linePath.closePath();
        }

        @Override
        public String getName() {
            return "h";
        }
    }

    public final class EndPath extends OperatorProcessor {
        @Override
        public void process(Operator operator, List<COSBase> operands) throws IOException {
            linePath.reset();
        }

        @Override
        public String getName() {
            return "n";
        }
    }
}
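Make sure that the PDFVisibleTextStripper constructor uses the inner operator classes, not the classes of the same name used by PageDrawer. To be sure, follow the link given under the code in the original answer.
This reduces the output to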
REVERSE tEaSER caRd
500
elections
er of Teams
t Bet
1,000
MARK BOX AS SHOWN ?
?DENOTES HOME TEAM
PRO FOOTBALL - THURSDAY, SEPTEMBER 8, 2016
1 PANTHERS nbc - 10½ 8:30p 2 BRONCOS ? - 3½
PRO FOOTBALL - SUNDAY, SEPTEMBER 11, 2016
3 FALCONS ? - 9½ 1:00p 4 BUCCANEERS - 4½
5 VIKINGS - 9½ 1:00p 6 TITANS ? - 4½
7 EAGLES ? - 10½ 1:00p 8 BROWNS - 3½
9 BENGALS - 9½ 1:00p 10 JETS ? - 4½
11 SAINTS ? - 7½ 1:00p 12 RAIDERS - 6½
13 CHIEFS ? - 14½ 1:00p 14 CHARGERS + ½
15 RAVENS ? - 10½ 1:00p 16 BILLS - 3½
17 TEXANS ? - 14½ 1:00p 18 BEARS + ½
19 PACKERS - 12½ 1:00p 20 JAGUARS ? - 1½
21 SEAHAWKS ? - 17½ 4:05p 22 DOLPHINS + 3½
23 COWBOYS ? - 7½ 4:25p 24 GIANTS - 6½
25 COLTS ? - 10½ 4:25p 26 LIONS - 3½
27 CARDINALS ? nbc - 14½ 8:30p 28 PATRIOTS + ½
PRO FOOTBALL - MONDAY, SEPTEMBER 12, 2016
29 STEELERS espn - 10½ 7:10p 30 REDSKINS ? - 3½
31 RAMS espn - 9½ 10:20p 32 49ERS ? - 4½
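This discards most of the unwanted data.
In the context of this question it became clear that the way processTextPosition and deleteCharsInPath calculate the end of a character's baseline implicitly assumes horizontal text without page rotation. If one relaxes the "visibility" criterion, however, a character can be assumed to be visible if the start of its baseline is visible. In that case Vector end no longer needs to be calculated, and the code also works for rotated pages.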
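The difference between the two criteria can be sketched with plain `java.awt.geom` types. `RelaxedVisibilitySketch`, its methods, and the coordinates are hypothetical names and values chosen for this illustration; they are not part of the answer's class. The example imagines a character on a rotated page whose `end` point is nevertheless computed as if the text were horizontal.

```java
import java.awt.geom.Area;
import java.awt.geom.Point2D;
import java.awt.geom.Rectangle2D;

// Hypothetical helper, for illustration only; not part of PDFBox.
final class RelaxedVisibilitySketch {

    // Strict criterion: start and end of the baseline must both be inside the clip.
    static boolean strictlyVisible(Area clip, Point2D start, Point2D end) {
        return clip == null || (clip.contains(start) && clip.contains(end));
    }

    // Relaxed criterion: only the start of the baseline has to be inside the clip.
    static boolean startVisible(Area clip, Point2D start) {
        return clip == null || clip.contains(start);
    }

    public static void main(String[] args) {
        Area clip = new Area(new Rectangle2D.Double(0, 0, 100, 100));
        Point2D start = new Point2D.Double(95, 50);
        // End computed as (start.x + width, start.y), i.e. assuming horizontal text.
        // For rotated text this point can fall outside the clip although the glyph is visible.
        Point2D assumedEnd = new Point2D.Double(105, 50);

        System.out.println(strictlyVisible(clip, start, assumedEnd)); // false
        System.out.println(startVisible(clip, start));                // true
    }
}
```

In PDFVisibleTextStripper this would correspond to dropping the `end` checks in processTextPosition and deleteCharsInPath, at the cost of also keeping characters that are only partly visible.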