I have a PDF of more than 1500 pages containing some "random" text, and I have to extract certain text from it. I can identify the blocks like this:
bla bla bla bla bla
...
...
...
-------------------------- (separator blue image)
XXX: TEXT TEXT TEXT
TEXT TEXT TEXT TEXT
...
-------------------------- (separator blue image)
bla bla bla bla
...
...
-------------------------- (separator blue image)
XXX: TEXT2 TEXT2 TEXT2
TEXT2 TEXT2 TEXT TEXT2
...
-------------------------- (separator blue image)
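I need to extract all the text between the separators (all blocks). 'XXX' appears at the start of every block, but I have no way to detect the end of a block. Is it possible to use the image separators in the parser? How?
Is there any other possible approach?
Edit, more information: there is no background, and the text can be copied and pasted.
Sample PDF: 1
See page 320 of the sample.
Thanks.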
In the sample PDF, the separators are created using vector graphics:
0.58 0.17 0 0.47 K
q 1 0 0 1 56.6929 772.726 cm
0 0 m
249.118 0 l
S
Q
q 1 0 0 1 56.6929 690.9113 cm
0 0 m
249.118 0 l
S
and so on.
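To see where such a divider ends up on the page, one can apply the `cm` matrix to the end points of the path by hand. The following is a minimal sketch in plain Java (no iText involved; the class `DividerCoordinates` and its method `transform` are hypothetical names used only for illustration), applied to the first divider above:

```java
class DividerCoordinates {
    /**
     * Applies a PDF transformation matrix, given as the six operands
     * [a b c d e f] of the cm operator, to the point (x, y):
     * x' = a*x + c*y + e, y' = b*x + d*y + f.
     */
    public static double[] transform(double[] cm, double x, double y) {
        return new double[] {
            cm[0] * x + cm[2] * y + cm[4],
            cm[1] * x + cm[3] * y + cm[5]
        };
    }

    public static void main(String[] args) {
        // "q 1 0 0 1 56.6929 772.726 cm" followed by "0 0 m" and "249.118 0 l"
        double[] cm = {1, 0, 0, 1, 56.6929, 772.726};
        double[] from = transform(cm, 0, 0);
        double[] to = transform(cm, 249.118, 0);
        System.out.println("from x=" + from[0] + " y=" + from[1]);
        System.out.println("to   x=" + to[0] + " y=" + to[1]);
    }
}
```

As the matrix is a pure translation here, the divider is a horizontal line at y = 772.726 that starts at x = 56.6929 and is 249.118 units wide.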
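Parsing vector graphics is a fairly recent addition to iText(Sharp), and the API may still change somewhat in this respect. Currently (version 5.5.6) you can parse vector graphics using an implementation of the interface ExtRenderListener (Java) / IExtRenderListener (.Net).
You now have several ways to approach the task:
Extend LocationTextExtractionStrategy and request the text of each rectangle using the GetResultantText(ITextChunkFilter) overload with an appropriate ITextChunkFilter.
(As I am more fluent in Java than in C#, I implemented this sample in Java for iText. It should be easy to port to C# and iTextSharp.)
This implementation attempts to extract the text sections delimited by dividers as in the sample PDF.
It is a one-pass solution that reuses existing LocationTextExtractionStrategy functionality by deriving from that strategy.
In the same pass, this strategy collects the text chunks (thanks to its parent class) and the divider lines (thanks to its implementation of the additional ExtRenderListener methods).
Having parsed a page, the strategy provides a list of Section instances via the method getSections(), each representing a section of the page delimited by a divider line above and/or below. The topmost and bottommost sections of each text column are open at the top or bottom respectively and are implicitly delimited by the matching margin line.
Section implements the TextChunkFilter interface, so the text in the respective part of the page can be retrieved using the parent class method getResultantText(TextChunkFilter).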
This is merely a POC, it is designed to extract sections from documents using dividers exactly like the sample document does, i.e. horizontal lines drawn using moveTo-lineTo-stroke as wide as the section is, appearing in the content stream column-wise sorted. There may be still more implicit assumptions true for the sample PDF.
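The rule by which dividers are paired into sections can be looked at in isolation. The following is a simplified model in plain Java, assuming the dividers already arrive sorted column-wise; `SectionSketch` and `Divider` are hypothetical names used only for this illustration, and a `null` entry stands for a side that is open, i.e. bounded by a page margin:

```java
import java.util.ArrayList;
import java.util.List;

class SectionSketch {
    /** A horizontal divider: x of its left end and its y coordinate. */
    static final class Divider {
        final float x, y;
        Divider(float x, float y) { this.x = x; this.y = y; }
    }

    /**
     * Pairs dividers into {top, bottom} sections. Two consecutive dividers
     * belong to the same column if their left ends differ by less than 2
     * units; otherwise the previous column is closed with an open bottom
     * and the next column starts with an open top.
     */
    public static List<Divider[]> sections(List<Divider> dividers) {
        List<Divider[]> result = new ArrayList<>();
        Divider previous = null;
        for (Divider divider : dividers) {
            if (previous == null) {
                result.add(new Divider[] {null, divider});
            } else if (Math.abs(previous.x - divider.x) < 2) {
                result.add(new Divider[] {previous, divider});
            } else {
                result.add(new Divider[] {previous, null});
                result.add(new Divider[] {null, divider});
            }
            previous = divider;
        }
        // Note: like this sketch, a loop of this shape emits no section
        // below the very last divider.
        return result;
    }
}
```

The full strategy class below applies the same rule in its getSections() method, using the start points of the collected LineSegment instances.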
    import java.util.ArrayList;
    import java.util.List;

    import com.itextpdf.text.pdf.parser.ExtRenderListener;
    import com.itextpdf.text.pdf.parser.LineSegment;
    import com.itextpdf.text.pdf.parser.LocationTextExtractionStrategy;
    import com.itextpdf.text.pdf.parser.Path;
    import com.itextpdf.text.pdf.parser.PathConstructionRenderInfo;
    import com.itextpdf.text.pdf.parser.PathPaintingRenderInfo;
    import com.itextpdf.text.pdf.parser.Vector;

    public class DividerAwareTextExtrationStrategy extends LocationTextExtractionStrategy implements ExtRenderListener
    {
        //
        // constructor
        //
        /**
         * The constructor accepts top and bottom margin lines in user space y coordinates
         * and left and right margin lines in user space x coordinates.
         * Text outside those margin lines is ignored.
         */
        public DividerAwareTextExtrationStrategy(float topMargin, float bottomMargin, float leftMargin, float rightMargin)
        {
            this.topMargin = topMargin;
            this.bottomMargin = bottomMargin;
            this.leftMargin = leftMargin;
            this.rightMargin = rightMargin;
        }

        //
        // Divider derived section support
        //
        public List<Section> getSections()
        {
            List<Section> result = new ArrayList<Section>();
            // TODO: Sort the array columnwise. In case of the OP's document, the lines already appear in the
            // correct order, so there was no need for sorting in the POC.

            LineSegment previous = null;
            for (LineSegment line : lines)
            {
                if (previous == null)
                {
                    // first divider on the page: section open at the top
                    result.add(new Section(null, line));
                }
                else if (Math.abs(previous.getStartPoint().get(Vector.I1) - line.getStartPoint().get(Vector.I1)) < 2) // 2 is a magic number...
                {
                    // same column: section between two dividers
                    result.add(new Section(previous, line));
                }
                else
                {
                    // column change: close the old column, open the new one
                    result.add(new Section(previous, null));
                    result.add(new Section(null, line));
                }
                previous = line;
            }

            return result;
        }

        public class Section implements TextChunkFilter
        {
            LineSegment topLine;
            LineSegment bottomLine;

            final float left, right, top, bottom;

            Section(LineSegment topLine, LineSegment bottomLine)
            {
                float left, right, top, bottom;
                if (topLine != null)
                {
                    this.topLine = topLine;
                    top = Math.max(topLine.getStartPoint().get(Vector.I2), topLine.getEndPoint().get(Vector.I2));
                    right = Math.max(topLine.getStartPoint().get(Vector.I1), topLine.getEndPoint().get(Vector.I1));
                    left = Math.min(topLine.getStartPoint().get(Vector.I1), topLine.getEndPoint().get(Vector.I1));
                }
                else
                {
                    top = topMargin;
                    left = leftMargin;
                    right = rightMargin;
                }

                if (bottomLine != null)
                {
                    this.bottomLine = bottomLine;
                    bottom = Math.min(bottomLine.getStartPoint().get(Vector.I2), bottomLine.getEndPoint().get(Vector.I2));
                    right = Math.max(bottomLine.getStartPoint().get(Vector.I1), bottomLine.getEndPoint().get(Vector.I1));
                    left = Math.min(bottomLine.getStartPoint().get(Vector.I1), bottomLine.getEndPoint().get(Vector.I1));
                }
                else
                {
                    bottom = bottomMargin;
                }

                this.top = top;
                this.bottom = bottom;
                this.left = left;
                this.right = right;
            }

            //
            // TextChunkFilter
            //
            @Override
            public boolean accept(TextChunk textChunk)
            {
                // TODO: This code only checks the text chunk starting point. One should take the
                // whole chunk into consideration
                Vector startlocation = textChunk.getStartLocation();
                float x = startlocation.get(Vector.I1);
                float y = startlocation.get(Vector.I2);
                return (left <= x) && (x <= right) && (bottom <= y) && (y <= top);
            }
        }

        //
        // ExtRenderListener implementation
        //
        /**
         * <p>
         * This method stores targets of <code>moveTo</code> in {@link #moveToVector}
         * and targets of <code>lineTo</code> in {@link #lineToVector}. Any unexpected
         * contents or operations result in clearing of the member variables.
         * </p>
         * <p>
         * So this method is implemented for files with divider lines exactly like in
         * the OP's sample file.
         * </p>
         *
         * @see ExtRenderListener#modifyPath(PathConstructionRenderInfo)
         */
        @Override
        public void modifyPath(PathConstructionRenderInfo renderInfo)
        {
            switch (renderInfo.getOperation())
            {
            case PathConstructionRenderInfo.MOVETO:
            {
                float x = renderInfo.getSegmentData().get(0);
                float y = renderInfo.getSegmentData().get(1);
                moveToVector = new Vector(x, y, 1);
                lineToVector = null;
                break;
            }
            case PathConstructionRenderInfo.LINETO:
            {
                float x = renderInfo.getSegmentData().get(0);
                float y = renderInfo.getSegmentData().get(1);
                if (moveToVector != null)
                {
                    lineToVector = new Vector(x, y, 1);
                }
                break;
            }
            default:
                moveToVector = null;
                lineToVector = null;
            }
        }

        /**
         * This method adds the current path to {@link #lines} if it consists
         * of a single line, the operation is no no-op, and the line is
         * approximately horizontal.
         *
         * @see ExtRenderListener#renderPath(PathPaintingRenderInfo)
         */
        @Override
        public Path renderPath(PathPaintingRenderInfo renderInfo)
        {
            if (moveToVector != null && lineToVector != null &&
                renderInfo.getOperation() != PathPaintingRenderInfo.NO_OP)
            {
                // transform both end points into user space
                Vector from = moveToVector.cross(renderInfo.getCtm());
                Vector to = lineToVector.cross(renderInfo.getCtm());
                Vector extent = to.subtract(from);

                // approximately horizontal: x extent more than 20 times the y extent
                if (Math.abs(20 * extent.get(Vector.I2)) < Math.abs(extent.get(Vector.I1)))
                {
                    // store the line running from left to right
                    LineSegment line;
                    if (extent.get(Vector.I1) >= 0)
                        line = new LineSegment(from, to);
                    else
                        line = new LineSegment(to, from);
                    lines.add(line);
                }
            }

            moveToVector = null;
            lineToVector = null;
            return null;
        }

        /* (non-Javadoc)
         * @see com.itextpdf.text.pdf.parser.ExtRenderListener#clipPath(int)
         */
        @Override
        public void clipPath(int rule)
        {
        }

        //
        // inner members
        //
        final float topMargin, bottomMargin, leftMargin, rightMargin;
        Vector moveToVector = null;
        Vector lineToVector = null;
        final List<LineSegment> lines = new ArrayList<LineSegment>();
    }
(DividerAwareTextExtrationStrategy.java)
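The TODO in `accept` notes that only the starting point of a text chunk is checked. One possible refinement, sketched here in plain Java on bare coordinates, is to require both end points of the chunk to lie inside the section rectangle. The class `ChunkInRectangle` and its methods are hypothetical; in the strategy, the coordinates would presumably come from the chunk's start and end locations.

```java
class ChunkInRectangle {
    /** True if the point (x, y) lies inside the rectangle, borders included. */
    public static boolean contains(float left, float right, float bottom, float top,
                                   float x, float y) {
        return left <= x && x <= right && bottom <= y && y <= top;
    }

    /**
     * Accepts a chunk only if both its start point (sx, sy) and its end
     * point (ex, ey) lie inside the rectangle.
     */
    public static boolean acceptWholeChunk(float left, float right, float bottom, float top,
                                           float sx, float sy, float ex, float ey) {
        return contains(left, right, bottom, top, sx, sy)
            && contains(left, right, bottom, top, ex, ey);
    }
}
```

This is stricter than the original check: a chunk that runs slightly past the right end of a divider would now be rejected, so a small tolerance on the right border may be needed in practice.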
It can be used like this
    String extractAndStore(PdfReader reader, String format, int from, int to) throws IOException
    {
        StringBuilder builder = new StringBuilder();

        for (int page = from; page <= to; page++)
        {
            PdfReaderContentParser parser = new PdfReaderContentParser(reader);
            DividerAwareTextExtrationStrategy strategy = parser.processContent(page, new DividerAwareTextExtrationStrategy(810, 30, 20, 575));

            List<Section> sections = strategy.getSections();
            int i = 0;
            for (Section section : sections)
            {
                String sectionText = strategy.getResultantText(section);
                Files.write(Paths.get(String.format(format, page, i)), sectionText.getBytes("UTF8"));

                builder.append("--\n")
                       .append(sectionText)
                       .append('\n');
                i++;
            }
            builder.append("\n\n");
        }

        return builder.toString();
    }
(DividerAwareTextExtraction.java method extractAndStore)
Applying this method to pages 319 and 320 of your sample PDF
    PdfReader reader = new PdfReader("20150211600.PDF");
    String content = extractAndStore(reader, new File(RESULT_FOLDER, "20150211600.%s.%s.txt").toString(), 319, 320);
(DividerAwareTextExtraction.java test test20150211600_320)
results in
--
do(s) bem (ns) exceder o seu crédito, depositará, no prazo de 3 (três)
dias, a diferença, sob pena de ser tornada sem efeito a arrematação
[...]
EDITAL DE INTIMAÇÃO DE ADVOGADOS
RELAÇÃO Nº 0041/2015
--
ADV: RODRIGO TADEU PIMENTA DE OLIVEIRA (OAB 16752/
SC)
Processo 0033473-16.2010.8.24.0023 (023.10.033473-6) - Ação Penal
Militar - Procedimento Ordinário - Militar - Autor: Ministério Público
do Estado de Santa Catarina - Réu: João Gabriel Adler - Publicada a
sentença neste ato, lida às partes e intimados os presentes. Registre-se.
A defesa manifesta o interesse em recorrer da sentença.
--
ADV: RODRIGO TADEU PIMENTA DE OLIVEIRA (OAB 16752/
SC), CARLOS ROBERTO PEREIRA (OAB 29179/SC), ROBSON
LUIZ CERON (OAB 22475/SC)
Processo 0025622-86.2011.8.24.0023 (023.11.025622-3) - Ação
[...]
1, NIVAEL MARTINS PADILHA, Mat. 928313-7, ANDERSON
VOGEL e ANTÔNIO VALDEMAR FORTES, no ato deprecado.
--
--
ADV: RODRIGO TADEU PIMENTA DE OLIVEIRA (OAB 16752/
SC)
Processo 0006958-36.2013.8.24.0023 (023.13.006958-5) - Ação Penal
Militar - Procedimento Ordinário - Crimes Militares - Autor: Ministério
Público do Estado de Santa Catarina - Réu: Pedro Conceição Bungarten
- Ficam intimadas as partes, da decisão de fls. 289/290, no prazo de
05 (cinco) dias.
--
ADV: RODRIGO TADEU PIMENTA DE OLIVEIRA (OAB 16752/
SC), ROBSON LUIZ CERON (OAB 22475/SC)
Processo 0006967-95.2013.8.24.0023 (023.13.006967-4) - Ação Penal
[...]
a presença dos réus no ato deprecado.
--
ADV: RODRIGO TADEU PIMENTA DE OLIVEIRA (OAB 16752/
SC)
Processo 0016809-02.2013.8.24.0023 - Ação Penal Militar -
[...]
prazo de 05 (cinco) dias.
--
ADV: RODRIGO TADEU PIMENTA DE OLIVEIRA (OAB 16752/
SC), ELIAS NOVAIS PEREIRA (OAB 30513/SC), ROBSON LUIZ
CERON (OAB 22475/SC)
Processo 0021741-33.2013.8.24.0023 - Ação Penal Militar -
[...]
a presença dos réus no ato deprecado.
--
ADV: RODRIGO TADEU PIMENTA DE OLIVEIRA (OAB 16752/
SC)
Processo 0024568-17.2013.8.24.0023 - Ação Penal Militar -
[...]
do CPPM
--
ADV: RODRIGO TADEU PIMENTA DE OLIVEIRA (OAB 16752/
SC)
Processo 0034522-87.2013.8.24.0023 - Ação Penal Militar -
[...]
diligências, consoante o art. 427 do CPPM
--
ADV: SANDRO MARCELO PEROTTI (OAB 8949/SC), NOEL
ANTÔNIO BARATIERI (OAB 16462/SC), RODRIGO TADEU
PIMENTA DE OLIVEIRA (OAB 16752/SC)
Processo 0041634-10.2013.8.24.0023 - Ação Penal Militar -
Procedimento Ordinário - Crimes Militares - Autor: M. P. E. - Réu: J. P.
D. - Defiro a juntada dos documentos de pp. 3214-3217. Oficie-se com
urgência à Comarca de Porto União (ref. Carta Precatória n. 0000463-
--
15.2015.8.24.0052), informando a habilitação dos procuradores. Intime-
se, inclusive os novos constituídos da designação do ato.
--
ADV: SANDRO MARCELO PEROTTI (OAB 8949/SC), NOEL
ANTÔNIO BARATIERI (OAB 16462/SC), RODRIGO TADEU
PIMENTA DE OLIVEIRA (OAB 16752/SC)
Processo 0041634-10.2013.8.24.0023 - Ação Penal Militar -
[...]
imprescindível a presença dos réus no ato deprecado.
--
ADV: RODRIGO TADEU PIMENTA DE OLIVEIRA (OAB 16752/
SC)
Processo 0043998-52.2013.8.24.0023 - Ação Penal Militar -
[...]
de parcelas para desconto remuneratório. Intimem-se.
--
ADV: RODRIGO TADEU PIMENTA DE OLIVEIRA (OAB 16752/
SC)
Processo 0049304-02.2013.8.24.0023 - Ação Penal Militar -
[...]
Rel. Ângela Maria Silveira).
--
ADV: ROBSON LUIZ CERON (OAB 22475/SC)
Processo 0000421-87.2014.8.24.0023 - Ação Penal Militar -
[...]
prazo de 05 (cinco) dias.
--
ADV: RODRIGO TADEU PIMENTA DE OLIVEIRA (OAB 16752/
SC)
Processo 0003198-45.2014.8.24.0023 - Ação Penal Militar -
[...]
de 05 (cinco) dias.
--
ADV: ISAEL MARCELINO COELHO (OAB 13878/SC), ROBSON
LUIZ CERON (OAB 22475/SC)
Processo 0010380-82.2014.8.24.0023 - Ação Penal Militar -
Procedimento Ordinário - Crimes Militares - Autor: Ministério Público
Estadual - Réu: Vilson Diocimar Antunes - HOMOLOGO o pedido
de desistência. Intime-se a defesa para o que preceitua o artigo 417,
§2º, do Código de Processo Penal Militar.
(shortened a bit for obvious reasons)
In a comment the OP wrote:
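One more thing: how can I recognize font size / color changes inside a section? I need this in some cases where there is no separator (only a bigger title), e.g. on page 346 "Armazém" should end the section.
As an example, I extended the DividerAwareTextExtrationStrategy above to add the ascender lines of text in a given color to the divider lines already found: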
    import java.util.Collections;
    import java.util.Comparator;
    import java.util.List;

    import com.itextpdf.text.BaseColor;
    import com.itextpdf.text.pdf.CMYKColor;
    import com.itextpdf.text.pdf.parser.LineSegment;
    import com.itextpdf.text.pdf.parser.TextRenderInfo;
    import com.itextpdf.text.pdf.parser.Vector;

    public class DividerAndColorAwareTextExtractionStrategy extends DividerAwareTextExtrationStrategy
    {
        //
        // constructor
        //
        public DividerAndColorAwareTextExtractionStrategy(float topMargin, float bottomMargin, float leftMargin, float rightMargin, BaseColor headerColor)
        {
            super(topMargin, bottomMargin, leftMargin, rightMargin);
            this.headerColor = headerColor;
        }

        //
        // DividerAwareTextExtrationStrategy overrides
        //
        /**
         * As the {@link DividerAwareTextExtrationStrategy#lines} are not
         * properly sorted anymore (the additional lines come after all
         * divider lines of the same column), we have to sort that {@link List}
         * first.
         */
        @Override
        public List<Section> getSections()
        {
            Collections.sort(lines, new Comparator<LineSegment>()
            {
                @Override
                public int compare(LineSegment o1, LineSegment o2)
                {
                    Vector start1 = o1.getStartPoint();
                    Vector start2 = o2.getStartPoint();

                    // compare by x; within the same column compare by descending y
                    float v1 = start1.get(Vector.I1), v2 = start2.get(Vector.I1);
                    if (Math.abs(v1 - v2) < 2)
                    {
                        v1 = start2.get(Vector.I2);
                        v2 = start1.get(Vector.I2);
                    }

                    return Float.compare(v1, v2);
                }
            });

            return super.getSections();
        }

        /**
         * The ascender lines of text rendered using a fill color approximately
         * like the given header color are added to the divider lines.
         */
        @Override
        public void renderText(TextRenderInfo renderInfo)
        {
            if (approximates(renderInfo.getFillColor(), headerColor))
            {
                lines.add(renderInfo.getAscentLine());
            }

            super.renderText(renderInfo);
        }

        /**
         * This method checks whether two colors are approximately equal. As the
         * sample document only uses CMYK colors, only this comparison has been
         * implemented yet.
         */
        boolean approximates(BaseColor colorA, BaseColor colorB)
        {
            if (colorA == null || colorB == null)
                return colorA == colorB;
            if (colorA instanceof CMYKColor && colorB instanceof CMYKColor)
            {
                CMYKColor cmykA = (CMYKColor) colorA;
                CMYKColor cmykB = (CMYKColor) colorB;
                float c = Math.abs(cmykA.getCyan() - cmykB.getCyan());
                float m = Math.abs(cmykA.getMagenta() - cmykB.getMagenta());
                float y = Math.abs(cmykA.getYellow() - cmykB.getYellow());
                float k = Math.abs(cmykA.getBlack() - cmykB.getBlack());
                return c+m+y+k < 0.01;
            }
            // TODO: Implement comparison for other color types
            return false;
        }

        final BaseColor headerColor;
    }
(DividerAndColorAwareTextExtractionStrategy.java)
In renderText we recognize text in the headerColor and add its respective top line to the lines list.
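The `approximates` method above leaves other color types as a TODO. One way to generalize the idea, sketched here in plain Java without iText, is to compare colors as arrays of normalized components. The class `ColorApproximation` is a hypothetical helper; how the components would be obtained from a BaseColor is deliberately left out.

```java
class ColorApproximation {
    /**
     * Compares two colors given as component arrays of the same color space,
     * e.g. {r, g, b} or {c, m, y, k}, each component in the range 0..1.
     * Returns true if the summed absolute component difference is below
     * the given tolerance.
     */
    public static boolean approximates(float[] a, float[] b, float tolerance) {
        if (a == null || b == null)
            return a == b;
        if (a.length != b.length)
            return false; // different color spaces never match in this sketch
        float sum = 0;
        for (int i = 0; i < a.length; i++)
            sum += Math.abs(a[i] - b[i]);
        return sum < tolerance;
    }
}
```

Like the CMYK comparison above, this treats colors of different color spaces as unequal rather than converting between them.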
Beware: we add the ascender line of each chunk in the given color. We actually should join the ascender lines of all text chunks forming a single header line. As the blue header lines in the sample document consist of merely a single chunk, we don't need to in this sample code. A generic solution would have to be appropriately extended.
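Joining the ascender lines of several chunks into one header line could look roughly like the following. This is a minimal sketch in plain Java on bare coordinates, not part of the tested strategy; `AscenderLineJoiner`, `Segment`, and the two tolerance parameters are hypothetical names chosen for illustration.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

class AscenderLineJoiner {
    /** A horizontal segment from x1 to x2 (x1 <= x2) at height y. */
    static final class Segment {
        final float x1, x2, y;
        Segment(float x1, float x2, float y) { this.x1 = x1; this.x2 = x2; this.y = y; }
    }

    /**
     * Merges segments whose heights differ by at most yTolerance and whose
     * horizontal gap is at most maxGap into a single segment each.
     */
    public static List<Segment> join(List<Segment> segments, float yTolerance, float maxGap) {
        List<Segment> sorted = new ArrayList<>(segments);
        // Top to bottom, then left to right.
        sorted.sort(Comparator.comparingDouble((Segment s) -> -s.y)
                              .thenComparingDouble(s -> s.x1));

        List<Segment> result = new ArrayList<>();
        Segment current = null;
        for (Segment next : sorted) {
            boolean sameLine = current != null
                    && Math.abs(current.y - next.y) <= yTolerance
                    && next.x1 - current.x2 <= maxGap
                    && current.x1 - next.x2 <= maxGap;
            if (sameLine) {
                // Grow the current segment so that it covers both.
                current = new Segment(Math.min(current.x1, next.x1),
                                      Math.max(current.x2, next.x2),
                                      Math.max(current.y, next.y));
            } else {
                if (current != null)
                    result.add(current);
                current = next;
            }
        }
        if (current != null)
            result.add(current);
        return result;
    }
}
```

The sketch only merges segments that end up adjacent after sorting, so header lines at nearly the same height in different columns are kept apart by the maxGap check as long as the column gutter is wider than that gap.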
As the lines are not properly sorted anymore (the additional ascender lines come after all divider lines of the same column), we have to sort that list first.
Please be aware that the Comparator used here is not really proper: It ignores a certain difference in the x coordinate which makes it not really transitive. It only works if the individual lines of the same column have approximately the same starting x coordinate differing clearly from those of different columns.
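A properly transitive ordering can be obtained by first assigning every line to a column and only then sorting. The following is a minimal sketch in plain Java, assuming columns can be told apart by a clear gap between the starting x coordinates; `ColumnwiseSort`, `Line`, and `columnGap` are hypothetical names, and the lines are reduced to their start points.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

class ColumnwiseSort {
    /** Start point of a line in user space. */
    static final class Line {
        final float x, y;
        Line(float x, float y) { this.x = x; this.y = y; }
    }

    /**
     * Sorts lines column by column (left to right) and top to bottom within
     * each column. A new column starts wherever two consecutive x values,
     * taken in ascending order, differ by more than columnGap.
     */
    public static List<Line> sortColumnwise(List<Line> lines, float columnGap) {
        List<Line> byX = new ArrayList<>(lines);
        byX.sort(Comparator.comparingDouble((Line l) -> l.x));

        // Cluster the x-sorted lines into columns.
        List<List<Line>> columns = new ArrayList<>();
        Line previous = null;
        for (Line line : byX) {
            if (previous == null || line.x - previous.x > columnGap)
                columns.add(new ArrayList<Line>());
            columns.get(columns.size() - 1).add(line);
            previous = line;
        }

        // Within each column, order from top (large y) to bottom (small y).
        List<Line> result = new ArrayList<>();
        for (List<Line> column : columns) {
            column.sort(Comparator.comparingDouble((Line l) -> -l.y));
            result.addAll(column);
        }
        return result;
    }
}
```

Both sort steps use ordinary total orders, so the result does not depend on the order in which the lines were collected.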
In a test run (cf. DividerAndColorAwareTextExtraction.java method test20150211600_346) the found sections are also split at the blue headings "Armazém" and "Balneário Camboriú".
Please be aware of the restrictions I pointed out above. If e.g. you want to split at the grey headings in your sample document, you'll have to improve the methods above as those headings don't come in a single chunk.