Abt*_*Pst 8 python pdf parsing pdfminer
I am trying to parse a PDF and build some kind of hierarchical structure from it. Consider the input
Title 1
some text some text some text some text some text some text some text
some text some text some text some text some text some text some text
Title 1.1
some more text some more text some more text some more text
some more text some more text some more text some more text
some more text some more text
Title 2
some final text some final text
some final text some final text some final text some final text
some final text some final text some final text some final text
Here is how I extract the outline/titles
from pdfminer.pdfparser import PDFParser
from pdfminer.pdfdocument import PDFDocument

path = 'myFile.pdf'
# Open a PDF file.
fp = open(path, 'rb')
# Create a PDF parser object associated with the file object.
parser = PDFParser(fp)
# Create a PDF document object that stores the document structure.
# Supply the password for initialization (empty here).
document = PDFDocument(parser, '')
# Each outline entry is (level, title, dest, action, structure element).
outlines = document.get_outlines()
for (level, title, dest, a, se) in outlines:
    print((level, title))
This gives me
(1, u'Title 1')
(2, u'Title 1.1')
(1, u'Title 2')
This is perfect, because the levels line up with the text hierarchy. Now I can extract the text as follows
from pdfminer.converter import PDFPageAggregator
from pdfminer.layout import LAParams, LTTextBox
from pdfminer.pdfinterp import PDFPageInterpreter, PDFResourceManager
from pdfminer.pdfpage import PDFPage, PDFTextExtractionNotAllowed

if not document.is_extractable:
    raise PDFTextExtractionNotAllowed
# Create a PDF resource manager object that stores shared resources.
rsrcmgr = PDFResourceManager()
# Create a PDF device object.
laparams = LAParams()
device = PDFPageAggregator(rsrcmgr, laparams=laparams)
# Create a PDF interpreter object.
interpreter = PDFPageInterpreter(rsrcmgr, device)
# Process each page contained in the document.
with open('textFromPdf.txt', 'w') as text_from_pdf:
    for page in PDFPage.create_pages(document):
        interpreter.process_page(page)
        layout = device.get_result()
        for element in layout:
            if isinstance(element, LTTextBox):
                # Replace non-ASCII characters with spaces before writing.
                text_from_pdf.write(''.join(
                    i if ord(i) < 128 else ' ' for i in element.get_text()))
This gives me
Title 1
some text some text some text some text some text some text some text
some text some text some text some text some text some text some text
Title 1.1
some more text some more text some more text some more text
some more text some more text some more text some more text
some more text some more text
Title 2
some final text some final text
some final text some final text some final text some final text
some final text some final text some final text some final text
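This is fine as far as the order goes, but now I have lost all of the hierarchy. How do I know where one title ends and the next one begins? Also, for a given title/subtitle, which one is the parent?
Is there a way to connect the outline information to the layout elements? It would be great to be able to parse all the information while iterating over the levels.
Another problem is that if there are citations at the bottom of a page, the citation text gets mixed in with the results. Is there a way to ignore headers, footers and citations when parsing a PDF?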
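One possible direction for the header/footer part, sketched under assumptions rather than as a pdfminer feature: drop text boxes whose bounding box falls inside a top or bottom margin band. The helper `drop_margin_boxes` and the margin fractions are hypothetical and would need tuning per document. The sketch works on plain `(text, bbox)` tuples; with pdfminer the values would come from `element.get_text()`, `element.bbox` and `layout.height`. Citations that sit above the bottom band would not be caught by this.

```python
def drop_margin_boxes(boxes, page_height, top=0.08, bottom=0.10):
    """Keep boxes that lie fully between the bottom and top margin bands.

    boxes: iterable of (text, (x0, y0, x1, y1)) with y measured upward
           from the bottom of the page, as in pdfminer's bbox.
    top, bottom: margin sizes as fractions of the page height (guesses).
    """
    lower = page_height * bottom
    upper = page_height * (1.0 - top)
    return [(text, bbox) for text, bbox in boxes
            if bbox[1] >= lower and bbox[3] <= upper]


# Made-up coordinates for illustration only.
boxes = [
    ("running header", (50, 760, 500, 780)),
    ("body text", (50, 300, 500, 700)),
    ("footnote citation", (50, 20, 500, 60)),
]
kept = drop_margin_boxes(boxes, page_height=800)
print([text for text, _ in kept])
```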
小智 1
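I wish this were possible, but the pdfminer documentation explicitly states the following:
"Some PDF documents use page numbers as destinations, while others use page numbers and the physical location within the page. Since PDF does not have a logical structure, and it does not provide a way to refer to any in-page object from the outside, there's no way to tell exactly which part of text these destinations are referring to."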
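Given that limitation, one workaround is a heuristic rather than anything pdfminer provides: match the outline titles against the extracted text lines and group the remaining lines under the most recent title. The sketch below assumes each title appears verbatim on its own line and in outline order, which will not hold for every PDF. The function `build_hierarchy` and its dictionary layout are hypothetical names chosen for illustration.

```python
def build_hierarchy(outline, lines):
    """Group text lines under outline titles by exact title match.

    outline: list of (level, title) pairs in document order.
    lines:   extracted text lines in reading order.
    Returns a list of dicts with keys level, title, parent and text.
    """
    sections = []
    stack = []               # currently open sections, outermost first
    pending = list(outline)  # titles not yet seen in the text
    for line in lines:
        stripped = line.strip()
        if pending and stripped == pending[0][1]:
            level, title = pending.pop(0)
            # Close every open section at the same or a deeper level.
            while stack and stack[-1]["level"] >= level:
                stack.pop()
            parent = stack[-1]["title"] if stack else None
            section = {"level": level, "title": title,
                       "parent": parent, "text": []}
            sections.append(section)
            stack.append(section)
        elif stack:
            # Body text belongs to the innermost open section.
            stack[-1]["text"].append(stripped)
    return sections


outline = [(1, "Title 1"), (2, "Title 1.1"), (1, "Title 2")]
lines = ["Title 1", "some text", "Title 1.1", "some more text",
         "Title 2", "some final text"]
sections = build_hierarchy(outline, lines)
for s in sections:
    print(s["level"], s["title"], s["parent"], s["text"])
```

Lines that appear before the first matched title are ignored, and a body line that happens to equal the next pending title would be misread as a heading, so the result should be checked against the document.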
Thanks