How to extract tables from a PDF using PDFMiner?

Abt*_*Pst 6 python pdf parsing pdfminer

I am trying to extract information from some tables in a PDF document.
Consider the input:

Title 1
some text some text some text some text some text
some text some text some text some text some text

Table Title
| Col1          | Col2    | Col3    |
|---------------|---------|---------|
| val11         | val12   | val13   |
| val21         | val22   | val23   |
| val31         | val32   | val33   |

Title 2
some more text some more text some more text some more text
some more text
some more text some more text some more text some more text

I can get the outline/titles like this:

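from pdfminer.pdfdocument import PDFDocument
from pdfminer.pdfparser import PDFParser

path = 'myFile.pdf'
# Open a PDF file (kept open, because pages are read from it later).
fp = open(path, 'rb')
# Create a PDF parser object associated with the file object.
parser = PDFParser(fp)
# Create a PDF document object that stores the document structure.
# Supply the password for initialization.
document = PDFDocument(parser, '')
# Each outline entry is a (level, title, dest, a, se) tuple.
outlines = document.get_outlines()
for (level, title, dest, a, se) in outlines:
    print(level, title)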

This gives me:

(1, u'Title 1')
(2, u'Table Title')
(1, u'Title 2')

This is perfect, since the levels match the text hierarchy. Now I can extract the text as follows:

from pdfminer.converter import PDFPageAggregator
from pdfminer.layout import LAParams, LTTextBox
from pdfminer.pdfinterp import PDFPageInterpreter, PDFResourceManager
from pdfminer.pdfpage import PDFPage, PDFTextExtractionNotAllowed

# Refuse to continue if the document does not allow text extraction.
if not document.is_extractable:
    raise PDFTextExtractionNotAllowed
# Create a PDF resource manager object that stores shared resources.
rsrcmgr = PDFResourceManager()
# Create a PDF device object.
laparams = LAParams()
device = PDFPageAggregator(rsrcmgr, laparams=laparams)
# Create a PDF interpreter object.
interpreter = PDFPageInterpreter(rsrcmgr, device)
# Process each page contained in the document.
with open('textFromPdf.txt', 'w') as text_from_pdf:
    for page in PDFPage.create_pages(document):
        interpreter.process_page(page)
        layout = device.get_result()
        for element in layout:
            if isinstance(element, LTTextBox):
                # Replace non-ASCII characters with spaces before writing.
                text_from_pdf.write(''.join(i if ord(i) < 128 else ' '
                                            for i in element.get_text()))

The text this writes out is a bit odd, though: the table is extracted column by column rather than row by row. Can I get the table row by row? Also, how can I tell where a table starts and ends?

小智 4

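If you only want to extract tables from a PDF document, have a look at this answer: How to extract table as text from the PDF using Python?

Following that answer, I tried tabula-py, and it worked for me, including on tables spread across multiple pages of a PDF. tabula-py correctly skipped all the headers and footers. Before that I had tried PDFMiner on the same kind of document and ran into the same problem you describe, sometimes even worse.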
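If you have to stay with PDFMiner, one possible workaround for the column-wise order is to regroup the text fragments by their vertical position. The sketch below is only an illustration under stated assumptions: `group_into_rows`, `y_tolerance`, and the sample coordinates are hypothetical and not part of PDFMiner. It assumes you have already collected `(x0, y, text)` tuples yourself, for example from the `x0` and `y1` attributes and `get_text()` of each `LTTextLine` inside an `LTTextBox`.

```python
def group_into_rows(fragments, y_tolerance=2.0):
    """Group (x0, y, text) fragments into rows, ordered top to bottom.

    Hypothetical helper: fragments whose y values differ by at most
    y_tolerance from the first fragment of a row are treated as that row.
    """
    rows = []  # each entry: [reference_y, [(x0, text), ...]]
    # PDF coordinates start at the bottom-left corner, so a larger y
    # means higher on the page; sort descending to read top to bottom.
    for x0, y, text in sorted(fragments, key=lambda f: -f[1]):
        if rows and abs(rows[-1][0] - y) <= y_tolerance:
            rows[-1][1].append((x0, text))
        else:
            rows.append([y, [(x0, text)]])
    # Within each row, order the cells left to right by x0.
    return [[text for _, text in sorted(cells, key=lambda c: c[0])]
            for _, cells in rows]


# Made-up fragments in the column-wise order described in the question.
fragments = [
    (50, 700, "Col1"), (50, 685, "val11"), (50, 670, "val21"),
    (150, 700, "Col2"), (150, 685, "val12"), (150, 670, "val22"),
]
rows = group_into_rows(fragments)
for row in rows:
    print(row)
```

This only restores row order for fragments that really sit on one baseline. It does not detect where a table starts or ends, so you would still need something like the outline titles or ruling lines to delimit the table.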