
How can I extract tables from a PDF with Python?

I have thousands of PDF files that consist only of tables, structured as follows:

However, even though the layout is quite regular, I can't read the tables without losing their structure.

I tried PyPDF2, but the extracted data comes out completely scrambled.

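import PyPDF2

# Legacy PyPDF2 (< 3.0) API; the file name must be passed as a string
pdfFileObj = open('pdf_file.pdf', 'rb')
pdfReader = PyPDF2.PdfFileReader(pdfFileObj)
pageObj = pdfReader.getPage(0)  # first page only

print(pageObj.extractText())                 # full page text
print(pageObj.extractText().split('\n')[0])  # first chunk when split on newlines
print(pageObj.extractText().split('/')[0])   # first chunk when split on '/'

pdfFileObj.close()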
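Plain text extraction returns the text without its position on the page, which is one likely reason the rows and columns fall apart. As a rough sketch of the alternative idea, the snippet below rebuilds rows from word coordinates. It assumes a layout-aware extractor has already produced `(x, y, text)` tuples with `y` increasing down the page; the `rows_from_words` helper, the `y_tolerance` parameter, and the sample words are hypothetical and not part of PyPDF2.

```python
def rows_from_words(words, y_tolerance=2.0):
    """Group (x, y, text) words into table rows.

    Words whose y differs from the first word of the current row by at
    most y_tolerance are treated as the same row; cells are ordered by x.
    """
    rows = []  # each entry: (y of first word in row, [(x, text), ...])
    for x, y, text in sorted(words, key=lambda w: (w[1], w[0])):
        if rows and abs(rows[-1][0] - y) <= y_tolerance:
            rows[-1][1].append((x, text))
        else:
            rows.append((y, [(x, text)]))
    # Order the cells of each row from left to right and keep only the text
    return [[text for _, text in sorted(cells)] for _, cells in rows]


# Hypothetical word positions, deliberately given out of reading order
words = [
    (150, 120.0, "3"),
    (50, 100.0, "Name"),
    (50, 119.5, "Apple"),
    (150, 100.5, "Qty"),
]
for row in rows_from_words(words):
    print(row)
```

The tolerance absorbs small baseline differences between cells of the same row; a value that is too large would merge adjacent rows, so it would need tuning per document.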


I also tried Tabula, but it only reads the headers (not the contents of the tables).

from tabula import read_pdf

# Option 1: reads all the headers
pdfFile1 = read_pdf('pdf_file.pdf', output_format='json')
# Option 2: reads only the first header and a few lines of content
pdfFile2 = read_pdf('pdf_file.pdf', multiple_tables=True)
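One thing that may be worth checking: tabula-py's `pages` argument defaults to the first page, so passing `pages='all'` might return more than the first header. To inspect what the JSON output actually contains, a small helper like the one below can flatten it into plain rows. This is only a sketch: it assumes the JSON is a list of tables, each with a `data` list of rows whose cells carry a `text` key. The `tables_to_rows` name and the sample data are hypothetical.

```python
def tables_to_rows(tables):
    """Convert tabula-style JSON (a list of tables) into lists of cell strings."""
    result = []
    for table in tables:
        rows = []
        for row in table.get("data", []):
            cells = [cell.get("text", "").strip() for cell in row]
            if any(cells):  # drop rows in which every cell is empty
                rows.append(cells)
        result.append(rows)
    return result


# Hypothetical sample shaped like the assumed JSON output; the real output
# would come from something like:
#   read_pdf('pdf_file.pdf', pages='all', output_format='json')
sample = [
    {
        "data": [
            [{"text": "Name"}, {"text": "Qty"}],
            [{"text": ""}, {"text": ""}],
            [{"text": " Apple "}, {"text": "3"}],
        ]
    }
]
for table in tables_to_rows(sample):
    for row in table:
        print(row)
```

If the flattened rows contain only headers, the table body is probably not being detected at all, which points at the detection settings rather than at the output format.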


Any ideas?

python pdf

fma*_*ues

2019 05-08

7
votes

1
answer

40k
views