Extracting tables that span multiple pages

Question

Pro*_*azy 4 python screen-scraping tabula

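I am trying to extract tables from a PDF, and Tabula lets me do that.

The problem I am facing is that when a table spans multiple pages, Tabula treats the part on each new page as a separate table.

Is there any method or logic to get around this?

Code: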

from tabula import read_pdf

# With multiple_tables=True, read_pdf returns a list of DataFrames.
df = read_pdf("SampleTableFormat2pages.pdf", multiple_tables=True, pages="all")
print(len(df))
print(df)

Output:

2
[        0       1       2       3       4
0  Label1  Label2  Label3  Label4  Label5
1   Row11   Row12   Row13   Row14   Row15
2   Row21   Row22   Row23   Row24   Row25
3   Row31   Row32   Row33   Row34   Row35,        0      1      2      3      4
0  Row41  Row42  Row43  Row44  Row45
1  Row51  Row52  Row53  Row54  Row55]

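Is there any logic that lets Tabula recognize table boundaries and detect that a table continues on the next page?

Or is there any other library that could help solve this?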

Answer 1

小智 6

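I would suggest reading the pages one at a time and concatenating the results into a final table. You can use this function to get the number of pages in the PDF: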

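import re

def count_pdf_pages(file_path):
    # Count "/Type /Page" objects; [^s] skips the "/Type /Pages" tree nodes.
    # The file is read as bytes, so the pattern must be a bytes pattern too.
    rxcountpages = re.compile(rb"/Type\s*/Page([^s]|$)", re.MULTILINE | re.DOTALL)
    with open(file_path, "rb") as temp_file:
        return len(rxcountpages.findall(temp_file.read()))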

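To see why the pattern contains `[^s]`, here is a small check on a hand-made byte string. The helper name `count_pages_in_bytes` and the sample bytes are made up for illustration; only the regex comes from the function above. Treat the count as a heuristic: PDFs that keep their page objects inside compressed object streams do not expose `/Type /Page` in the raw bytes, so the result can come out too low.

```python
import re

# Same pattern as count_pdf_pages, compiled as a bytes pattern.
PAGE_RX = re.compile(rb"/Type\s*/Page([^s]|$)", re.MULTILINE | re.DOTALL)


def count_pages_in_bytes(data: bytes) -> int:
    """Count page objects in raw PDF bytes (hypothetical helper)."""
    return len(PAGE_RX.findall(data))


# Hand-written stand-in for PDF content: one page-tree node, two pages.
sample = (
    b"<< /Type /Pages /Count 2 >>\n"
    b"<< /Type /Page /Parent 1 0 R >>\n"
    b"<< /Type/Page /Parent 1 0 R >>\n"
)

# "/Type /Pages" is skipped because the character after "Page" is "s".
print(count_pages_in_bytes(sample))  # 2
```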

Now loop over every page that contains a table:

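import pandas as pd
import tabula

pages = count_pdf_pages("SampleTableFormat2pages.pdf")
df_combine = pd.DataFrame([])
for pageiter in range(pages):
    result = tabula.read_pdf("SampleTableFormat2pages.pdf", pages=pageiter + 1, guess=False)
    # Older tabula-py returns one DataFrame; newer versions return a list of them.
    frames = result if isinstance(result, list) else [result]
    # If you want to change the table by editing the columns, you can do that here.
    # Append in page order; you can choose between merge and concat as needed.
    df_combine = pd.concat([df_combine, *frames], ignore_index=True)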

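If you already have the per-page fragments as a list (as in the question's output, where the column labels are plain integers and the real labels sit in the first row of the first fragment), the stitching step can be sketched with pandas alone. This is only a sketch under assumptions: `stitch_fragments` is a hypothetical helper, not part of tabula-py, and it assumes every fragment belongs to the same table and has the same number of columns.

```python
import pandas as pd


def stitch_fragments(fragments, first_row_is_header=True):
    """Join per-page fragments of one table into a single DataFrame.

    `fragments` is a list of DataFrames in page order (hypothetical helper).
    """
    # A differing column count suggests the fragments are different tables.
    widths = {frame.shape[1] for frame in fragments}
    if len(widths) != 1:
        raise ValueError(f"fragments have differing column counts: {sorted(widths)}")

    combined = pd.concat(fragments, ignore_index=True)
    if first_row_is_header:
        # Promote the first row (the labels on page 1) to column names.
        combined.columns = combined.iloc[0].tolist()
        combined = combined.iloc[1:].reset_index(drop=True)
    return combined


# Stand-ins for the two DataFrames shown in the question's output.
page1 = pd.DataFrame([
    ["Label1", "Label2", "Label3"],
    ["Row11", "Row12", "Row13"],
    ["Row21", "Row22", "Row23"],
])
page2 = pd.DataFrame([
    ["Row41", "Row42", "Row43"],
])

table = stitch_fragments([page1, page2])
print(table)
```

The column-count check is a deliberately crude test for "same table". A PDF where two unrelated tables happen to have the same width would still need a manual check, for example comparing the first row of each fragment against the known header.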

Archived:	7 years, 4 months ago
Views:	7,831
Last activity:	7 years, 3 months ago