tabula vs. camelot for extracting tables from PDF

Question

Nir*_*mar 2 python pdf tabula python-camelot

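I need to extract tables from PDFs. The tables can be of any type: multiple headers, vertical headers, horizontal headers, and so on.

I have implemented the basic use case with both libraries and found that tabula does better than Camelot, but it still does not detect every table perfectly, and I am not sure it will work for all table types.

So I am looking for advice from people who have implemented a similar use case.

Tabula implementation: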

import tabula

# read_pdf returns a list of DataFrames, one per detected table
tab = tabula.read_pdf('pdfs/PDF1.pdf', pages='all')
for t in tab:
    print(t, "\n=========================\n")

Camelot implementation:

import camelot

# split_text=True splits text that spans several cells
tables = camelot.read_pdf('pdfs/PDF1.pdf', pages='all', split_text=True)
print(tables)  # summary of the returned TableList
for tabs in tables:
    print(tabs.df, "\n=================================\n")

Answer 1

Ste*_*n87 6

Please read: https://camelot-py.readthedocs.io/en/master/#why-camelot

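The main advantage of Camelot is that the library exposes a rich set of parameters you can use to improve the extraction.

Obviously, applying these parameters takes some study and a fair amount of trial and error.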
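To make the "trial and error" part concrete, here is a minimal sketch of trying both Camelot flavors on the same page and comparing their parsing reports. The helper names (`score`, `best_flavor`, `compare_flavors`) are hypothetical, the values for `line_scale`, `row_tol` and `edge_tol` are only illustrative starting points, and ranking by accuracy and whitespace is my own assumption, not something Camelot does for you.

```python
def score(report):
    # Assumed heuristic: prefer higher accuracy, then lower whitespace.
    return (report["accuracy"], -report["whitespace"])


def best_flavor(reports):
    """Return the label with the best report, or None if there are none.

    `reports` maps a label such as 'lattice' to a dict shaped like
    Camelot's Table.parsing_report (keys 'accuracy' and 'whitespace').
    """
    if not reports:
        return None
    return max(reports, key=lambda label: score(reports[label]))


def compare_flavors(path, pages="1"):
    """Run Camelot with both flavors and collect the first table's report."""
    import camelot  # imported here so the helpers above work without it

    attempts = {
        # lattice: tables with ruling lines; line_scale helps detect short lines
        "lattice": dict(flavor="lattice", line_scale=40),
        # stream: whitespace-separated tables; tolerances are illustrative
        "stream": dict(flavor="stream", row_tol=10, edge_tol=500),
    }
    reports = {}
    for label, kwargs in attempts.items():
        tables = camelot.read_pdf(path, pages=pages, **kwargs)
        if tables.n > 0:
            reports[label] = tables[0].parsing_report
    return reports
```

You would call it as `best_flavor(compare_flavors('pdfs/PDF1.pdf'))` and then keep tuning the winning flavor. Only the first table on the page is compared here, so for PDFs with several tables per page you would want to look at every report.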

A comparison of Camelot with other PDF table extraction libraries is also available.
