Tabula-py does not split columns correctly

Question

gig*_*iga 5 python pdf python-3.x tabula

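I've just discovered the joys of tabula-py (and, of course, tabula-java) for extracting tables from PDFs. I'm now writing a script for work that reads some data from a PDF table, cleans it up a little, and exports it to Excel. The PDF I use has the same format every day, and the table is always in the same area. To find that area, I use tabula.exe: I select the table, check the preview (which looks fine), and then export the script so I can see the -a argument that tabula.exe uses. I then pass that value to my Python call: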

df = tabula.read_pdf(os.fsdecode(directory) + filename, encoding='ISO-8859-1',
                     stream=True, area="81.106,302.475,384.697,552.491", pages=2,
                     pandas_options={'header': None})
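
The area above is copied as a single string from the script that Tabula exports. A minimal sketch of turning that string into the list of floats that tabula-py documents for its area argument (top, left, bottom, right, in points) might look like the following. The helper name parse_tabula_area is hypothetical and not part of tabula-py, and whether a plain string is also accepted seems to depend on the tabula-py release, so treat this as an illustration rather than a required step.

```python
from typing import List


def parse_tabula_area(arg: str) -> List[float]:
    """Convert a Tabula '-a' value ('top,left,bottom,right') to floats.

    Hypothetical helper for illustration; not part of tabula-py.
    """
    parts = [float(p) for p in arg.split(",")]
    if len(parts) != 4:
        raise ValueError("expected 4 values: top,left,bottom,right")
    top, left, bottom, right = parts
    # A valid rectangle has its bottom below its top and its right edge
    # to the right of its left edge.
    if bottom <= top or right <= left:
        raise ValueError("area must satisfy top < bottom and left < right")
    return parts


# The value exported by Tabula for the table in the question.
area = parse_tabula_area("81.106,302.475,384.697,552.491")
print(area)
```

The resulting list can be passed as area=area in place of the string.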

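I use the encoding parameter because the default utf-8 raises an error, and the stream method because it gives a nicely extracted table in tabula.exe. The resulting DataFrame has a problem, though: the first two columns (shown correctly as two separate columns in the tabula.exe preview) come out as a single column, so names and values are mixed together.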

Do you know why the same area produces two different results in tabula-py and tabula.exe? Thanks a lot!

Answer 1

gig*_*iga 4

Found the answer on GitHub: tabula-py sets the guess option to True by default. So to remove the discrepancy, you only need to add guess=False, and the output will be the same!

    df = tabula.read_pdf(os.fsdecode(directory) + filename, encoding='ISO-8859-1',
                         stream=True, area="81.106,302.475,384.697,552.491", pages=2,
                         guess=False, pandas_options={'header': None})
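
To check the effect of guess on a given file, one option is to run the same extraction with guess=True and guess=False and compare the table shapes. The sketch below assumes tabula-py and a Java runtime are installed. The names build_read_pdf_kwargs and compare_guess are hypothetical helpers, not part of the library. Depending on the tabula-py release, read_pdf may return a single DataFrame or a list of DataFrames, and newer releases may already disable guessing when an area is given, so the two runs can turn out identical.

```python
def build_read_pdf_kwargs(area, page, guess=False):
    """Collect the keyword arguments used in the answer (hypothetical helper)."""
    return {
        "pages": page,
        "area": list(area),          # top, left, bottom, right in points
        "stream": True,
        "guess": guess,              # False: use the given area as-is
        "encoding": "ISO-8859-1",
        "pandas_options": {"header": None},
    }


def compare_guess(pdf_path, area, page):
    """Return the table shapes obtained with guess=True and guess=False."""
    import tabula  # imported lazily: requires tabula-py and Java

    shapes = {}
    for guess in (True, False):
        result = tabula.read_pdf(
            pdf_path, **build_read_pdf_kwargs(area, page, guess)
        )
        # Normalise: some releases return one DataFrame, others a list.
        tables = result if isinstance(result, list) else [result]
        shapes[guess] = [t.shape for t in tables]
    return shapes
```

Calling compare_guess(path, [81.106, 302.475, 384.697, 552.491], 2) on the PDF from the question would then show whether the column count differs between the two settings.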
