Best practice for reading a PDF into Python

Question

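I am trying to read a PDF document (I removed some of the content because it contained sensitive data: https://ufile.io/bgghw) into Python. I have to work with the checkboxes and perform actions based on them and on other text in the document.

I tried PyPDF3, but it only gave me corrupted output. After some research I found pdfminer, which sounds promising, but it has the drawback of requiring Python 2.7.

I am not sure whether there are other packages, or whether there is a best practice for working with PDFs in Python, because all the information I have found is several years old and much of it is contradictory. Of course, I could just pick whichever package works best for my case :)

Thanks for any advice!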

Answer 1

First option: pypdf

First, run this command in cmd to install pypdf (it may work better than the PyPDF3 you already tried):

pip install pypdf

Then use the following code to extract the text from the PDF file:

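from pypdf import PdfReader
text = "\n".join(page.extract_text() or "" for page in PdfReader('path/to/pdf/file').pages)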
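
Since the question is also about checkboxes, here is a minimal sketch of reading them with pypdf. It assumes the boxes are real AcroForm form fields; if they are only drawn shapes or glyphs in the page content, this will find nothing. `PdfReader.get_fields()` is pypdf's own call, while `checkbox_states` and `read_checkboxes` are helper names made up for this example.

```python
def checkbox_states(fields):
    """Map each button-type field name to True (checked) or False.

    `fields` is a mapping of field name -> dict-like field object, such as
    the one returned by pypdf's PdfReader.get_fields() (which may be None
    when the document has no form). "/FT" holds the field type and "/V"
    the current value.
    """
    states = {}
    for name, field in (fields or {}).items():
        # "/Btn" covers checkboxes, but also radio buttons and push buttons.
        if field.get("/FT") != "/Btn":
            continue
        value = field.get("/V")
        # The "on" name varies by document (/Yes, /On, /1, ...), so treat
        # anything other than a missing value or /Off as checked.
        states[name] = value is not None and str(value) != "/Off"
    return states


def read_checkboxes(pdf_path):
    # Imported here so checkbox_states() can be tried without pypdf installed.
    from pypdf import PdfReader

    reader = PdfReader(pdf_path)
    return checkbox_states(reader.get_fields())
```

Calling `read_checkboxes('path/to/pdf/file')` should then give you a dict of field name to True/False that you can branch on. You may need to filter further if the form also contains radio groups or push buttons.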

Second option: textract

Run this command in cmd to install textract:

pip install textract

Then, to read the PDF, use the following code:

import textract
text = textract.process('path/to/pdf/file', method='pdfminer')
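
One thing to watch for: `textract.process()` returns bytes rather than str, so you will want to decode the result before acting on the text. Here is a rough sketch, assuming the output is UTF-8; `lines_containing` and `extract_pdf_bytes` are hypothetical helper names, not part of textract.

```python
def lines_containing(raw, keyword, encoding="utf-8"):
    """Decode extracted bytes and return the lines that mention keyword.

    The match is case-insensitive; undecodable bytes are replaced rather
    than raising, since extracted PDF text is often messy.
    """
    text = raw.decode(encoding, errors="replace")
    needle = keyword.lower()
    return [line.strip() for line in text.splitlines() if needle in line.lower()]


def extract_pdf_bytes(pdf_path):
    # Imported here so lines_containing() can be tried without textract installed.
    import textract

    return textract.process(pdf_path, method="pdfminer")
```

For example, `lines_containing(extract_pdf_bytes('path/to/pdf/file'), 'total')` would give you just the lines you might want to act on.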

Good luck!