How do I read a PDF in Python?

Question

sg1*_*994 13 python pdf text-extraction python-2.7

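How do I read a PDF in Python? I know one approach is to convert it to text first, but I want to read the content directly from the PDF.

Can anyone explain which Python module is best suited for PDF extraction?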

Answer 1

You can use the PyPDF2 package.

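# install PyPDF2 (run this in a shell, not in Python)
pip install PyPDF2

# import the required module
import PyPDF2

# open the PDF in binary mode
file = open('example.pdf', 'rb')

# create a PDF reader object
# (PdfFileReader and numPages are the pre-3.0 PyPDF2 names;
#  PyPDF2 3.0+ uses PdfReader and len(reader.pages) instead)
fileReader = PyPDF2.PdfFileReader(file)

# print the number of pages in the PDF file
print(fileReader.numPages)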

See the documentation at http://pythonhosted.org/PyPDF2/
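
The snippet above opens the file and reports the page count, but it stops short of pulling out any text. Here is a minimal sketch of that step, assuming the `pypdf` package (the maintained successor to PyPDF2) is installed with `pip install pypdf`. The helper names `join_page_text` and `read_pdf_text` are made up for illustration, and scanned PDFs that contain only images will yield no text this way.

```python
def join_page_text(pages):
    """Concatenate the text of each page, skipping pages with no extractable text."""
    parts = []
    for page in pages:
        # extract_text() can return an empty string for image-only pages
        text = page.extract_text() or ""
        if text.strip():
            parts.append(text)
    return "\n".join(parts)


def read_pdf_text(path):
    """Open a PDF with pypdf and return the text of all its pages."""
    # Imported here so the helper above stays usable without pypdf installed
    from pypdf import PdfReader  # third-party: pip install pypdf

    reader = PdfReader(path)
    return join_page_text(reader.pages)


# Usage (assumes a file named example.pdf exists):
# print(read_pdf_text("example.pdf"))
```

Keeping the joining logic separate from the file handling makes it easy to try on any object that exposes an `extract_text()` method.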

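The library has gone back to its original name, "pypdf". (5 upvotes)
PyPDF2, PyPDF3, and PyPDF4 are unmaintained. [I recommend using pymupdf](/sf/answers/4446261571/) (4 upvotes)
You don't really say here how to get the actual text of the PDF. Your code only creates a <PyPDF2.pdf.PdfFileReader object at 0x10d31f278>. (3 upvotes)
Is there a workaround for the "PyPDF2.utils.PdfReadError: EOF marker not found" error? (2 upvotes)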

Answer 2

Kal*_*llz 5

You can use the textract module in Python.

Textract

Installation

pip install textract

Reading a PDF

import textract
text = textract.process('path/to/pdf/file', method='pdfminer')

More details
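
One thing the call above does not show: `textract.process` hands back bytes, not a `str`, so the result usually needs decoding before use. Here is a rough sketch of that post-processing step. `decode_extracted` and `pdf_to_lines` are hypothetical helper names, and UTF-8 output is an assumption about the extractor's default encoding.

```python
def decode_extracted(raw, encoding="utf-8"):
    """Decode extractor output (bytes) and return its non-empty, stripped lines."""
    # errors="replace" avoids a crash on the odd undecodable byte
    text = raw.decode(encoding, errors="replace")
    return [line.strip() for line in text.splitlines() if line.strip()]


def pdf_to_lines(path):
    """Run textract on a PDF and return its text as a list of lines."""
    # Imported lazily so decode_extracted works without textract installed
    import textract  # third-party: pip install textract

    raw = textract.process(path, method="pdfminer")
    return decode_extracted(raw)


# Usage (assumes textract and its pdfminer backend are installed):
# for line in pdf_to_lines("path/to/pdf/file"):
#     print(line)
```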

As far as I know, textract is broken. (5 upvotes)
Textract also seems to be dead: https://github.com/deanmalmgren/texttract/issues/350 (2 upvotes)

Answer 3

小智 5

Try PyPDF2.

There's a good tutorial here: https://automatetheboringstuff.com/chapter13/

Archived:	8 years, 6 months ago
Views:	100479
Last activity:	6 years, 9 months ago