Unable to execute the following script successfully

Question

SIM*_*SIM 1 python pdf python-imaging-library python-3.x

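I have written a Python script that combines PyPDF2, PIL and pytesseract to extract the text from the first page of a scanned PDF file. However, when I run the script below to read the content of the first scanned page, it raises the following error when it reaches the line img = Image.open(pdfReader.getPage(0)).convert('L').

The script I have tried so far: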

import PyPDF2
import pytesseract
from PIL import Image

pdfFileObj = open(r'C:\Users\WCS\Desktop\Scan project\Scanned.pdf', 'rb')

pdfReader = PyPDF2.PdfFileReader(pdfFileObj)
img = Image.open(pdfReader.getPage(0)).convert('L')
imagetext = pytesseract.image_to_string(img)
print(imagetext)
pdfFileObj.close()

The error I get:

Traceback (most recent call last):
  File "C:\Users\WCS\AppData\Local\Programs\Python\Python36-32\SO.py", line 8, in <module>
    img = Image.open(pdfReader.getPage(0)).convert('L')
  File "C:\Users\WCS\AppData\Local\Programs\Python\Python36-32\lib\site-packages\PIL\Image.py", line 2554, in open
    fp = io.BytesIO(fp.read())
AttributeError: 'PageObject' object has no attribute 'read'

How can I make this work?

Answer 1

Tar*_*ani 5

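You need to convert the PDF page to an image first and then run OCR on that image. Image.open expects a file name or a file-like object with a read() method, not a PyPDF2 PageObject, which is why the call fails with the AttributeError shown in the traceback.

See: Python: Extract a page from a PDF as a JPEG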

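import pytesseract
from PIL import Image
from pdf2image import convert_from_path

# Path to the scanned PDF (a plain string, so there is nothing to close later)
pdf_path = r'C:\Users\WCS\Desktop\Scan project\Scanned.pdf'

# Render every page to a PIL image at 500 DPI (pdf2image requires Poppler)
pages = convert_from_path(pdf_path, 500)

# Save the first page as a PNG
page = pages[0]
page.save('out.png')

# Reopen it in grayscale and run OCR on it
img = Image.open('out.png').convert('L')
imagetext = pytesseract.image_to_string(img)
print(imagetext)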

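As a possible variation on the approach above, the temporary PNG can probably be skipped, since pdf2image already returns PIL images. The sketch below is an illustration only: the function name ocr_first_page and its rasterize and recognize parameters are hypothetical names introduced here, not part of any library. It assumes Poppler and the Tesseract binary are installed, and it uses the first_page and last_page keyword arguments of convert_from_path so that only page 1 is rendered.

```python
def ocr_first_page(pdf_path, dpi=300, rasterize=None, recognize=None):
    """Return the OCR text of the first page of a scanned PDF.

    `rasterize` and `recognize` are hypothetical hooks (not library APIs);
    when left as None they default to pdf2image and pytesseract.
    """
    if rasterize is None:
        # Imported lazily; pdf2image needs Poppler on the system.
        from pdf2image import convert_from_path
        rasterize = convert_from_path
    if recognize is None:
        # Imported lazily; pytesseract needs the Tesseract binary.
        import pytesseract
        recognize = pytesseract.image_to_string

    # Render only page 1 instead of the whole document.
    pages = rasterize(pdf_path, dpi=dpi, first_page=1, last_page=1)
    if not pages:
        raise ValueError("no pages could be rendered from %r" % (pdf_path,))

    # pdf2image returns PIL images, so no intermediate file is needed.
    gray = pages[0].convert('L')
    return recognize(gray)
```

Calling print(ocr_first_page(r'C:\Users\WCS\Desktop\Scan project\Scanned.pdf')) should then print the recognized text. The default of 300 DPI is an arbitrary choice for this sketch; the answer above uses 500.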
