SIM*_*SIM 1 python pdf python-imaging-library python-3.x
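I have written a Python script that combines PyPDF2, PIL and pytesseract to extract the text from the first page of a scanned PDF file. However, when I run the script below to read the content of the first scanned page, it raises the following error on the line img = Image.open(pdfReader.getPage(0)).convert('L').
The script I have tried so far: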
import PyPDF2
import pytesseract
from PIL import Image
pdfFileObj = open(r'C:\Users\WCS\Desktop\Scan project\Scanned.pdf', 'rb')
pdfReader = PyPDF2.PdfFileReader(pdfFileObj)
img = Image.open(pdfReader.getPage(0)).convert('L')
imagetext = pytesseract.image_to_string(img)
print(imagetext)
pdfFileObj.close()
The error I get:
Traceback (most recent call last):
File "C:\Users\WCS\AppData\Local\Programs\Python\Python36-32\SO.py", line 8, in <module>
img = Image.open(pdfReader.getPage(0)).convert('L')
File "C:\Users\WCS\AppData\Local\Programs\Python\Python36-32\lib\site-packages\PIL\Image.py", line 2554, in open
fp = io.BytesIO(fp.read())
AttributeError: 'PageObject' object has no attribute 'read'
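How can I make this work?
Image.open expects a file path or a file-like object with a read() method, and a PyPDF2 PageObject is neither, which is why the AttributeError is raised. You need to render the PDF page to an image first and then run OCR on that image: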
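import pytesseract
from pdf2image import convert_from_path

# pdf2image relies on the Poppler utilities being installed.
pdf_path = r'C:\Users\WCS\Desktop\Scan project\Scanned.pdf'

# Render only the first page at 500 DPI; the result is a list of PIL images.
pages = convert_from_path(pdf_path, 500, first_page=1, last_page=1)

# Convert to grayscale directly; saving to disk and reopening is not needed.
img = pages[0].convert('L')

imagetext = pytesseract.image_to_string(img)
print(imagetext)
# pdf_path is a plain string, not a file object, so there is nothing to close.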
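If more than the first page is needed, the same render-then-OCR approach can be wrapped in a small helper. This is only a sketch: ocr_pages and its render and ocr parameters are hypothetical names introduced here, not part of pdf2image or pytesseract. They default to convert_from_path and pytesseract.image_to_string, and the default DPI of 300 is an assumption you may want to tune.

```python
def ocr_pages(pdf_path, dpi=300, first_page=1, last_page=None,
              render=None, ocr=None):
    """Return a list holding the OCR text of each rendered page.

    render and ocr are hypothetical hooks (not library features) that
    default to pdf2image.convert_from_path and
    pytesseract.image_to_string.
    """
    if render is None:
        # Imported lazily; pdf2image needs the Poppler utilities.
        from pdf2image import convert_from_path
        render = convert_from_path
    if ocr is None:
        # Imported lazily; pytesseract needs the Tesseract binary.
        import pytesseract
        ocr = pytesseract.image_to_string

    # Each rendered page is a PIL image; convert to grayscale before OCR.
    pages = render(pdf_path, dpi=dpi,
                   first_page=first_page, last_page=last_page)
    return [ocr(page.convert('L')) for page in pages]


if __name__ == "__main__":
    path = r'C:\Users\WCS\Desktop\Scan project\Scanned.pdf'
    for number, text in enumerate(ocr_pages(path), start=1):
        print("--- page", number, "---")
        print(text)
```

Rendering at a high DPI for every page can use a lot of memory on large scans, so it may help to process the document in ranges via first_page and last_page.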