Why can't I get a string with PIL and pytesseract?

Question

it_*_*ure 8 python ocr python-3.x python-tesseract

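This is a simple optical character recognition (OCR) program in Python 3 that extracts a string from an image. I have uploaded the target GIF file here; please download it and save it as /tmp/target.gif.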

try:
    from PIL import Image
except ImportError:
    import Image
import pytesseract
print(pytesseract.image_to_string(Image.open('/tmp/target.gif')))

I have pasted the full error output below. Please help me fix it so that I can get the characters from the image.

/usr/lib/python3/dist-packages/PIL/Image.py:925: UserWarning: Couldn't allocate palette entry for transparency
  "for transparency")
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/local/lib/python3.5/dist-packages/pytesseract/pytesseract.py", line 309, in image_to_string
    }[output_type]()
  File "/usr/local/lib/python3.5/dist-packages/pytesseract/pytesseract.py", line 308, in <lambda>
    Output.STRING: lambda: run_and_get_output(*args),
  File "/usr/local/lib/python3.5/dist-packages/pytesseract/pytesseract.py", line 208, in run_and_get_output
    temp_name, input_filename = save_image(image)
  File "/usr/local/lib/python3.5/dist-packages/pytesseract/pytesseract.py", line 136, in save_image
    image.save(input_file_name, format=img_extension, **image.info)
  File "/usr/lib/python3/dist-packages/PIL/Image.py", line 1728, in save
    save_handler(self, fp, filename)
  File "/usr/lib/python3/dist-packages/PIL/GifImagePlugin.py", line 407, in _save
    _get_local_header(fp, im, (0, 0), flags)
  File "/usr/lib/python3/dist-packages/PIL/GifImagePlugin.py", line 441, in _get_local_header
    transparency = int(transparency)
TypeError: int() argument must be a string, a bytes-like object or a number, not 'tuple'

I converted it to JPEG with the convert command in bash.

convert  "/tmp/target.gif"   "/tmp/target.jpg"

I show /tmp/target.gif and /tmp/target.jpg here.

Then I ran the Python code above again, this time on the JPEG.

try:
    from PIL import Image
except ImportError:
    import Image
import pytesseract
print(pytesseract.image_to_string(Image.open('/tmp/target.jpg')))

I get nothing from pytesseract.image_to_string(Image.open('/tmp/target.jpg')); it returns only whitespace.

For Trenton_M's code:

>>> img1 = remove_noise_and_smooth(r'/tmp/target.jpg')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "<stdin>", line 3, in remove_noise_and_smooth
AttributeError: 'NoneType' object has no attribute 'astype'

For Thalish Sajeed's code:

Ignore the error message caused by print(pytesseract.image_to_string(Image.open(filename))).

Type "help", "copyright", "credits" or "license" for more information.
>>> from PIL import Image
>>> import pytesseract
>>> import matplotlib.pyplot as plt
>>> import cv2
>>> import numpy as np
>>> 
>>> 
>>> def display_image(filename, length_box=60, width_box=30):
...     if type(filename) == np.ndarray:
...         image = filename
...     else:
...         image = cv2.imread(filename)
...     plt.figure(figsize=(length_box, width_box))
...     plt.imshow(image, cmap="gray")
... 
>>> 
>>> filename = r"/tmp/target.jpg"
>>> display_image(filename)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "<stdin>", line 7, in display_image
  File "/usr/local/lib/python3.5/dist-packages/matplotlib/pyplot.py", line 2699, in imshow
    None else {}), **kwargs)
  File "/usr/local/lib/python3.5/dist-packages/matplotlib/__init__.py", line 1810, in inner
    return func(ax, *args, **kwargs)
  File "/usr/local/lib/python3.5/dist-packages/matplotlib/axes/_axes.py", line 5494, in imshow
    im.set_data(X)
  File "/usr/local/lib/python3.5/dist-packages/matplotlib/image.py", line 634, in set_data
    raise TypeError("Image data cannot be converted to float")
TypeError: Image data cannot be converted to float
>>>

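@Thalish Sajeed, why do I get 9244K instead of 0244k with your code? Here is the sample file I tested.

The extracted string.

@Trenton_M, I corrected some typos and omissions in your code and removed the plt.show() line as you suggested.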

>>> import cv2,pytesseract
>>> import numpy as np
>>> import matplotlib.pyplot as plt
>>> 
>>> 
>>> def image_smoothening(img):
...     ret1, th1 = cv2.threshold(img, 88, 255, cv2.THRESH_BINARY)
...     ret2, th2 = cv2.threshold(th1, 0, 255, cv2.THRESH_BINARY + cv2.THRESH_OTSU)
...     blur = cv2.GaussianBlur(th2, (5, 5), 0)
...     ret3, th3 = cv2.threshold(blur, 0, 255, cv2.THRESH_BINARY + cv2.THRESH_OTSU)
...     return th3
... 
>>> 
>>> def remove_noise_and_smooth(file_name):
...     img = cv2.imread(file_name, 0)
...     filtered = cv2.adaptiveThreshold(img, 255, cv2.ADAPTIVE_THRESH_MEAN_C, cv2.THRESH_BINARY, 9, 41)
...     kernel = np.ones((1, 1), np.uint8)
...     opening = cv2.morphologyEx(filtered, cv2.MORPH_OPEN, kernel)
...     closing = cv2.morphologyEx(opening, cv2.MORPH_CLOSE, kernel)
...     img = image_smoothening(img)
...     or_image = cv2.bitwise_or(img, closing)
...     return or_image
... 
>>> 
>>> cv2_thresh_list = [cv2.THRESH_BINARY, cv2.THRESH_TRUNC, cv2.THRESH_TOZERO]
>>> fn = r'/tmp/target.jpg'
>>> img1 = remove_noise_and_smooth(fn)
>>> img2 = cv2.imread(fn, 0)
>>> for i, img in enumerate([img1, img2]):
...     img_type = {0: 'Preprocessed Images\n',
...                 1: '\nUnprocessed Images\n'}
...     print(img_type[i])
...     for item in cv2_thresh_list:
...         print('Thresh: {}'.format(str(item)))
...         _, thresh = cv2.threshold(img, 127, 255, item)
...         plt.imshow(thresh, 'gray')
...         f_name = '{0}.jpg'.format(str(item))
...         plt.savefig(f_name)
...         print('OCR Result: {}\n'.format(pytesseract.image_to_string(f_name)))

...Preprocessed Images

In my console, the full output is as follows:

Thresh: 0
<matplotlib.image.AxesImage object at 0x7fbc2519a6d8>
OCR Result: 10
15
20 

Ed??
10
2 o 30 40 so
so

Thresh: 2
<matplotlib.image.AxesImage object at 0x7fbc255e7eb8>
OCR Result: 10
15
20
Ed??
10
2 o 30 40 so
so
Thresh: 3
<matplotlib.image.AxesImage object at 0x7fbc25452fd0>
OCR Result: 10
15
20
Ed??
10
2 o 30 40 so
so
Unprocessed Images
Thresh: 0
<matplotlib.image.AxesImage object at 0x7fbc25464c88>
OCR Result: 10
15
20
Thresh: 2
<matplotlib.image.AxesImage object at 0x7fbc254520f0>
OCR Result: 10
15
2o
2o
30 40 50
Thresh: 3
<matplotlib.image.AxesImage object at 0x7fbc1e1968d0>
OCR Result: 10
15
20

Where is the string 0244R?

Answer 1

Tha*_*eed 5

Let's start with the JPG image, because pytesseract has problems handling the GIF image format (reference).

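import cv2
import pytesseract

filename = "/tmp/target.jpg"
image = cv2.imread(filename)  # returns None if the file cannot be read
gray = cv2.cvtColor(image, cv2.COLOR_BGR2GRAY)
# Pixels brighter than 55 become white (255); all others become black (0)
ret, threshold = cv2.threshold(gray, 55, 255, cv2.THRESH_BINARY)
print(pytesseract.image_to_string(threshold))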
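
On the GIF problem itself: the traceback in the question fails at transparency = int(transparency) while the image is being re-saved together with the contents of image.info. A possible workaround, instead of converting with ImageMagick, is to convert the GIF to RGB in Pillow and drop the transparency entry before handing the image to pytesseract. This is only a sketch: the helper name load_gif_as_clean_rgb is made up for illustration, and it may not avoid the error on every pytesseract/Pillow combination.

```python
from PIL import Image


def load_gif_as_clean_rgb(path):
    """Open an image and return an RGB copy without transparency metadata.

    Hypothetical helper, not part of Pillow or pytesseract.
    """
    with Image.open(path) as img:
        # convert() loads the first frame and expands the palette to RGB
        rgb = img.convert("RGB")
    # Assumption: the tuple-valued "transparency" entry in .info is what
    # int() rejects in the traceback, so remove it if it is present.
    rgb.info.pop("transparency", None)
    return rgb
```

The returned image could then be passed to pytesseract.image_to_string() in place of Image.open('/tmp/target.gif'). This only addresses the TypeError; the noise problem discussed below still applies.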

Let's try to break down the problem here.

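Your image is too noisy for the tesseract engine to recognize the letters, so we use some simple image processing techniques, such as grayscale conversion and thresholding, to remove some of the noise from the image.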
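
To make those two steps concrete, here is a small NumPy-only sketch of what grayscale conversion and binary thresholding do to pixel values. It is not OpenCV's implementation; it only mirrors the documented behavior of cv2.cvtColor with COLOR_BGR2GRAY (approximately, since rounding may differ) and of cv2.threshold with THRESH_BINARY. The function names to_gray and binary_threshold are made up for this example.

```python
import numpy as np


def to_gray(bgr):
    """Weighted sum of the B, G, R channels using the usual luma weights."""
    weights = np.array([0.114, 0.587, 0.299])  # B, G, R order, as OpenCV stores pixels
    return np.rint(bgr.astype(np.float64) @ weights).astype(np.uint8)


def binary_threshold(gray, thresh=55, maxval=255):
    """Pixels strictly above `thresh` become `maxval`; all others become 0."""
    return np.where(gray > thresh, maxval, 0).astype(np.uint8)


# One row of four BGR pixels: black, white, and two darker colors
pixels = np.array([[[0, 0, 0], [255, 255, 255], [40, 60, 50], [200, 30, 30]]],
                  dtype=np.uint8)
gray = to_gray(pixels)
binary = binary_threshold(gray)  # only the white pixel stays white
```

Faint noise that is darker than the threshold collapses to pure black, which is why the text stands out more clearly afterwards, provided the characters themselves are brighter than the threshold.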

Then, when we send it to the OCR engine, we see that the letters are captured more accurately.

If you follow this github link, you can find the notebook I tested this in.

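Edit - I have updated the notebook with some additional image cleaning techniques. The source image is too noisy for tesseract to work on it directly out of the box. You need to use image cleaning techniques.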

You can vary the threshold parameters, or swap the Gaussian blur for some other technique, until you get the desired result.
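
One way to vary the threshold systematically is to loop over several values and compare the OCR text for each. The sketch below assumes a grayscale NumPy array as input and takes the OCR function as a parameter, so that any callable accepting an image array can be plugged in. The name sweep_thresholds and the chosen threshold values are illustrative assumptions, not something from the notebook.

```python
import numpy as np


def sweep_thresholds(gray, thresholds, ocr):
    """Binarize `gray` at each threshold and collect the OCR text per threshold.

    `ocr` is any callable that takes an image array and returns a string.
    """
    results = {}
    for t in thresholds:
        # Same rule as a binary threshold: above t -> white, otherwise black
        binary = np.where(gray > t, 255, 0).astype(np.uint8)
        results[t] = ocr(binary).strip()
    return results
```

For example, sweep_thresholds(gray, range(40, 130, 10), pytesseract.image_to_string) would return a dictionary mapping each threshold to the recognized text, which makes it easy to spot the setting that reads the characters best.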

If you want to run OCR on noisy images, take a look at commercial OCR providers such as google-cloud-vision. They offer 1000 OCR calls per month for free.
