标签: python-tesseract

使用 pytesseract 检测孟加拉语字符

我正在尝试使用 python 从图像中检测孟加拉字符，所以我决定使用pytesseract。为此，我使用了以下代码：

import pytesseract
from PIL import Image, ImageEnhance, ImageFilter

im = Image.open("input.png") # the second one
im = im.filter(ImageFilter.MedianFilter())
enhancer = ImageEnhance.Contrast(im)
im = enhancer.enhance(2)
im = im.convert('1')
im.save('temp2.png')
pytesseract.pytesseract.tesseract_cmd = 'C:/Program Files (x86)/Tesseract-OCR/tesseract'
text = pytesseract.image_to_string(Image.open('temp2.png'),lang="ben")
print text

Run Code Online (Sandbox Code Playgroud)

问题是，如果我给出了英文字符的图像，则会被检测到。但是，当我lang="ben"从孟加拉字符的图像中编写和检测时，我的代码会运行无休止的时间或永远。

PS：我已将孟加拉语训练数据下载到 tessdata 文件夹，我正在尝试在PyCharm 中运行它。

谁能帮我解决这个问题？

input.png 的示例

python python-tesseract

Pia*_*nti

2017 03-28

2
推荐指数

1
解决办法

2649
查看次数

使用Python从具有两列或三列数据的图像中使用OCR读取图像中的文本

在示例图像中（仅作为参考，我的图像将具有相同的图案），一个页面具有完整的水平文本，其他页面具有两个水平文本列。

如何在python中自动检测文档的模式并逐一读取另一列数据？

我将 Tesseract OCR 与 Psm 6 一起使用，它是水平读取的，这是错误的。

python ocr tesseract python-2.7 python-tesseract

作者

2018 04-14

2
推荐指数

1
解决办法

3307
查看次数

如何将 pytesseract 部署到 Heroku

我有一个 Python 应用程序，它在我的机器上通过 Localhost 写得很好。

我正在尝试将其部署到 Heroku。然而，这似乎不可能完成（我现在已经花了大约 30 个小时尝试）。

问题是 Tesseract OCR。我正在使用 pytesseract 包装器，我的代码利用了它。但是，无论我尝试什么，在将pytesseract上传到Heroku时似乎都无法使用它。

任何人都可以建议如何通过 pytesseract 将 Hello World Tesseract OCR Python 应用程序部署到 Heroku，或者如果 Heroku 无法做到这一点，请提出 Heroku 的替代方案？

opencv tesseract heroku python-3.x python-tesseract

use*_*126

lucky-day

2
推荐指数

2
解决办法

2744
查看次数

改进 pytesseract 从图像中正确识别文本

我正在尝试使用pytesseract模块读取验证码。它大部分时间都提供准确的文本，但并非总是如此。

这是读取图像、操作图像和从图像中提取文本的代码。

import cv2
import numpy as np
import pytesseract

def read_captcha():
    # opencv loads the image in BGR, convert it to RGB
    img = cv2.cvtColor(cv2.imread('captcha.png'), cv2.COLOR_BGR2RGB)

    lower_white = np.array([200, 200, 200], dtype=np.uint8)
    upper_white = np.array([255, 255, 255], dtype=np.uint8)

    mask = cv2.inRange(img, lower_white, upper_white)  # could also use threshold
    mask = cv2.morphologyEx(mask, cv2.MORPH_OPEN, cv2.getStructuringElement(cv2.MORPH_ELLIPSE, (3, 3)))  # "erase" the small white points in the resulting mask
    mask = cv2.bitwise_not(mask)  # invert mask

    # load background (could be an …

Run Code Online (Sandbox Code Playgroud)

python ocr opencv image-processing python-tesseract

Ton*_*ana

2019 07-26

2
推荐指数

1
解决办法

3020
查看次数

多页 Tiff 图像的 PyTesseract 错误

当我读取 15 页的多页 Tiff 图像并且是白色背景的黑色字母/单词的文档时，PyTesseract 在我循环页面并转换为字符串的步骤中抛出“OSError：-9”错误。

我将 pytesseract 包与 pyocr.builders 一起使用。单页似乎工作正常，但我相信当图像不是 RGB 时程序会转换为 RGB 时会出现错误。

img = Image.open(r'\users\ai\text.tiff')
img.load()
txt = ""
for frame in range(0, img.n_frames):
    img.seek(frame)
    txt += tool.image_to_string(img,builder=pyocr.builders.TextBuilder())

Run Code Online (Sandbox Code Playgroud)

预期输出是 jupyter 窗口中显示的所有 15 页。

错误信息

---------------------------------------------------------------------------
OSError                                   Traceback (most recent call last)
<ipython-input-17-e59bdf3b773c> in <module>
      2 for frame in range(0, img.n_frames):
      3     img.seek(frame)
----> 4     txt += tool.image_to_string(img,builder=pyocr.builders.TextBuilder())
      5 

~\AppData\Local\Continuum\anaconda3\lib\site-packages\pyocr\tesseract.py in image_to_string(image, lang, builder)
    357     with tempfile.TemporaryDirectory() as tmpdir:
    358         if image.mode != "RGB":
--> 359             image = …

Run Code Online (Sandbox Code Playgroud)

python-3.x python-tesseract

ESL*_*ner

lucky-day

2
推荐指数

1
解决办法

1488
查看次数

使用python检查pytesseract版本

有没有办法在python中检查pytesseract版本？
根据pytesseract的PyPi文档，有一个内置函数get_tesseract_version来获取pytesseract版本。但是当我在 python 中运行它时，我得到以下信息：

>>> import pytesseract
>>> pytesseract.get_tesseract_version
<function get_tesseract_version at 0x7f4b9edd4598>
>>> print(pytesseract.get_tesseract_version)
<function get_tesseract_version at 0x7f4b9edd4598>

Run Code Online (Sandbox Code Playgroud)

我知道我可以使用 pytesseract 版本pip freeze，但我想使用 python 获得它。那可能吗？

python python-tesseract

sin*_*ium

lucky-day

2
推荐指数

2
解决办法

7229
查看次数

OCR 的背景图像清理

通过tesseract-OCR，我试图从以下带有红色背景的图像中提取文本。

我在提取框 B 和 D 中的文本时遇到问题，因为有垂直线。我怎样才能像这样清理背景：

输入：

输出：

一些想法？没有框的图像：

python ocr image image-processing python-tesseract

ste*_*dv

2019 11-01

2
推荐指数

1
解决办法

1779
查看次数

使用 Tesseract OCR 4.x 保留缩进

我正在努力使用 Tesseract OCR。我有一张血液检查图像，它有一张带压痕的表格。尽管 tesseract 能够很好地识别字符，但其结构并未保留在最终输出中。例如，查看“Emocromo con formula”（英文翻译：带有公式的血细胞计数）下面的缩进行。我想保留那个缩进。

我阅读了其他相关讨论并找到了选项preserve_interword_spaces=1。结果稍微好一点，但正如您所看到的，它并不完美。

有什么建议？

更新：

我尝试了 Tesseract v5.0，结果是一样的。

代码：

Tesseract 版本是 4.0.0.20190314

from PIL import Image
import pytesseract

# Preserve interword spaces is set to 1, oem = 1 is LSTM, 
# PSM = 1 is Automatic page segmentation with OSD - Orientation and script detection

custom_config = r'-c preserve_interword_spaces=1 --oem 1 --psm 1 -l eng+ita'

# default_config = r'-c -l eng+ita'

extracted_text = pytesseract.image_to_string(Image.open('referto-1.jpg'), config=custom_config)

print(extracted_text)

# saving to a txt …

Run Code Online (Sandbox Code Playgroud)

python ocr tesseract computer-vision python-tesseract

Ale*_*ini

2020 01-06

2
推荐指数

1
解决办法

2176
查看次数

错误：如何修复 Opencv python 中的“SystemError: <built-in function imshow> returned NULL without setting an error”

我正在研究一个项目使用计算机视觉从发票中提取数据，我正在尝试使用 opencv 和 pytesseract 从图像发票中提取数据，并进一步Regex将原始数据分离到不同的部分，如日期、供应商名称、发票编号，项目名称和项目数量。开始时我试图提取日期但遇到错误。

这是我的代码

import pytesseract
from pytesseract import Output
import cv2

img = cv2.imread('invoice.png')
d = pytesseract.image_to_data(img, output_type=Output.DICT)
n_boxes = len(d['level'])
for i in range(n_boxes):
    (x, y, w, h) = (d['left'][i], d['top'][i], d['width'][i], d['height'][i])
    img = cv2.rectangle(img, (x, y), (x + w, y + h), (0, 0, 255), 2)

cv2.imshow(img,'img')

Run Code Online (Sandbox Code Playgroud)

但我收到这个错误

File "testpdf3.py", line 12, in <module>
    cv2.imshow(img,'img')
SystemError: <built-in function imshow> returned NULL without setting an error

Run Code Online (Sandbox Code Playgroud)

ocr opencv python-3.x python-tesseract

May*_*tav

lucky-day

2
推荐指数

1
解决办法

8651
查看次数

无法使用 pytesseract.image_to_string 从 Image 读取文本

这里的问题是我需要删除行并编写代码来识别字符。到目前为止，我已经看到了解决方案，其中 char 是实心的，但它具有带双边框的 char。

python captcha opencv python-imaging-library python-tesseract

vis*_*219

2021 05-27

2
推荐指数

1
解决办法

183
查看次数

标签统计

python-tesseract ×10

python ×7

ocr ×5

opencv ×4

python-3.x ×3

tesseract ×3

image-processing ×2

captcha ×1

computer-vision ×1

heroku ×1

image ×1

python-2.7 ×1

python-imaging-library ×1

标签 统计

标签统计