Tuning pytesseract parameters

Question

She*_*don 6 ocr opencv image-processing python-tesseract

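Note: I am migrating this question from Data Science Stack Exchange, where it received little attention.

I am trying to implement an OCR solution that recognizes numbers read from a picture of a screen.

Because I am working with a dark background, I first invert the image, then convert it to grayscale and threshold it: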

inverted_cropped_image = cv2.bitwise_not(cropped_image)
gray = get_grayscale(inverted_cropped_image)
thresholded_image = cv2.threshold(gray, 100, 255, cv2.THRESH_BINARY)[1]
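
As a side note on the thresholding step, the sketch below restates the per-pixel rules of cv2.bitwise_not, cv2.THRESH_BINARY and cv2.THRESH_BINARY_INV in plain Python. The helper names are hypothetical and only illustrate the arithmetic on 8-bit gray levels; they are not part of OpenCV. It suggests that inverting and then thresholding at 100 is roughly the same as an inverse threshold at 154 on the non-inverted grayscale image (ignoring rounding in the color conversion).

```python
def invert(pixel):
    """Hypothetical helper: what cv2.bitwise_not does to one 8-bit value."""
    return 255 - pixel


def thresh_binary(pixel, thresh, maxval=255):
    """cv2.THRESH_BINARY rule: maxval where pixel > thresh, else 0."""
    return maxval if pixel > thresh else 0


def thresh_binary_inv(pixel, thresh, maxval=255):
    """cv2.THRESH_BINARY_INV rule: 0 where pixel > thresh, else maxval."""
    return 0 if pixel > thresh else maxval


# Compare both routes for every possible 8-bit gray level.
same = all(
    thresh_binary(invert(g), 100) == thresh_binary_inv(g, 154)
    for g in range(256)
)
print(same)
```

In other words, with these settings every original gray level below 155 ends up white and everything brighter ends up black, so bright digits on a dark screen become black text on a white background.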

I then call pytesseract's image_to_data function, which outputs a dictionary containing the detected text regions and their confidence values:

from pytesseract import Output
results = pytesseract.image_to_data(thresholded_image, output_type=Output.DICT)

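Finally, I iterate over results and draw the regions whose confidence exceeds a user-defined threshold (70%). What bothers me is that my script recognizes everything in the image except the number I want to recognize (1227.938).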
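
The filtering step can be sketched without Tesseract at all. The function name confident_words and the sample dictionary below are made up for illustration; the dictionary only mimics the shape of Output.DICT (parallel lists under "text", "conf", "left", "top", "width", "height"). The sketch assumes that "conf" entries may arrive as int, float or str depending on the pytesseract version, and that non-word rows carry a confidence of -1.

```python
def confident_words(results, min_conf=70):
    """Return (text, conf, (x, y, w, h)) for non-empty rows above min_conf."""
    kept = []
    for i, text in enumerate(results["text"]):
        conf = float(results["conf"][i])  # tolerate int, float or str values
        if conf > min_conf and text.strip():
            box = (results["left"][i], results["top"][i],
                   results["width"][i], results["height"][i])
            kept.append((text, conf, box))
    return kept


# Hand-made sample in the same shape as Output.DICT (not real OCR output).
sample = {
    "text": ["", "Speed", "1227", "938"],
    "conf": ["-1", 96, 42.5, "88.0"],
    "left": [0, 10, 120, 200],
    "top": [0, 5, 5, 12],
    "width": [300, 80, 70, 40],
    "height": [60, 30, 40, 25],
}

for text, conf, box in confident_words(sample):
    print(text, conf, box)
```

Calling confident_words(sample, min_conf=0) is a quick way to check whether the missing number was detected at low confidence or not detected at all.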

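My first guess was that the image_to_data parameters were not set correctly.

After checking this website, I chose a page segmentation mode (psm) of 11 (sparse text) and tried to whitelist numbers only (tessedit_char_whitelist=0123456789m.):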

results = pytesseract.image_to_data(thresholded_image, config='--psm 11 --oem 3 -c tessedit_char_whitelist=0123456789m.', output_type=Output.DICT)
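
When trying several page segmentation modes, it can help to build the config string in one place. The helper build_config below is hypothetical (it is not part of pytesseract); it only assembles the same --psm, --oem and -c tessedit_char_whitelist options used above, so that modes such as 6 (single uniform block), 7 (single text line) and 11 (sparse text) can be compared in a loop.

```python
def build_config(psm, oem=3, whitelist=None):
    """Assemble a Tesseract config string; hypothetical convenience helper."""
    parts = ["--psm {}".format(psm), "--oem {}".format(oem)]
    if whitelist:
        parts.append("-c tessedit_char_whitelist={}".format(whitelist))
    return " ".join(parts)


# Candidate configurations to pass as config=... to image_to_data.
for psm in (6, 7, 11):
    print(build_config(psm, whitelist="0123456789."))
```

Each printed string can be passed unchanged as the config argument of image_to_data or image_to_string.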

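Alas, this is even worse: the script now recognizes nothing at all!

Do you have any suggestions? Am I missing something obvious here?

Edit #1:

As requested by Ann Zen, here is the code used to obtain the first image: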

import imutils
import cv2
import matplotlib.pyplot as plt
import numpy as np
import pytesseract
from pytesseract import Output

def get_grayscale(image):
    return cv2.cvtColor(image, cv2.COLOR_BGR2GRAY)

filename = "IMAGE.JPG"
cropped_image = cv2.imread(filename)
inverted_cropped_image = cv2.bitwise_not(cropped_image)

gray = get_grayscale(inverted_cropped_image)

thresholded_image = cv2.threshold(gray, 100, 255, cv2.THRESH_BINARY)[1]

results = pytesseract.image_to_data(thresholded_image, config='--psm 11 --oem 3 -c tessedit_char_whitelist=0123456789m.', output_type=Output.DICT)

color = (255, 255, 255)
for i in range(0, len(results["text"])):
    x = results["left"][i]
    y = results["top"][i]
    w = results["width"][i]
    h = results["height"][i]
    text = results["text"][i]
    conf = int(float(results["conf"][i]))  # conf may be a str or float depending on the pytesseract version
    print("Confidence: {}".format(conf))
    if conf > 70:
        print("Confidence: {}".format(conf))
        print("Text: {}".format(text))
        print("")
        text = "".join([c if ord(c) < 128 else "" for c in text]).strip()
        cv2.rectangle(cropped_image, (x, y), (x + w, y + h), color, 2)
        cv2.putText(cropped_image, text, (x, y - 10), cv2.FONT_HERSHEY_SIMPLEX,1.2, color, 3)
cv2.imshow('Image', cropped_image)
cv2.waitKey(0)

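Edit #2:

Rarely have I spent reputation points so well! All three replies posted so far have helped me refine my algorithm.

First, I wrote a Tkinter program that lets me manually crop the image around the quantity of interest (adapted from the one found in this SO post).

Then I used Ann Zen's idea of narrowing the search area around the decimal part. I am using her neat process function to prepare the grayscale image for contour extraction: contours, _ = cv2.findContours(process(img_gray), cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE). I use RETR_EXTERNAL to avoid dealing with overlapping bounding rectangles.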

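I then sort the contours from left to right. Bounding rectangles that exceed a user-defined threshold are associated with the integer part (white rectangles); the others are associated with the decimal part (black rectangles).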
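
A minimal sketch of this sorting and splitting step follows. The post does not say which dimension the threshold applies to; the sketch assumes it is the rectangle height, on the grounds that the integer digits are drawn larger than the decimals. The function name split_by_height and the sample boxes are hypothetical, with boxes given as (x, y, w, h) tuples in the form returned by cv2.boundingRect.

```python
def split_by_height(boxes, min_height):
    """Sort (x, y, w, h) boxes left to right, then split them by height.

    Assumption: taller boxes belong to the integer part, shorter ones
    to the decimal part.
    """
    ordered = sorted(boxes, key=lambda b: b[0])  # sort by x coordinate
    integer_part = [b for b in ordered if b[3] > min_height]
    decimal_part = [b for b in ordered if b[3] <= min_height]
    return integer_part, decimal_part


# Made-up rectangles: two tall digits on the left, two short ones on the right.
boxes = [(150, 20, 18, 25), (10, 5, 30, 40), (45, 5, 30, 40), (130, 20, 18, 25)]
big, small = split_by_height(boxes, min_height=30)
print(big)
print(small)
```

Each group can then be cropped and passed to Tesseract separately, so that a single call never sees two font sizes on the same line.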

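Then I extract the characters using Esraa's method, i.e. applying a Gaussian blur before calling Tesseract. I used a much larger kernel for this (15x15 instead of 3x3).

I am not out of the woods yet, but I hope to get better results by using Ahx's adaptive thresholding.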

Answer 1

Esr*_*oud 1

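I have been working with Tesseract for a while, so let me clarify a few things for you. Tesseract is very useful if your main goal is recognizing text in documents rather than any other kind of computer vision task. It usually needs a binarized image to produce good output, so you will always need some image preprocessing.

However, after many attempts with all the page segmentation modes, I realized that it fails when the font size differs within the same line and there is no space in between. Sometimes PSM 6 helps if the difference is small, but in your case you may want to try another option. If you do not care about the decimals, you can try the following solution: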

import cv2
import pytesseract

img = cv2.imread(r'E:\Downloads\Iwzrg.png')
gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)
# Light blur to suppress noise before thresholding
img_blur = cv2.GaussianBlur(gray, (3, 3), 0)
_, thresh = cv2.threshold(img_blur, 200, 255, cv2.THRESH_BINARY_INV)

# If using a fixed camera: keep rows 0-99 and columns 80-319
new_img = thresh[0:100, 80:320]

text = pytesseract.image_to_string(new_img, lang='eng', config='--psm 6 --oem 3 -c tessedit_char_whitelist=0123456789')

Output: 1227
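
The string returned by image_to_string may carry trailing whitespace such as a newline or form feed, so a small cleanup step before converting it to a number can be useful. The helper clean_number below is hypothetical, and the raw string is a made-up example rather than captured OCR output.

```python
def clean_number(raw):
    """Keep only ASCII digits and dots from an OCR result string."""
    return "".join(ch for ch in raw if ch in "0123456789.")


raw = "1227\n\x0c"  # made-up example of a raw OCR string
print(clean_number(raw))
```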
