Pytesseract: improving OCR accuracy

Question

Sus*_*hil 5 python ocr tesseract python-3.x pytesser

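I want to extract text from an image in Python, and I chose pytesseract to do it. When I tried extracting text from the image, the results were not satisfactory. I also went through this and implemented all the techniques listed there. Even so, it does not seem to perform well.

Image:

Code: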

import pytesseract
import cv2
import numpy as np

img = cv2.imread('D:\\wordsimg.png')

img = cv2.resize(img, None, fx=1.2, fy=1.2, interpolation=cv2.INTER_CUBIC)

img = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)

kernel = np.ones((1,1), np.uint8)
img = cv2.dilate(img, kernel, iterations=1)
img = cv2.erode(img, kernel, iterations=1)

img = cv2.threshold(cv2.medianBlur(img, 3), 0, 255, cv2.THRESH_BINARY + cv2.THRESH_OTSU)[1]

pytesseract.pytesseract.tesseract_cmd = 'C:\\Program Files\\Tesseract-OCR\\tesseract.exe'
    
txt = pytesseract.image_to_string(img ,lang = 'eng')

txt = txt[:-1]

txt = txt.replace('\n',' ')

print(txt)

Output:

t hose he large form might light another us should took mountai house n story important went own own thought girl over family look some much ask the under why miss point make mile grow do own school was

Even a single unwanted space costs me a lot. I want the result to be 100% accurate. Any help would be appreciated. Thanks!

Answer 1

bfr*_*ris 8

I changed the resize factor from 1.2 to 2 and removed all of the preprocessing. I got good results with psm 11 and psm 12.

import pytesseract
import cv2
import numpy as np

img = cv2.imread('wavy.png')

#  img = cv2.resize(img, None, fx=1.2, fy=1.2, interpolation=cv2.INTER_CUBIC)
img = cv2.resize(img, None, fx=2, fy=2)

img = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)

kernel = np.ones((1,1), np.uint8)
#  img = cv2.dilate(img, kernel, iterations=1)
#  img = cv2.erode(img, kernel, iterations=1)

#  img = cv2.threshold(cv2.medianBlur(img, 3), 0, 255, cv2.THRESH_BINARY + cv2.THRESH_OTSU)[1]

cv2.imwrite('thresh.png', img)

pytesseract.pytesseract.tesseract_cmd = 'C:\\Program Files (x86)\\Tesseract-OCR\\tesseract.exe'
    
for psm in range(6,13+1):
    config = '--oem 3 --psm %d' % psm
    txt = pytesseract.image_to_string(img, config = config, lang='eng')
    print('psm ', psm, ':',txt)

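The line config = '--oem 3 --psm %d' % psm uses the string interpolation (%) operator to replace %d with the integer psm. I'm not sure exactly what oem does, but I've gotten into the habit of using it. There is more about psm at the end of this answer.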

psm  11 : those he large form might light another us should name

took mountain story important went own own thought girl

over family look some much ask the under why miss point

make mile grow do own school was

psm  12 : those he large form might light another us should name

took mountain story important went own own thought girl

over family look some much ask the under why miss point

make mile grow do own school was
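
The output above still contains line breaks and blank lines, and the question notes that even one stray space matters. Below is a minimal sketch of flattening OCR text into single-spaced words, assuming the goal is one line of space-separated words. The name normalize_ocr_text is a hypothetical helper, not part of pytesseract.

```python
def normalize_ocr_text(text: str) -> str:
    """Collapse every run of whitespace into a single space (hypothetical helper)."""
    # str.split() with no arguments splits on any run of whitespace
    # (spaces, newlines, form feeds) and drops leading/trailing whitespace,
    # so no empty tokens survive to produce double spaces.
    return " ".join(text.split())


# Sample text shaped like the psm 11 output: blank lines between text lines.
raw = "those he large form\n\ntook mountain story\n\x0c"
print(normalize_ocr_text(raw))
```

This replaces the txt[:-1] and txt.replace('\n', ' ') steps from the question without leaving doubled spaces where blank lines were. It only tidies whitespace between words; it cannot repair a word that the OCR step itself split or misread.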

psm is short for page segmentation mode. I'm not entirely sure how the different modes behave, but the descriptions give a sense of what each code means. You can get the list by running tesseract --help-psm

Page segmentation modes:
  0    Orientation and script detection (OSD) only.
  1    Automatic page segmentation with OSD.
  2    Automatic page segmentation, but no OSD, or OCR. (not implemented)
  3    Fully automatic page segmentation, but no OSD. (Default)
  4    Assume a single column of text of variable sizes.
  5    Assume a single uniform block of vertically aligned text.
  6    Assume a single uniform block of text.
  7    Treat the image as a single text line.
  8    Treat the image as a single word.
  9    Treat the image as a single word in a circle.
 10    Treat the image as a single character.
 11    Sparse text. Find as much text as possible in no particular order.
 12    Sparse text with OSD.
 13    Raw line. Treat the image as a single text line,
       bypassing hacks that are Tesseract-specific.
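
The mode numbers listed above are what goes after --psm in the config string. As a hedged sketch, the helper below builds that string and rejects numbers outside the listed 0 to 13 range. The names build_tesseract_config and ocr_with_psm are hypothetical, and the 0 to 3 range for --oem is an assumption based on what tesseract --help-oem prints for Tesseract 4 and later (OCR engine modes, where 3 means "default, based on what is available").

```python
def build_tesseract_config(psm: int, oem: int = 3) -> str:
    """Return a config string such as '--oem 3 --psm 11' (hypothetical helper)."""
    # Range taken from the page segmentation mode list above.
    if not 0 <= psm <= 13:
        raise ValueError(f"psm must be between 0 and 13, got {psm}")
    # Assumption: engine modes 0-3, as listed by `tesseract --help-oem`.
    if not 0 <= oem <= 3:
        raise ValueError(f"oem must be between 0 and 3, got {oem}")
    return f"--oem {oem} --psm {psm}"


def ocr_with_psm(img, psm: int) -> str:
    """Run pytesseract on an image with the given page segmentation mode."""
    # Imported lazily so build_tesseract_config works without Tesseract installed.
    import pytesseract
    return pytesseract.image_to_string(img, lang="eng", config=build_tesseract_config(psm))


print(build_tesseract_config(11))  # --oem 3 --psm 11
```

With an image loaded as in the answer's code, ocr_with_psm(img, 11) would be equivalent to one iteration of the loop there.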
