删除水平下划线

Question

删除水平下划线

Bra*_*mon 22 c++ python opencv tesseract cv2

我试图从包含死刑记录信息的几百个JPG中提取文字; JPG由德克萨斯州刑事司法部(TDCJ)主持.以下是删除了个人身份信息的示例代码段.

我已经确定下划线是对正确OCR的阻碍 -如果我进去,截取子片段和手动白化线,通过pytesseract产生的OCR 非常好.但由于强调存在,它非常糟糕.

我怎样才能最好地删除这些水平线？我尝试过的:

开始使用OpenCV doc的演练:使用形态学操作提取水平和垂直线.卡住得很快,因为我知道零C++.
随后删除图像中的水平线 - 最后是一个难以理解的字符串.
接下来使用OpenCV从边缘图像中删除长水平/垂直线 - 在这里无法获得调整零数组的直觉.

用c ++标记这个问题,希望有人可以帮助将文档演练的第5步翻译成Python.我已经尝试了一批转换,例如Hugh Line Transform,但是我在图书馆和区域内的黑暗中感觉到我以前没有任何经验.

import cv2

# Inverted grayscale
img = cv2.imread('rsnippet.jpg', cv2.IMREAD_GRAYSCALE)
img = cv2.bitwise_not(img)

# Transform inverted grayscale to binary
th = cv2.adaptiveThreshold(img, 255, cv2.ADAPTIVE_THRESH_MEAN_C,
                            cv2.THRESH_BINARY, 15, -2)

# An alternative; Not sure if `th` or `th2` is optimal here
th2 = cv2.threshold(img, 170, 255, cv2.THRESH_BINARY)[1]

# Create corresponding structure element for horizontal lines.
# Start by cloning th/th2.
horiz = th.copy()
r, c = horiz.shape

# Lost after here - not understanding intuition behind sizing/partitioning

Run Code Online (Sandbox Code Playgroud)

Answer 1

dha*_*hka 9

到目前为止,所有答案似乎都是使用形态学操作.这有点不同.如果线是水平的,这应该给出相当好的结果.

为此,我使用下面显示的样本图像的一部分.

加载图像,将其转换为灰度并反转.

import cv2
import numpy as np
import matplotlib.pyplot as plt

im = cv2.imread('sample.jpg')
gray = 255 - cv2.cvtColor(im, cv2.COLOR_BGR2GRAY)

Run Code Online (Sandbox Code Playgroud)

倒置灰度图像:

如果在此反转图像中扫描一行,您将看到其轮廓看起来有所不同,具体取决于是否存在线条.

plt.figure(1)
plt.plot(gray[18, :] > 16, 'g-')
plt.axis([0, gray.shape[1], 0, 1.1])
plt.figure(2)
plt.plot(gray[36, :] > 16, 'r-')
plt.axis([0, gray.shape[1], 0, 1.1])

Run Code Online (Sandbox Code Playgroud)

绿色的配置文件是没有下划线的行,红色是带下划线的行.如果你取每个配置文件的平均值,你会发现红色配置文件的平均值更高.

因此,使用此方法可以检测下划线并将其删除.

for row in range(gray.shape[0]):
    avg = np.average(gray[row, :] > 16)
    if avg > 0.9:
        cv2.line(im, (0, row), (gray.shape[1]-1, row), (0, 0, 255))
        cv2.line(gray, (0, row), (gray.shape[1]-1, row), (0, 0, 0), 1)

cv2.imshow("gray", 255 - gray)
cv2.imshow("im", im)

Run Code Online (Sandbox Code Playgroud)

以下是检测到的红色下划线和清理后的图像.

tesseract清理图像的输出:

Convthed as th(
shot once in the
she stepped fr<
brother-in-lawii
collect on life in
applied for man
to the scheme i|

Run Code Online (Sandbox Code Playgroud)

现在应该清楚使用部分图像的原因.由于个人身份信息已在原始图像中删除,因此阈值不起作用.但是,当您将其应用于处理时,这应该不是问题.有时您可能需要调整阈值(16,0.9).

删除了部分字母并且仍然保留了一些微弱的线条,结果看起来不太好.如果我可以改进它会更新.

更新:

有些改进; 清理并链接字母的缺失部分.我对代码进行了评论,所以我相信这个过程很清楚.您还可以检查生成的中间图像以查看其工作原理.结果好一点.

tesseract清理图像的输出:

Convicted as th(
shot once in the
she stepped fr<
brother-in-law. ‘
collect on life ix
applied for man
to the scheme i|

Run Code Online (Sandbox Code Playgroud)

tesseract清理图像的输出:

)r-hire of 29-year-old .
revolver in the garage ‘
red that the victim‘s h
{2000 to kill her. mum
250.000. Before the kil
If$| 50.000 each on bin
to police.

Run Code Online (Sandbox Code Playgroud)

python代码:

import cv2
import numpy as np
import matplotlib.pyplot as plt

im = cv2.imread('sample2.jpg')
gray = 255 - cv2.cvtColor(im, cv2.COLOR_BGR2GRAY)
# prepare a mask using Otsu threshold, then copy from original. this removes some noise
__, bw = cv2.threshold(cv2.dilate(gray, None), 128, 255, cv2.THRESH_BINARY or cv2.THRESH_OTSU)
gray = cv2.bitwise_and(gray, bw)
# make copy of the low-noise underlined image
grayu = gray.copy()
imcpy = im.copy()
# scan each row and remove lines
for row in range(gray.shape[0]):
    avg = np.average(gray[row, :] > 16)
    if avg > 0.9:
        cv2.line(im, (0, row), (gray.shape[1]-1, row), (0, 0, 255))
        cv2.line(gray, (0, row), (gray.shape[1]-1, row), (0, 0, 0), 1)

cont = gray.copy()
graycpy = gray.copy()
# after contour processing, the residual will contain small contours
residual = gray.copy()
# find contours
contours, hierarchy = cv2.findContours(cont, cv2.RETR_CCOMP, cv2.CHAIN_APPROX_SIMPLE)
for i in range(len(contours)):
    # find the boundingbox of the contour
    x, y, w, h = cv2.boundingRect(contours[i])
    if 10 < h:
        cv2.drawContours(im, contours, i, (0, 255, 0), -1)
        # if boundingbox height is higher than threshold, remove the contour from residual image
        cv2.drawContours(residual, contours, i, (0, 0, 0), -1)
    else:
        cv2.drawContours(im, contours, i, (255, 0, 0), -1)
        # if boundingbox height is less than or equal to threshold, remove the contour gray image
        cv2.drawContours(gray, contours, i, (0, 0, 0), -1)

# now the residual only contains small contours. open it to remove thin lines
st = cv2.getStructuringElement(cv2.MORPH_ELLIPSE, (3, 3))
residual = cv2.morphologyEx(residual, cv2.MORPH_OPEN, st, iterations=1)
# prepare a mask for residual components
__, residual = cv2.threshold(residual, 0, 255, cv2.THRESH_BINARY)

cv2.imshow("gray", gray)
cv2.imshow("residual", residual)   

# combine the residuals. we still need to link the residuals
combined = cv2.bitwise_or(cv2.bitwise_and(graycpy, residual), gray)
# link the residuals
st = cv2.getStructuringElement(cv2.MORPH_ELLIPSE, (1, 7))
linked = cv2.morphologyEx(combined, cv2.MORPH_CLOSE, st, iterations=1)
cv2.imshow("linked", linked)
# prepare a msak from linked image
__, mask = cv2.threshold(linked, 0, 255, cv2.THRESH_BINARY)
# copy region from low-noise underlined image
clean = 255 - cv2.bitwise_and(grayu, mask)
cv2.imshow("clean", clean)
cv2.imshow("im", im)

Run Code Online (Sandbox Code Playgroud)

Answer 2

Tai*_*Tai 8

人们可以试试这个.

img = cv2.imread('img_provided_by_op.jpg', 0)
img = cv2.bitwise_not(img)  

# (1) clean up noises
kernel_clean = np.ones((2,2),np.uint8)
cleaned = cv2.erode(img, kernel_clean, iterations=1)

# (2) Extract lines
kernel_line = np.ones((1, 5), np.uint8)  
clean_lines = cv2.erode(cleaned, kernel_line, iterations=6)
clean_lines = cv2.dilate(clean_lines, kernel_line, iterations=6)

# (3) Subtract lines
cleaned_img_without_lines = cleaned - clean_lines
cleaned_img_without_lines = cv2.bitwise_not(cleaned_img_without_lines)

plt.imshow(cleaned_img_without_lines)
plt.show()
cv2.imwrite('img_wanted.jpg', cleaned_img_without_lines)

Run Code Online (Sandbox Code Playgroud)

演示

该方法基于Zaw Lin 的回答.他/她识别出图像中的线条,然后进行减法以消除它们.但是,我们不能在这里减去线条,因为我们有字母e,t,E,T,-包含线条!如果我们只是从图像中减去水平线,e将几乎与c相同.-将会消失......

问:我们如何找到线路？

为了找到线,我们可以利用erode功能.要使用erode,我们需要定义一个内核.(您可以将内核视为功能操作的窗口/形状.)

该内核通过图像滑动(如在2D卷积).只有当内核下的所有像素都是1时,原始图像中的像素(1或0)才会被视为1,否则它将被侵蚀(变为零).- (来源)

为了提取线条,我们定义了一个内核,kernel_line如np.ones((1, 5)),[1, 1, 1, 1, 1].此内核将滑过图像并侵蚀内核下具有0的像素.

更具体地说,当内核应用于一个像素时,它将捕获左侧的两个像素和右侧的两个像素.

 [X X Y X X]
      ^
      |
Applied to Y, `kernel_line` captures Y's neighbors. If any of them is not
0, Y will be set to 0.

Run Code Online (Sandbox Code Playgroud)

水平线将保留在此内核下,而没有水平邻居的像素将消失.这就是我们如何使用以下行捕获行.

clean_lines = cv2.erode(cleaned, kernel_line, iterations=6)

Run Code Online (Sandbox Code Playgroud)

问:我们如何避免在e,E,t,T和 - ？中提取线？

我们将结合erosion并dilation与迭代参数.

clean_lines = cv2.erode(cleaned, kernel_line, iterations=6)

Run Code Online (Sandbox Code Playgroud)

您可能已经注意到了这个iterations=6部分.此参数的效果将使e,E,t,T, - 中的平坦部分消失.这是因为虽然我们多次应用相同的操作,但这些线的边界部分将会缩小.(应用相同的内核,只有边界部分会满足0并且结果变为0.)我们使用这个技巧使这些字符中的线条消失.

然而,这带来了副作用,即我们想要摆脱的长下划线部分也会缩小.我们可以用它来成长dilate!

clean_lines = cv2.dilate(clean_lines, kernel_line, iterations=6)

Run Code Online (Sandbox Code Playgroud)

与缩小图像的侵蚀相反,扩张使图像更大.虽然我们仍然拥有相同的内核,kernel_line但如果内核下的任何部分为1,则目标像素将为1.应用此内容时,边界将重新生长.(如果我们仔细挑选参数使其在侵蚀部分消失,那么e,E,t,T中的部分将不会再生长.)

通过这个额外的技巧,我们可以成功地摆脱线条而不会伤害e,E,t,T和 - .

归档时间：	7 年，12 月前
查看次数：	3246 次
最近记录：	7 年，11 月前