MrV*_*ary 4 python opencv image image-processing python-3.x
TL;DR: How can I select a paragraph in an image without including the adjacent paragraphs above and below it?
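I have a set of scanned images of single-column text, such as this one. The images are all black and white, and they have already been rotated (deskewed), denoised, and had their surrounding whitespace trimmed.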
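What I want to do is split each such image into paragraphs. My initial idea was to measure the average brightness of each pixel row to find the gaps between text lines, then try to select a rectangle starting at that row, sized to match the indentation, and measure the brightness of that rectangle. But that seems rather cumbersome.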
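The row-brightness idea can be sketched roughly as follows. This is only an illustration on a synthetic array, not a tested solution for real scans: the function name `find_text_rows` and the `dark_threshold` value are assumptions, and a real page would likely need a threshold tuned to its noise level.

```python
import numpy as np

def find_text_rows(gray, dark_threshold=250):
    """Return inclusive (start, end) row ranges whose mean brightness
    is below dark_threshold (hypothetical helper, assumed threshold)."""
    row_means = gray.mean(axis=1)
    is_text = row_means < dark_threshold
    # Pad with False so every run of text rows has a rising and a falling edge
    padded = np.concatenate(([False], is_text, [False]))
    edges = np.diff(padded.astype(np.int8))
    starts = np.flatnonzero(edges == 1)
    ends = np.flatnonzero(edges == -1) - 1
    return list(zip(starts.tolist(), ends.tolist()))

# Synthetic 12-row "page": white background with two dark bands of text
page = np.full((12, 20), 255, dtype=np.uint8)
page[2:4, 3:18] = 0    # first text line, rows 2-3
page[7:10, 1:18] = 0   # second text line, rows 7-9
text_rows = find_text_rows(page)
print(text_rows)
```

The gaps between consecutive ranges are the candidate line separators. Slanted lines can make neighbouring ranges merge, which is the overlap problem described next.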
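Also, the text lines are sometimes slightly slanted (the vertical difference at the far ends is at most ≈ 10 px), so adjacent lines sometimes overlap. So I thought about selecting all the letters of a paragraph and using them to draw a text block. I got this using that approach, but I don't know how to proceed from there. Should I select each letter rectangle that starts n pixels from the left, and then try to include every rectangle whose left edge is not less than first_rectangle_x - offset? But then what?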
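One possible way to continue from the letter rectangles is to group them into text lines by vertical overlap and then compare the left edge of each line. The sketch below is only an assumption about how that could look: the `(x, y, w, h)` box format, the helper names, and the `min_overlap` value are all hypothetical, and the boxes are made up rather than taken from a real scan.

```python
def group_boxes_into_lines(boxes, min_overlap=0.5):
    """Group letter boxes (x, y, w, h) into text lines.

    A box joins the first line whose vertical extent overlaps it by at
    least min_overlap of the box height; otherwise it starts a new line.
    """
    lines = []
    for box in sorted(boxes, key=lambda b: b[1]):
        x, y, w, h = box
        for line in lines:
            overlap = min(line["bottom"], y + h) - max(line["top"], y)
            if overlap >= min_overlap * h:
                line["boxes"].append(box)
                line["top"] = min(line["top"], y)
                line["bottom"] = max(line["bottom"], y + h)
                break
        else:
            lines.append({"top": y, "bottom": y + h, "boxes": [box]})
    return lines

def line_left_edges(lines):
    """Left edge of each line: the smallest x among its letter boxes."""
    return [min(b[0] for b in line["boxes"]) for line in lines]

# Made-up letter boxes: an indented, slightly slanted first line
# followed by two lines that start at the left margin
boxes = [
    (40, 10, 8, 12), (52, 11, 8, 12), (64, 12, 8, 12),
    (10, 30, 8, 12), (22, 31, 8, 12),
    (10, 50, 8, 12), (22, 50, 8, 12),
]
lines = group_boxes_into_lines(boxes)
left_edges = line_left_edges(lines)
print(left_edges)
```

A line whose left edge sits clearly to the right of the others would then mark the start of a paragraph. Because each line's extent grows as boxes are added, strongly slanted lines could still merge with their neighbours, so the overlap fraction would need tuning.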
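This is specific to the paragraph structure of the attached image. I'm not sure whether you need a more general solution, but that would probably require additional work: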
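import cv2
import numpy as np

# Load the scanned page as a single-channel grayscale image
image = cv2.imread('paragraphs.png', cv2.IMREAD_GRAYSCALE)

# Find text lines by blurring the image horizontally and thresholding the row means
blur = cv2.blur(image, (91, 9))
b_mean = np.mean(blur, axis=1) / 256
# Alternative way to pick the threshold, from a histogram:
# hist, bin_edges = np.histogram(b_mean, bins=100)
# threshold = bin_edges[66]
threshold = np.percentile(b_mean, 66)
t = b_mean > threshold  # True for bright rows (gaps), False for dark rows (text)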

# Get the numbers of the image rows that contain text, i.e. rows whose
# blurred mean brightness is NOT above the threshold. A text line is a
# consecutive group of such rows, defined by its first and last row numbers.
tix = np.where(1 - t)
tix = tix[0]
lines = []
start_ix = tix[0]
for ix in range(1, tix.shape[0]):
    if tix[ix] == tix[ix-1] + 1:
        continue
    # identified a gap between lines: close the previous line and start a new one
    end_ix = tix[ix-1]
    lines.append([start_ix, end_ix])
    start_ix = tix[ix]
end_ix = tix[-1]
lines.append([start_ix, end_ix])
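
# For each text line, find the x coordinate of its first dark pixel (its left edge)
l_starts = []
max_x = min(500, image.shape[1])  # search at most the first 500 columns
for line in lines:
    xx = max_x
    for x in range(max_x):
        # include the last row of the line so the slice is never empty
        col = image[line[0]:line[1] + 1, x]
        if np.min(col) < 64:
            xx = x
            break
    l_starts.append(xx)
median_ls = np.median(l_starts)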
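
# A line that starts more than twice the median left edge to the right
# is treated as indented, i.e. as the first line of a new paragraph
paragraphs = []
p_start = lines[0][0]
for ix in range(1, len(lines)):
    if l_starts[ix] > median_ls * 2:
        p_end = lines[ix][0] - 10
        paragraphs.append([p_start, p_end])
        p_start = lines[ix][0]
# close the final paragraph at the bottom of the last text line
paragraphs.append([p_start, lines[-1][1]])

# Draw a rectangle around each paragraph and save the result
p_img = np.array(image)
n_cols = p_img.shape[1]
for paragraph in paragraphs:
    cv2.rectangle(p_img, (5, paragraph[0]), (n_cols - 5, paragraph[1]), (128, 128, 0), 5)
cv2.imwrite('paragraphs_out.png', p_img)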
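
To see the indentation rule in isolation, here is a small sketch that applies it to made-up numbers instead of an image. The function name `split_paragraphs`, the `indent_factor` parameter, and the sample coordinates are assumptions for illustration. Unlike the code above, it ends each paragraph at the bottom of the previous line rather than 10 px above the next one.

```python
from statistics import median

def split_paragraphs(lines, left_edges, indent_factor=2.0):
    """Split text lines into paragraphs at indented lines.

    lines:      list of (top, bottom) row ranges, one per text line
    left_edges: x coordinate of the first dark pixel of each line
    A line starts a new paragraph when its left edge exceeds
    indent_factor times the median left edge.
    """
    if not lines:
        return []
    threshold = median(left_edges) * indent_factor
    paragraphs = []
    p_start = lines[0][0]
    for i in range(1, len(lines)):
        if left_edges[i] > threshold:
            # close the current paragraph at the bottom of the previous line
            paragraphs.append((p_start, lines[i - 1][1]))
            p_start = lines[i][0]
    # close the final paragraph at the bottom of the last line
    paragraphs.append((p_start, lines[-1][1]))
    return paragraphs

# Five made-up text lines; the first and fourth are indented
sample_lines = [(10, 30), (40, 60), (70, 90), (100, 120), (130, 150)]
sample_edges = [60, 20, 20, 60, 20]
result = split_paragraphs(sample_lines, sample_edges)
print(result)
```

Note that a multiplicative threshold depends on the margin width: if the whitespace has been trimmed so tightly that the median left edge is close to 0, nearly every line would count as indented, and an additive offset might work better.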
[Figure: input and output]