How to recognize images within scanned PDF files?

Question

iOS*_*ner 3 python ocr machine-learning image-processing computer-vision

I am trying to identify images (as opposed to text) within scanned PDF files, ideally using Python. Is there any way to do this? As a simple example, say you've scanned a chapter of a book. Each page falls into one of three categories:

1. Contains text only
2. Contains an image only (or multiple)
3. Contains both text and image(s)

I would like to output a list of the pages that fall into category 2 or 3.

Answer 1

Mar*_*ell 7

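My idea is to look for features that would not occur in normal text, such as vertical black elements that span several lines. My tool of choice is ImageMagick, which is installed on most Linux distros and is available for macOS and Windows. I just run it at the command prompt in the terminal.

So, I would use this command. Note that in the results I have placed the original page to the left of the processed page and put a red border around them, purely for illustration: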

magick page-28.png -alpha off +dither -colors 2 -colorspace gray -normalize -statistic median 1x200 result.png

I get this:

[Image: page-25.png]

[Image: page-26.png]

[Image: page-27.png]

[Image: page-28.png]

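Explanation of the above command...

In the command above, rather than thresholding, I reduce the image to 2 colours, then convert to greyscale, then normalise. The colour reduction should pick black and the background colour as its two colours, and after conversion to greyscale and normalisation these become pure black and pure white.

I then apply a median filter with a structuring element 1 pixel wide and 200 pixels tall, which is taller than a few lines of text, so it should pick out tall features, that is, vertical lines.

End of explanation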
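To see why a tall, one-pixel-wide median keeps long vertical strokes and erases text, here is a minimal pure-Python sketch on a toy black-and-white grid. It is not ImageMagick's implementation: the function names, the tiny grid, the window height of 7 and the truncated windows at the top and bottom edges are all assumptions made for illustration.

```python
from statistics import median_high


def vertical_median(image, height):
    """Median-filter each column with a window 1 pixel wide and `height` tall.

    `image` is a list of rows holding 0 (black) or 255 (white). Windows are
    truncated at the top and bottom edges; that is a simplification and may
    not match how ImageMagick treats borders.
    """
    rows, cols = len(image), len(image[0])
    half = height // 2
    out = [[255] * cols for _ in range(rows)]
    for x in range(cols):
        column = [image[y][x] for y in range(rows)]
        for y in range(rows):
            window = column[max(0, y - half): y + half + 1]
            out[y][x] = median_high(window)
    return out


def count_dark(image):
    """Count black pixels that are present in the image."""
    return sum(value == 0 for row in image for value in row)


# Toy page: 20 rows by 4 columns, all white to start with.
page = [[255] * 4 for _ in range(20)]
for y in (3, 4, 10, 11):      # two short strokes, like two lines of text
    page[y][0] = 0
for y in range(5, 15):        # one tall stroke, like the edge of a picture
    page[y][2] = 0

filtered = vertical_median(page, height=7)
print(count_dark(page), count_dark(filtered))  # prints: 14 10
```

The 4 "text" pixels vanish because no 7-pixel window around them is mostly black, while the 10 pixels of the tall stroke survive.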

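Carrying on...

So, if I invert the image so that black becomes white and white becomes black, then take the mean and multiply it by the total number of pixels in the image, that tells me how many pixels are part of vertical features: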

convert page-28.png -alpha off +dither -colors 2 -colorspace gray -normalize -statistic median 1x200 -negate -format "%[fx:mean*w*h]" info:
90224

convert page-27.png -alpha off +dither -colors 2 -colorspace gray -normalize -statistic median 1x200 -negate -format "%[fx:mean*w*h]" info:
0

So page 28 is not pure text, whereas page 27 is.
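Since the question asks for Python, the same check could be driven from a script. The following is a rough sketch, not a tested pipeline: the function names, the `magick` executable name, the page file names and the zero threshold are assumptions, and `vertical_pixel_count` only works where ImageMagick is installed.

```python
import subprocess


def feature_command(page_png, kernel="1x200", executable="magick"):
    """Build the argument list for the pixel-count command shown above."""
    return [
        executable, page_png,
        "-alpha", "off", "+dither", "-colors", "2",
        "-colorspace", "gray", "-normalize",
        "-statistic", "median", kernel,
        "-negate", "-format", "%[fx:mean*w*h]", "info:",
    ]


def vertical_pixel_count(page_png):
    """Run ImageMagick on one page image and return the reported count."""
    result = subprocess.run(
        feature_command(page_png),
        capture_output=True, text=True, check=True,
    )
    return float(result.stdout.strip())


def pages_with_images(counts, threshold=0):
    """Return sorted page numbers whose count is above `threshold`.

    `counts` maps page number to pixel count. The default threshold of 0
    is an assumption; noisy scans may need a higher cut-off.
    """
    return sorted(page for page, count in counts.items() if count > threshold)


# The two counts reported above for pages 27 and 28.
print(pages_with_images({27: 0, 28: 90224}))  # prints: [28]
```

On real files, the dictionary could be filled with something like `{n: vertical_pixel_count(f"page-{n}.png") for n in range(25, 29)}`.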

Here are some tips...

Tip

You can find out how many pages a PDF has like this, though there are probably faster ways:

convert -density 18 book.pdf info:

Tip

You can extract a single page of a PDF like this (ImageMagick counts the index in square brackets from zero):

convert -density 288 book.pdf[25] page-25.png
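If you want to extract several pages from Python, one option is to build the same command once per page. This is only a sketch: the helper names, the `magick` executable name and the `page-N.png` output naming are assumptions, and `extract_pages` needs ImageMagick (which typically relies on Ghostscript for PDFs) to be installed.

```python
import subprocess


def extract_command(pdf_path, index, density=288, executable="magick"):
    """Build the arguments that render one PDF page to a PNG file.

    `index` is the number placed in square brackets after the file name;
    ImageMagick counts these from zero.
    """
    return [
        executable, "-density", str(density),
        f"{pdf_path}[{index}]", f"page-{index}.png",
    ]


def extract_pages(pdf_path, indices):
    """Render each requested page by running the command above."""
    for index in indices:
        subprocess.run(extract_command(pdf_path, index), check=True)
```

For example, `extract_pages("book.pdf", range(25, 29))` would write `page-25.png` through `page-28.png`.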

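Tip

If you are processing several books, you will probably want to normalise the images so that they are all, say, 1000 pixels tall. The size of the structuring element used for the median should then be fairly consistent from book to book.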
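One way to apply that normalisation is to resize each page to a fixed height before filtering. The sketch below only builds the argument list; placing `-resize` first, the helper name and the `magick` executable name are assumptions, and `kernel_height` is deliberately left without a default because a suitable value for 1000-pixel pages has to be found by experiment.

```python
def normalised_command(page_png, kernel_height, target_height=1000,
                       executable="magick"):
    """Build the pixel-count command with the page resized to a fixed height.

    `-resize x1000` scales the page to 1000 pixels tall and keeps the
    aspect ratio. The reported count is then in resized pixels.
    """
    return [
        executable, page_png,
        "-resize", f"x{target_height}",
        "-alpha", "off", "+dither", "-colors", "2",
        "-colorspace", "gray", "-normalize",
        "-statistic", "median", f"1x{kernel_height}",
        "-negate", "-format", "%[fx:mean*w*h]", "info:",
    ]
```

The returned list can be passed to `subprocess.run` in the usual way.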
