How to check whether a PDF is a scanned image or contains text

Question

How to check whether a PDF is a scanned image or contains text

Jin*_*eph 21 python python-3.x pdf-extraction pdfminer pypdf2

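I have a large number of files. Some of them are scanned images saved as PDF, and some are full/partial text PDFs.

Is there a way to check these files so that we only process the ones that are scanned images, and not the full/partial text PDFs?

Environment: Python 3.6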

Answer 1

Rah*_*wal 15

The code below will work to extract text data from both searchable and non-searchable PDFs.

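import fitz  # provided by the PyMuPDF package

text = ""
path = "Your_scanned_or_partial_scanned.pdf"

doc = fitz.open(path)
for page in doc:
    # get_text() is the current name; older PyMuPDF releases used getText()
    text += page.get_text()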

If you don't have the fitz module, install it with:

pip install --upgrade pymupdf

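Thanks for the reply, but my question is: if a user uploads a PDF document, how do I check whether it is a scanned document or a text document? @Rahul Agarwal (4 upvotes)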
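
To turn plain text extraction into the check the comment asks for, one option is to look at how many pages yield any text at all. The following is a minimal sketch, not part of the original answer: the function names `classify_pages` and `classify_pdf`, the labels, and the `min_chars` threshold are all assumptions made for illustration. Note that a scanned PDF that already carries an OCR text layer would be reported as "text" by this approach.

```python
def classify_pages(page_texts, min_chars=1):
    """Classify a document from the text extracted for each page.

    page_texts: list of strings, one per page.
    Returns "text" if every page has text, "scanned" if none does,
    "mixed" otherwise, and "empty" for a document without pages.
    (Labels and the min_chars threshold are illustrative assumptions.)
    """
    if not page_texts:
        return "empty"
    has_text = [len(t.strip()) >= min_chars for t in page_texts]
    if all(has_text):
        return "text"
    if not any(has_text):
        return "scanned"
    return "mixed"


def classify_pdf(path):
    """Extract per-page text with PyMuPDF and classify the file."""
    # Imported here so classify_pages can be used without PyMuPDF installed.
    import fitz  # provided by the PyMuPDF package

    with fitz.open(path) as doc:
        return classify_pages([page.get_text() for page in doc])
```

For example, `classify_pdf("upload.pdf")` (a hypothetical file name) would return one of the four labels, and only files labelled "scanned" would be passed on for further processing.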

Answer 2

Vit*_*ile 13

Building on Rahul Agarwal's solution, together with some snippets I found at this link, here is a possible algorithm to solve your problem.

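You need to install the PyMuPDF package, which provides the fitz module (the separate fitz package on PyPI is an unrelated project and should not be installed alongside it). You can do that via pip:

pip3 install PyMuPDF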

Here is the Python 3 implementation:

[code block missing in this copy of the answer]

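Although this answers your question (i.e. distinguishing fully scanned PDFs from full/partial text PDFs), the solution cannot distinguish a full-text PDF from a scanned PDF that also has text in it.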
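
The implementation itself is not shown here, so the following is only a sketch of one way such a check could work, and should not be read as the answer's original code. It assumes the idea is to measure what fraction of the page area is covered by text blocks and to treat a very low fraction as a sign of a scanned document. The function names and any threshold applied to the result are assumptions.

```python
def text_coverage(pages):
    """Fraction of total page area covered by text blocks.

    pages: iterable of (page_area, [block_area, ...]) tuples.
    Returns 0.0 for a document without pages.
    """
    total_page_area = 0.0
    total_text_area = 0.0
    for page_area, block_areas in pages:
        total_page_area += page_area
        total_text_area += sum(block_areas)
    return total_text_area / total_page_area if total_page_area else 0.0


def pdf_text_coverage(path):
    """Collect page and text-block areas with PyMuPDF, then compute coverage."""
    import fitz  # provided by the PyMuPDF package

    pages = []
    with fitz.open(path) as doc:
        for page in doc:
            areas = []
            # Each block is (x0, y0, x1, y1, text, block_no, block_type).
            for block in page.get_text("blocks"):
                if block[6] == 0:  # 0 marks a text block, 1 an image block
                    areas.append((block[2] - block[0]) * (block[3] - block[1]))
            pages.append((page.rect.width * page.rect.height, areas))
    return text_coverage(pages)
```

A caller might then treat a file as probably scanned when `pdf_text_coverage(path)` falls below some small cut-off such as 0.01. That cut-off is an arbitrary illustration and would need tuning on real documents.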

Answer 3

小智 10

def get_pdf_searchable_pages(fname):
    # pip install pdfminer
    from pdfminer.pdfpage import PDFPage
    searchable_pages = []
    non_searchable_pages = []
    page_num = 0
    with open(fname, 'rb') as infile:

        for page in PDFPage.get_pages(infile):
            page_num += 1
            if 'Font' in page.resources.keys():
                searchable_pages.append(page_num)
            else:
                non_searchable_pages.append(page_num)
    if page_num > 0:
        if len(searchable_pages) == 0:
            print(f"Document '{fname}' has {page_num} page(s). "
                  f"Complete document is non-searchable")
        elif len(non_searchable_pages) == 0:
            print(f"Document '{fname}' has {page_num} page(s). "
                  f"Complete document is searchable")
        else:
            print(f"searchable_pages : {searchable_pages}")
            print(f"non_searchable_pages : {non_searchable_pages}")
    else:
        print(f"Not a valid document")


if __name__ == '__main__':
    get_pdf_searchable_pages("1.pdf")
    get_pdf_searchable_pages("1Scanned.pdf")

Output:

Document '1.pdf' has 1 page(s). Complete document is searchable
Document '1Scanned.pdf' has 1 page(s). Complete document is non-searchable

Answer 4

小智 9

Try OCRmyPDF. You can use this command to convert a scanned PDF into a digital PDF.

ocrmypdf input_scanned.pdf output_digital.pdf

If the input PDF is digital, the command fails with the error "PriorOcrFoundError: page already has text!".

import subprocess as sp
import re

output = sp.getoutput("ocrmypdf input.pdf output.pdf")
if not re.search("PriorOcrFoundError: page already has text!",output):
   print("Uploaded scanned pdf")
else:
   print("Uploaded digital pdf")
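
Matching on the error message treats every other failure (a missing `ocrmypdf` binary, an unreadable input file) as "scanned". A slightly more defensive sketch checks the process exit code instead. It assumes that OCRmyPDF reports the "page already has text" condition with exit code 6, as listed in its documentation; the function names and labels are hypothetical.

```python
import subprocess

# Exit code OCRmyPDF documents for "file already appears to contain text".
# Treat this value as an assumption and verify it against your installed version.
ALREADY_HAS_TEXT = 6


def interpret_exit_code(returncode):
    """Map an ocrmypdf exit code to a label."""
    if returncode == 0:
        return "scanned"   # OCR ran, so no prior text layer was found
    if returncode == ALREADY_HAS_TEXT:
        return "digital"   # ocrmypdf refused because text is already present
    return "error"         # anything else: bad input, missing dependency, ...


def check_with_ocrmypdf(input_pdf, output_pdf):
    """Run ocrmypdf and classify the input from its exit code.

    Raises FileNotFoundError if the ocrmypdf executable is not installed.
    """
    result = subprocess.run(
        ["ocrmypdf", input_pdf, output_pdf],
        stdout=subprocess.PIPE,
        stderr=subprocess.PIPE,
    )
    return interpret_exit_code(result.returncode)
```

Keep in mind that this approach runs a full OCR pass on every scanned file, so it is expensive if all you need is the classification.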

Answer 5

m.b*_*han 6

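You can use pdfplumber. If the code below prints None (or, depending on the pdfplumber version, an empty string), the PDF is scanned; otherwise it is searchable.

pip install pdfplumber

import pdfplumber

file_name = "input.pdf"  # placeholder path to the PDF to check
with pdfplumber.open(file_name) as pdf:
    page = pdf.pages[0]  # only the first page is inspected
    text = page.extract_text()
    print(text)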

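To extract text from a scanned PDF, you can use OCRmyPDF. It is a very simple package and a one-line solution. You can find more about the package here, and a video explaining an example here. If this helped, please upvote the answer. Good luck!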

Answer 6

Ext*_*com 5

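How about checking the PDF metadata for '/Resources'?!

I believe that any text in a PDF (an electronic document) is very likely to come with a font, especially since PDF is meant to produce portable files and therefore keeps the font definitions.

If you are a PyPDF2 user, try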

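import PyPDF2

input_file_location = "input.pdf"  # placeholder path to the PDF to inspect
page_num = 0  # zero-based index of the page to check

# PyPDF2 3.x renamed these to PyPDF2.PdfReader(...) and pdf_reader.pages[page_num]
pdf_reader = PyPDF2.PdfFileReader(input_file_location)
page_data = pdf_reader.getPage(page_num)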

page_resources = page_data["/Resources"]

if "/Font" in page_resources:
    print(
        "[Info]: Looks like there is text in the PDF, contains:",
        page_resources.keys(),
    )
elif len(page_resources.get("/XObject", {})) != 1:
    print("[Info]: PDF Contains:", page_resources.keys())

    x_object = page_resources.get("/XObject", {})

    for obj in x_object:
        obj_ = x_object[obj]
        if obj_["/Subtype"] == "/Image":
            print("[Info]: PDF is image only")

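In general, this is not a good solution. If a PDF contains text, it must contain fonts, but a PDF that contains no text can still contain fonts. The presence of fonts suggests that there is text, but does not guarantee it. (2 upvotes)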

Answer 7

Joh*_*ter 5

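I created a script to detect whether a PDF has been OCRd. The main idea: in OCRd PDFs, the text is invisible.

Algorithm to test whether a given PDF (f1) has been OCRd:

1. Create a copy of f1, denoted f2.
2. Delete all text in f2.
3. Create images (PNG) for all (or only a few) pages of f1 and f2.
4. f1 is OCRd if all the images of f1 and f2 are identical.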
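
The four steps above can be sketched in Python as follows. This is an illustrative outline rather than the author's script: it assumes the `gs` and `mutool` command-line tools are installed, reuses the Ghostscript `-dFILTERTEXT` option from the script, and compares the rendered PNG files byte for byte as a stand-in for a true pixel comparison. The function names and the file naming scheme are hypothetical.

```python
import subprocess
from pathlib import Path


def build_commands(f1, workdir):
    """Commands for steps 1-3: copy f1, strip the copy's text, render both."""
    work = Path(workdir)
    tmp1, tmp2 = work / "tmp1.pdf", work / "tmp2.pdf"
    return [
        # Normalise f1 through Ghostscript first, as the script does.
        ["gs", "-o", str(tmp1), "-sDEVICE=pdfwrite", str(f1)],
        # f2: the same document with all text filtered out.
        ["gs", "-o", str(tmp2), "-sDEVICE=pdfwrite", "-dFILTERTEXT", str(tmp1)],
        # One PNG per page for each of the two documents.
        ["mutool", "convert", "-o", str(work / "f1-%d.png"), str(tmp1)],
        ["mutool", "convert", "-o", str(work / "f2-%d.png"), str(tmp2)],
    ]


def all_pages_identical(workdir):
    """Step 4: True if every f1 page image equals its f2 counterpart."""
    work = Path(workdir)
    originals = sorted(work.glob("f1-*.png"))
    if not originals:
        raise ValueError("no rendered pages found in %s" % work)
    for page in originals:
        stripped = work / page.name.replace("f1-", "f2-", 1)
        if page.read_bytes() != stripped.read_bytes():
            return False
    return True


def is_ocrd(f1, workdir):
    """Run the external tools, then compare the rendered pages."""
    for cmd in build_commands(f1, workdir):
        subprocess.run(cmd, check=True,
                       stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL)
    return all_pages_identical(workdir)
```

Splitting command construction from execution keeps the comparison logic usable on its own, for example on page images rendered by a different tool.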

https://github.com/jfilter/pdf-scripts/blob/master/is_ocrd_pdf.sh

#!/usr/bin/env bash
set -e
set -x

################################################################################
# Check if a PDF was scanned or created digitally, works on OCRd PDFs
#
# Usage:
#   bash is_scanned_pdf.sh [-p] file
#
#   Exit 0: Yes, file is a scanned PDF
#   Exit 99: No, file was created digitally
#
# Arguments:
#   -p or --pages: pos. integer, only consider first N pages
#
# Please report issues at https://github.com/jfilter/pdf-scripts/issues
#
# GPLv3, Copyright (c) 2020 Johannes Filter
################################################################################

# parse arguments
# h/t https://stackoverflow.com/a/33826763/4028896
max_pages=-1
# skip over positional argument of the file(s), thus -gt 1
while [[ "$#" -gt 1 ]]; do
  case $1 in
  -p | --pages)
    max_pages="$2"
    shift
    ;;
  *)
    echo "Unknown parameter passed: $1"
    exit 1
    ;;
  esac
  shift
done

command_exists() {
  if ! command -v "$1" &>/dev/null; then
    echo "error: $1 is not installed." >&2
    exit 1
  fi
}

command_exists mutool && command_exists gs && command_exists compare
command_exists pdfinfo

# resolve the input to an absolute path, because we cd into a temp dir below
orig=$PWD
case "$1" in
/*) input="$1" ;;
*) input="$orig/$1" ;;
esac

num_pages=$(pdfinfo "$input" | grep Pages | awk '{print $2}')

echo $num_pages

echo $max_pages

# only consider the first N pages if -p was given
if ((max_pages > 0 && max_pages < num_pages)); then
  num_pages=$max_pages
fi

cd "$(mktemp -d)"

for ((i = 1; i <= num_pages; i++)); do
  mkdir -p output/$i && echo $i
done

# important to filter text on output of GS (tmp1), cuz GS alters input PDF...
gs -o tmp1.pdf -sDEVICE=pdfwrite -dLastPage=$num_pages "$input" &>/dev/null
gs -o tmp2.pdf -sDEVICE=pdfwrite -dFILTERTEXT tmp1.pdf &>/dev/null
mutool convert -o output/%d/1.png tmp1.pdf 2>/dev/null
mutool convert -o output/%d/2.png tmp2.pdf 2>/dev/null

for ((i = 1; i <= num_pages; i++)); do
  echo $i
  # difference in pixels, if 0 there are the same pictures
  # discard diff image
  if ! compare -metric AE output/$i/1.png output/$i/2.png null: 2>&1; then
    echo " pixels difference, not a scanned PDF, mismatch on page $i"
    exit 99
  fi
done

Archived:	6 years, 10 months ago
Views:	15725
Last activity:	4 years, 5 months ago