Finding PDFs with no text

Question

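I have many folders containing lots of PDFs, and I would like to run OCR on the files that have no text layer. So first, I need to find them. I thought a pipeline with pdfgrep might do the job, but I'm lost.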

How can I find the PDFs that have no text?

Answer 1

Yes, using pdfgrep sounds like a good idea. Something like:

find . -name '*.[Pp][Dd][Ff]' -type f \
  ! -exec pdfgrep -q '\w' {} ';' -print

would report the list of PDF files in which pdfgrep cannot find any word character (alphanumerics or underscore).

(With some find implementations, you can use -iname '*.pdf' instead of the -name '*.[Pp][Dd][Ff]' above. Note that -iname assumes the file names are valid text in the current locale.)
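
One caveat with negating pdfgrep -q: as far as I know, pdfgrep follows grep's convention of exiting with status 1 for "no match" and 2 for an error, so a corrupt or password-protected PDF would pass the ! test too and end up in the list. Below is a minimal sketch that keeps the two cases apart, assuming that exit-status convention holds for your pdfgrep version. The name classify_pdf is made up for this illustration; it is not part of pdfgrep.

```shell
# classify_pdf is a hypothetical helper, not a pdfgrep feature.
# It prints "text", "notext" or "error" for the PDF given as $1.
# Assumption: pdfgrep exits 0 on a match, 1 on no match, 2 or more on error.
classify_pdf() {
  status=0
  pdfgrep -q '\w' "$1" 2>/dev/null || status=$?
  case $status in
    0) echo text ;;    # at least one word character was found
    1) echo notext ;;  # file was read, but no word character was found
    *) echo error ;;   # pdfgrep could not process the file
  esac
}

# Example use: list only the files that were read but contain no text.
#   for f in ./*.pdf; do
#     [ "$(classify_pdf "$f")" = notext ] && printf '%s\n' "$f"
#   done
```

Files reported as "error" are probably worth inspecting by hand before sending them to OCR.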

To find the files that have fewer than 1000 word characters:

find . -name '*.[Pp][Dd][Ff]' -type f -exec sh -c '
  for file do
    [ "$(pdfgrep -c "\w" "$file")" -lt 1000 ] &&
      printf "%s\n" "$file"
  done' sh {} +
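
The same check can be wrapped in a function with an adjustable threshold. This is only a sketch: pdf_below_threshold is a hypothetical name, and the default of 1000 simply mirrors the command above. It also guards against the case where pdfgrep prints nothing (for instance because it could not read the file), which would otherwise make the [ ... -lt ... ] test fail with an error.

```shell
# pdf_below_threshold is a hypothetical helper, not part of pdfgrep.
# Returns 0 if the PDF in $1 has fewer than $2 (default 1000) word characters,
# 1 if it has at least that many, and 2 if no usable count was obtained.
pdf_below_threshold() {
  limit=${2:-1000}
  # With -c, pdfgrep may exit non-zero when the count is 0, so ignore the status
  # and validate the output instead.
  count=$(pdfgrep -c '\w' "$1" 2>/dev/null) || :
  case $count in
    '' | *[!0-9]*) return 2 ;;  # empty or non-numeric output: treat as an error
  esac
  [ "$count" -lt "$limit" ]
}

# Example use:
#   pdf_below_threshold ./scan.pdf 200 && echo "probably needs OCR"
```

A low but non-zero count often comes from scans that carry only a stamped header or page number as text, so the right limit depends on the documents at hand.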
