Verifying/validating the integrity of PDF files

Question

Adm*_*ral 14 pdf validation file-corruption

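Is there any tool that can be run over a PDF archive (all directories) and, at the end, list or identify the corrupted/invalid PDFs?

I have hundreds of PDF files (documentation and the like) on my computer (a Windows machine), and I regularly receive, or have to send, dozens of PDFs by email. It has become routine for the PDFs I receive or send to be corrupted. This sometimes causes serious trouble when the source file (for example, a Word or TeX file) is lost or not immediately available.

Checking these thousands of PDFs by hand in a limited time is not possible, so I searched for a tool that I could run once, that would scan all PDFs (in directories and subdirectories), and that would finally give me a list of the files that should be recreated. So far, there does not seem to be such a tool.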

Answer 1

Mub*_*hid 7

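It is easy to check whether a PDF file is valid using PDFtk. PDF Labs provides a free GUI for PDFtk. When you run this tool, you can load as many PDFs as you like from multiple directories (using the "Add Files" button), and it then starts accessing the pages in those PDF files very quickly.

If any of the selected files is not a valid PDF, the utility shows an error message about it and automatically removes it from the selection window.

So this PDFtk procedure can save you a lot of time. In addition, if you have a multi-core CPU, you can run several instances of the utility and load hundreds of PDFs into each one.

I have been using this software since last year, and it is the most convenient PDF tool I have used.

Alternatively, with the tool (pdfinfo.exe) available from the link marcwho mentioned, you can `cd` into `FolderContainingPDFs` and run the following command in the Windows shell; it flags invalid PDF files in a log file: `FORFILES /S /M *.pdf /C "cmd /c echo. & echo @path @fname & D:\XPDF_3.04\bin64\pdfinfo.exe @file" 1>text.txt 2>&1` (2 upvotes)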

Answer 2

小智 5

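I have used "pdfinfo.exe" from the xpdfbin-win package, as well as cpdf.exe, to check PDF files for corruption, but I did not want to involve binaries unless necessary.

I read that newer PDF formats have a readable XML data catalog at the end, so I opened a PDF in plain Windows NOTEPAD.exe, scrolled past the unreadable data to the end, and saw several readable keys. I only needed one key, but chose to use both CreationDate and ModDate.

The following PowerShell (PS) script checks all PDF files in the current directory and writes the status of each file to a text file (!RESULTS.log). Running it against 35,000 PDF files took about 2 minutes. I have tried to add comments for those unfamiliar with PS. I hope this saves someone some time. There may be a better way to do this, but it works flawlessly for my purposes and handles errors silently. If you see errors on screen, you may need to define the following at the start: $ErrorActionPreference = "SilentlyContinue"

Copy the following into a text file and name it appropriately (e.g. CheckPDF.ps1), or open PS, browse to the directory containing the PDF files to check, and paste it into the console.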

#
# PowerShell v4.0
#
# Get all PDF files in current directory
#
$items = Get-ChildItem | Where-Object {$_.Extension -eq ".pdf"}

$logFile = "!RESULTS.log"
$badCounter = 0
$goodCounter = 0
$msg = "`n`nProcessing " + $items.count + " files... "
Write-Host -nonewline -foregroundcolor Yellow $msg
foreach ($item in $items)
{
    #
    # Suppress error messages
    #
    trap { Write-Output "Error trapped"; continue; }

    #
    # Read raw PDF data
    #
    $pdfText = Get-Content -LiteralPath $item.FullName -Raw

    #
    # Find string (near end of PDF file); IndexOf returns -1 if the key is missing
    #
    $ptr1 = $pdfText.IndexOf("CreationDate")
    $ptr2 = $pdfText.IndexOf("ModDate")

    #
    # Grab raw dates from file - SubString throws if ptr is -1
    #
    try { $cDate = $pdfText.SubString($ptr1, 37); $mDate = $pdfText.SubString($ptr2, 31); }

    #
    # Append filename and bad status to logfile and increment a counter
    # catch block is also where you would rename, move, or delete bad files.
    #
    catch { "*** $item is Broken ***" >> $logFile; $badCounter += 1; continue; }

    #
    # Append filename and good status to logfile
    #
    Write-Output "$item - OK" -EA "Stop" >> $logFile

    #
    # Increment a counter
    #
    $goodCounter += 1
}
#
# Calculate total
#
$totalCounter = $badCounter + $goodCounter

#
# Append 3 blank lines to end of logfile
#
1..3 | %{ Write-Output "" >> $logFile }

#
# Append statistics to end of logfile
#
Write-Output "Total: $totalCounter / BAD: $badCounter / GOOD: $goodCounter" >> $logFile
Write-Output "DONE!`n`n"
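
For readers on a Unix-like shell, the same key-search heuristic can be sketched in bash. This is only a rough sketch, not the answer author's script: `has_date_keys` and `check_pdf_dir` are hypothetical helper names, and it assumes a `grep` that supports `-a`. Note that the heuristic itself can misfire, since a valid PDF is not required to carry readable CreationDate or ModDate keys.

```shell
#!/usr/bin/env bash
# Heuristic only: look for the CreationDate and ModDate keys, as the
# PowerShell script does. A valid PDF may legitimately lack both keys.

# has_date_keys FILE (hypothetical helper)
# Returns 0 if both keys occur somewhere in FILE, non-zero otherwise.
has_date_keys() {
    # -a treats binary data as text, -q suppresses output
    grep -aq -e 'CreationDate' -- "$1" && grep -aq -e 'ModDate' -- "$1"
}

# check_pdf_dir [DIR] (hypothetical helper)
# Prints one status line per *.pdf file in DIR (default: current directory).
check_pdf_dir() {
    local dir="${1:-.}" f
    for f in "$dir"/*.pdf; do
        [ -e "$f" ] || continue   # the glob matched nothing
        if has_date_keys "$f"; then
            printf '%s - OK\n' "$f"
        else
            printf '*** %s is Broken ***\n' "$f"
        fi
    done
}
```

After sourcing the file, something like `check_pdf_dir /path/to/pdfs > results.log` would produce a log similar in spirit to !RESULTS.log (the path and log name here are placeholders).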

Answer 3

wp7*_*8de 5

Following in @n0nuf's footsteps, I wrote a batch script that checks all PDFs in a given folder with pdfinfo and, if one is damaged, pushes it to cpdf to try to repair it:

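@ECHO OFF
REM Make sure the backup folder for broken originals exists
IF NOT EXIST bak MKDIR bak
FOR %%f in (*.PDF) DO (
    echo %%f
    REM findstr sets errorlevel 0 when pdfinfo's output contains "error"
    pdfinfo "%%f" 2>&1 | findstr /I "error"  >nul 2>&1
    if not errorlevel 1 (
        echo "bad -> try to fix"
        @cpdf -i "%%f" -o "%%f_.pdf" 2>NUL
        move "%%f" ".\bak\%%f"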
    ) else (
       REM echo good        
    )
)
@ECHO ON

Or the same as a bash script:

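# NUL-delimited find output keeps file names with spaces intact
find . -iname "*.pdf" -print0 | while IFS= read -r -d '' file
do
    echo "$file"
    # grep -q exits 0 when pdfinfo's output mentions an error
    if pdfinfo "$file" 2>&1 | grep -qi 'error'; then
       echo "broken -> try to fix"
       cpdf -i "$file" -o "$file"_.pdf
    fi
done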

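Broken PDFs are moved to the \bak subfolder, and the recreated PDFs get the suffix _.pdf (this is not perfect, but it is good enough for me). Note: the recreated PDFs contain fewer errors and should be viewable with a regular PDF viewer. That does not mean you get all the content back, though. Unrecoverable content results in blank pages.

Following @kraftydevil's suggestion in "Check whether a PDF file is corrupted using the command line on Linux", I also tried the same with JHOVE (an open-source file format identification, validation, and characterization tool), and I can now confirm that this is also a valid approach. (At first my success rate was lower, but then I noticed that I was not handling JHOVE's output correctly.)

To test both methods, I used a text editor to delete and alter random parts of PDFs (I removed streams so that pages could no longer be rendered in my PDF viewer, changed PDF tags, and shifted some bits around). The result: both pdfinfo and JHOVE correctly detected the damaged files (JHOVE was even more sensitive in some cases).

Here is the equivalent script for JHOVE: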
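
If you want to reproduce this kind of test without hand-editing files, one simple option is to cut a copy of a known-good PDF short. This is a swapped-in form of damage (truncation), not the stream and tag edits described above, and `make_truncated_copy` is a hypothetical helper name; treat it as a rough sketch.

```shell
#!/usr/bin/env bash
# make_truncated_copy SRC DEST [PERCENT] (hypothetical helper)
# Writes the first PERCENT percent (default 50) of SRC's bytes to DEST,
# leaving SRC untouched, so DEST can be fed to pdfinfo or JHOVE as a test case.
make_truncated_copy() {
    local src="$1" dest="$2" percent="${3:-50}" size
    size=$(wc -c < "$src")                        # size of the source in bytes
    head -c "$(( size * percent / 100 ))" "$src" > "$dest"
}
```

For example, `make_truncated_copy good.pdf damaged.pdf 70` keeps the first 70% of `good.pdf` (both file names are placeholders). A truncated file loses its trailer, so it is a fairly blunt test case compared with editing individual streams.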

@ECHO OFF
FOR %%f in (*.PDF) DO (
    echo %%f
    REM findstr sets errorlevel 0 only when JHOVE reports "Well-Formed and valid"
    "C:\Program Files (x86)\JHOVE\jhove.bat" -m pdf-hul "%%f" | findstr /C:"Well-Formed and valid" >nul 2>&1
    if not errorlevel 1 (
        echo good
    ) else (
        echo "bad -> try to fix"
        @cpdf -i "%%f" -o "%%f_.pdf" 2>NUL
        REM move "%%f" ".\bak\%%f"
    )
)
@ECHO ON
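
A bash counterpart of the JHOVE batch script might look like the sketch below. It assumes a `jhove` launcher is on the PATH (the install location differs between systems) and reuses the same `-m pdf-hul` option and the same "Well-Formed and valid" status text as the batch version; `jhove_says_valid` and `check_with_jhove` are hypothetical helper names. The repair step with cpdf is left out to keep the sketch short.

```shell
#!/usr/bin/env bash
# jhove_says_valid (hypothetical helper)
# Reads a JHOVE report on stdin; succeeds only if it contains the
# "Well-Formed and valid" status.
jhove_says_valid() {
    grep -q 'Well-Formed and valid'
}

# check_with_jhove [DIR] (hypothetical helper)
# Runs JHOVE's PDF module on every *.pdf in DIR (default: current directory)
# and prints "good: FILE" or "bad: FILE". Assumes `jhove` is on the PATH.
check_with_jhove() {
    local dir="${1:-.}" f
    for f in "$dir"/*.pdf; do
        [ -e "$f" ] || continue   # the glob matched nothing
        if jhove -m pdf-hul "$f" 2>&1 | jhove_says_valid; then
            echo "good: $f"
        else
            echo "bad: $f"
        fi
    done
}
```

Like the batch script, this only looks at one folder, and on Linux the `*.pdf` glob is case-sensitive, so files ending in `.PDF` would need their own pattern.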

Archived:	11 years, 7 months ago
Views:	38775
Last activity:	5 years, 1 month ago