How do I write a shell script to find the number of pages in a PDF?

Question

I am generating PDFs dynamically. How can I check the number of pages in a PDF from a shell script?

Answer 1

Without any extra packages:

foo=$(strings < pdffile.pdf | sed -n 's|.*/Count -\{0,1\}\([0-9]\{1,\}\).*|\1|p' | sort -rn | head -n 1)
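To see what that pipeline actually extracts, here is a minimal sketch that runs the same `sed | sort | head` stage on hand-written dictionary lines instead of a real PDF. The function name `max_count` and the sample input are made up for illustration. The pipeline keeps the largest `/Count` value it sees, so an unrelated `/Count` (for example in an outline dictionary) that is larger than the page count would win.

```shell
# Hypothetical helper: read text on stdin, print the largest /Count value found.
max_count() {
    sed -n 's|.*/Count -\{0,1\}\([0-9]\{1,\}\).*|\1|p' | sort -rn | head -n 1
}

# Made-up sample: a page tree with 3 pages plus an outline dictionary with 7 entries.
# The outline's larger /Count is what gets printed.
printf '%s\n' '<< /Type /Pages /Count 3 >>' '<< /Type /Outlines /Count 7 >>' | max_count
```

With a real file the helper would be fed by `strings < pdffile.pdf`, exactly as in the one-liner above.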


Using pdfinfo:

foo=$(pdfinfo pdffile.pdf | grep Pages | awk '{print $2}')


Using pdftk:

foo=$(pdftk pdffile.pdf dump_data|grep NumberOfPages| awk '{print $2}')


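I have found that the shell-only method is not always reliable. I have single-page PDF files that contain several /Count entries with different numbers. I would recommend the other two methods. (2 upvotes)
You can use grep's \K operator to get the page count without awk. The command to run is `pdfinfo file.pdf | grep -Po 'Pages:[[:space:]]+\K[[:digit:]]+'`. (2 upvotes)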
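One caveat with an unanchored `grep Pages`: it also matches any other line containing the word, such as a `Creator:` line for a file exported from Apple Pages. The sketch below anchors on the first field instead and works without GNU grep's `-P`. The helper name `pages_from_pdfinfo` and the sample text are assumptions for illustration, not real `pdfinfo` output.

```shell
# Hypothetical helper: read `pdfinfo` output on stdin, print the page count.
pages_from_pdfinfo() {
    # Match only the line whose first field is exactly "Pages:".
    awk '$1 == "Pages:" { print $2; exit }'
}

# Made-up sample resembling pdfinfo output for a 12-page file made with Pages.
printf '%s\n' 'Creator:        Pages' 'Pages:          12' | pages_from_pdfinfo
```

In a script this would be used as `foo=$(pdfinfo pdffile.pdf | pages_from_pdfinfo)`.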

Answer 2

np0*_*p0x 8

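ImageMagick provides a tool called identify; combined with counting its output lines, it gets you what you are after... ImageMagick is easy to install on OS X with brew.

Here is a working bash script that captures the count in a shell variable and prints it back to the screen...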

#!/bin/bash
pdfFile=$1
echo "Processing $pdfFile"
# identify prints one line of info per page; send stderr to /dev/null,
# count the lines of output, and trim the whitespace from the wc -l output
numberOfPages=$(/usr/local/bin/identify "$pdfFile" 2>/dev/null | wc -l | tr -d ' ')
echo "The number of pages is: $numberOfPages"


And the output of running it...

$ ./countPages.sh aSampleFile.pdf 
Processing aSampleFile.pdf
The number of pages is: 2
$
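The script above hard-codes `/usr/local/bin/identify`, which is where Homebrew has commonly placed it but is not guaranteed elsewhere. As a rough sketch, the same line-counting idea can be wrapped in a function that looks the tool up in `PATH` first. The function name `count_pages_identify` is hypothetical.

```shell
# Hypothetical helper: count pages by counting the lines `identify` prints.
count_pages_identify() {
    if ! command -v identify >/dev/null 2>&1; then
        echo "identify not found in PATH" >&2
        return 127
    fi
    # One line of output per page; stderr is discarded as in the script above.
    identify "$1" 2>/dev/null | wc -l | tr -d ' '
}
```

Usage would be `numberOfPages=$(count_pages_identify "$pdfFile")`.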


Answer 3

Mar*_*ert 8

Here is a version for direct use on the command line (based on pdfinfo):

for f in *.pdf; do pdfinfo "$f" | grep Pages | awk '{print $2}'; done
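The loop above prints bare numbers, so it is hard to tell which count belongs to which file, and with no PDFs present the unexpanded `*.pdf` pattern is passed to `pdfinfo`. A minimal sketch that addresses both, assuming `pdfinfo` is installed; the function name `list_pdf_pages` and the tab-separated output format are my own choices:

```shell
# Hypothetical helper: print "<pages><TAB><file>" for every PDF in a directory.
list_pdf_pages() {
    dir=${1:-.}
    for f in "$dir"/*.pdf; do
        # If nothing matched, the pattern is left as-is; skip it.
        [ -e "$f" ] || continue
        n=$(pdfinfo "$f" 2>/dev/null | awk '$1 == "Pages:" { print $2; exit }')
        # Print "?" when pdfinfo produced no page count for this file.
        printf '%s\t%s\n' "${n:-?}" "$f"
    done
}
```

Calling `list_pdf_pages` with no argument lists the current directory.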


Answer 4

Lac*_*bus 5

The pdftotext utility converts a PDF file to text, inserting a page break (that is, a form feed character, $'\f') between pages:

NAME
       pdftotext - Portable Document Format (PDF) to text converter.

SYNOPSIS
       pdftotext [options] [PDF-file [text-file]]

DESCRIPTION
       Pdftotext converts Portable Document Format (PDF) files to plain text.

       Pdftotext  reads  the PDF file, PDF-file, and writes a text file, text-file.  If text-file is
       not specified, pdftotext converts file.pdf to file.txt.  If text-file is  '-',  the  text  is
       sent to stdout.


There are many combinations that solve your problem; pick one of them:

1)pdftotext + grep:

$ pdftotext file.pdf - | grep -c $'\f'

2)pdftotext + awk(v1):

$ pdftotext file.pdf - | awk 'BEGIN{n=0} {if(index($0,"\f")){n++}} END{print n}'

3)pdftotext + awk(v2):

$ pdftotext sample.pdf - | awk 'BEGIN{ RS="\f" } END{ print NR }'

4)pdftotext + awk(v3):

$ pdftotext sample.pdf - | awk -v RS="\f" 'END{ print NR }'
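One thing to watch with the line-based variants (1 and 2): they count lines that contain a form feed, so two blank pages in a row, which put two form feeds on one line, are counted once. A small sketch that counts every form feed character instead, assuming as above that pdftotext emits one per page. The helper name `count_form_feeds` and the sample text are made up.

```shell
# Hypothetical helper: count form feed characters arriving on stdin.
count_form_feeds() {
    # Delete everything except form feeds, then count the bytes that remain.
    tr -cd '\f' | wc -c | tr -d ' '
}

# Made-up three-page sample whose second page is blank.
# `grep -c` on the same input would report 2 lines; this prints 3.
printf 'page one\n\f\fpage three\n\f' | count_form_feeds
```

With a real file: `pdftotext file.pdf - | count_form_feeds`.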

Hope this helps!

Answer 5

Gab*_*les 5

Here is a complete hack using pdftoppm, which comes preinstalled on Ubuntu (tested at least on Ubuntu 18.04 and 20.04):

# for a pdf withOUT a password
pdftoppm mypdf.pdf -f 1000000 2>&1 | grep -o '([0-9]*)\.$' \
| grep -o '[0-9]*'

# for a pdf WITH a password which is `1234`
pdftoppm -upw 1234 mypdf.pdf -f 1000000 2>&1 | grep -o '([0-9]*)\.$' \
| grep -o '[0-9]*'


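How does this work? Well, if you specify a first page (`-f`) that is greater than the number of pages in the PDF (I specify page 1000000, which is too large for all known PDFs), it prints the following error to stderr: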

Wrong page range given: the first page (1000000) can not be after the last page (142).

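So I redirect that stderr message to stdout with `2>&1`, pipe it to grep to match the `(142).` part with this regular expression (`([0-9]*)\.$`), and then pipe that to grep again with this regular expression (`[0-9]*`) to pick out just the number, which is 142 in this case. That's it!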
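The two grep stages can also be folded into a single sed substitution that matches the same thing: a number in parentheses followed by the final period. This is only a sketch; the helper name `last_page_from_error` is hypothetical, and the sample line is the message as I understand pdftoppm prints it, so treat its exact wording as an assumption.

```shell
# Hypothetical helper: read pdftoppm's error text on stdin and print the
# number inside the final "(N)." of the line.
last_page_from_error() {
    sed -n 's/.*(\([0-9][0-9]*\))\.$/\1/p'
}

# Sample error line; prints 142.
printf '%s\n' 'Wrong page range given: the first page (1000000) can not be after the last page (142).' \
    | last_page_from_error
```

With a real file: `pdftoppm mypdf.pdf -f 1000000 2>&1 | last_page_from_error`.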

Wrapper functions and speed tests

Here are a couple of wrapper functions to test these:

# get the total number of pages in a PDF; technique 1.
# See this ans here: /sf/answers/1031561541/
# Usage (works on ALL PDFs--whether password-protected or not!):
#       num_pgs="$(getNumPgsInPdf "path/to/mypdf.pdf")"
# SUPER SLOW! Putting `time` just in front of the `strings` cmd shows it takes ~0.200 sec on a 142
# pg PDF!
getNumPgsInPdf() {
    _pdf="$1"

    _num_pgs="$(strings < "$_pdf" | sed -n 's|.*/Count -\{0,1\}\([0-9]\{1,\}\).*|\1|p' \
        | sort -rn | head -n 1)"

    echo "$_num_pgs"
}

# get the total number of pages in a PDF; technique 2.
# See my ans here: /sf/answers/4687430541/
# Usage, where `pw` is some password, if the PDF is password-protected (leave this off for PDFs
# with no password):
#       num_pgs="$(getNumPgsInPdf2 "path/to/mypdf.pdf" "pw")"
# SUPER FAST! Putting `time` just in front of the `pdftoppm` cmd shows it takes ~0.020 sec OR LESS
# on a 142 pg PDF!
getNumPgsInPdf2() {
    _pdf="$1"
    _password="$2"

    if [ -n "$_password" ]; then
        _password="-upw $_password"
    fi

    _num_pgs="$(pdftoppm $_password "$_pdf" -f 1000000 2>&1 | grep -o '([0-9]*)\.$' \
        | grep -o '[0-9]*')"

    echo "$_num_pgs"
}
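One limitation of `getNumPgsInPdf2` as written: `$_password` is expanded unquoted, so a password containing spaces or glob characters would be split or expanded before it reaches pdftoppm. A possible variant, sketched under the assumption that `-upw` takes the password as the next argument (as in the commands above); the name `getNumPgsInPdf3` is hypothetical:

```shell
# Hypothetical variant of getNumPgsInPdf2 that passes the password as one argument.
getNumPgsInPdf3() {
    _pdf="$1"
    _password="$2"

    # ${var:+...} expands to nothing when the password is empty, and to the two
    # words `-upw` and the password (kept intact, spaces included) otherwise.
    pdftoppm ${_password:+-upw "$_password"} "$_pdf" -f 1000000 2>&1 \
        | sed -n 's/.*(\([0-9][0-9]*\))\.$/\1/p'
}
```

Usage mirrors the original: `num_pgs="$(getNumPgsInPdf3 "path/to/mypdf.pdf" "my pass word")"`.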


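Testing them with `time` in front of the commands shows that the `strings` one is very slow, taking ~0.200 sec on a 142 pg PDF, while the `pdftoppm` one is very fast, taking ~0.020 sec or less on the same PDF. The `pdfinfo` technique in Ocaso's answer is also very fast, about the same as the `pdftoppm` one.

See also

These excellent answers by Ocaso Protal.
The functions above will be used in my pdf2searchablepdf project: https://github.com/ElectricRCAircraftGuy/PDF2SearchablePDF.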

Archived:	13 years ago
Views:	10597
Last activity:	7 years, 1 month ago